Biostar Beta. Not for public use.
Assigning subject to query based on coverage from blast output file
0
Entering edit mode
13 months ago
Prakki Rama ♦ 2.3k
Singapore

Dear all,

I have blast result in tabulated format. Each of the query hits multiple subject sequences. From all of those hits, I want to assign that subject sequence which covers the query sequence the most. Does any of you have a way to achieve this. I would really appreciate your ideas and resources.

EDIT: I should have been much clear in my question. I have the following file. The best hit is uti_contig0036, which covers more than "uti_contig7141" in the query.

contig996 uti_contig0036 80.4 903 110 34 7491 8352 2709010 2709886 1.00E-175 625
contig996 uti_contig0036 85.28 197 20 3 7793 7986 6423228 6423418 3.00E-46 195
contig996 uti_contig0036 76.76 327 46 15 7868 8168 6388503 6388181 2.00E-34 156
contig996 uti_contig0036 88.51 87 9 1 8203 8288 855009 855095 6.00E-19 104
contig996 uti_contig0036 93.44 61 3 1 8228 8288 3695407 3695348 2.00E-14 89.8
contig996 uti_contig0036 93.75 48 3 0 23516 23563 2295958 2295911 2.00E-09 73.1
contig996 uti_contig7141 80.22 819 97 35 7481 8280 2616628 2617400 1.00E-154 555
contig996 uti_contig7141 81.26 619 66 20 7721 8320 468207 467620 1.00E-124 455
contig996 uti_contig7141 80.25 486 67 19 7889 8352 3620643 3620165 1.00E-89 339
contig996 uti_contig7141 86.59 246 22 4 7677 7922 863436 863670 3.00E-66 261

So the required output is something like this:

" contig996" "uti_contig7141" "2170" (addition of 819,619,486,246). "contig996" "uti_contig7141" "871" (8352-7481), whereas uti_contig0036 covers only 861 bases in contig996 (8352-7491). But my worry is, If i just add up the alignment lengths, I might over exaggerate the coverage. For example, check the yellow lines, where if I cannot add the alignment length because it is already embedded in other aligned region of the same contig. I hope there is some easy way to do this. Thanks you for your ideas.

blast • 2.2k views
ADD COMMENTlink
3
Entering edit mode
13 months ago
WCIP | Glasgow | UK

Let's try again, if still relevant... The pipeline below merges the query intervals within the same subject, calculate the sum of query intervals (within subject) and picks the query with largest sum of intervals.

awk 'BEGIN{OFS="\t"}{print $1"_vs_"$2, $7, $8}' blast.txt \
| sort -k1,1 -k2,2n \
| mergeBed \
| awk '{print $0 "\t" $3-$2+1}' \
| groupBy -g 1 -c 4 -o sum \
| awk 'sub("_vs_", "\t", $1)' \
| sort -k1,1 -k3,3nr \
| awk '$1 != x {print $0} {x = $1}'

From the example data you posted the non-redundant sum of query intervals is (script above without last line):

contig996    uti_contig0036 908
contig996    uti_contig7141 871
contig996    uti_contig 300

Including the last line the output is contig996 uti_contig0036 908. Does it get closer to what you need?

mergeBed and groupBy come from bedtools

ADD COMMENTlink
0
Entering edit mode

Excellent!! This is what I exactly want. The sum of the lengths of aligned regions of the query ignoring the embedded aligned regions. Fantastic. Thank you very much for your valuable resource.

ADD REPLYlink
0
Entering edit mode

Ok, good. Make sure it is doing the right thing as I haven't tested it. You might have noticed that query and subject names are first joined and then split using "_vs_" as joiner. So if your query or subject names contain this string, things will go wrong. You can check for the presence of this string in your input with something like cut -f1,2 blast.txt | grep '_vs_'(which should return no lines)

ADD REPLYlink
1
Entering edit mode

a small correction after 4 years : awk '{print $0 "\t" $3-$2}' from above should be awk '{print $0 "\t" $3-$2+1}'. Otherwise, the length of all the alignments will be 1bp less.

ADD REPLYlink
0
Entering edit mode

Well spotted! Corrected now.

ADD REPLYlink
1
Entering edit mode
12 months ago
5heikki 8.4k
Finland

Assuming outfmt 6, alignment length is in the 4th column, so:

export LC_ALL=en_US.UTF-8 export LANG=en_US.UTF-8

sort -k1,1 -k4,4gr -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits.txt

This sorts by 1) query, 2) alignment length, 3) bitscore, 4) evalue, 5) %identity, and then after the pipe by unique query.

ADD COMMENTlink
0
Entering edit mode

Once you mention LC_ALL, you might consider setting it to LC_ALL=C as it might make a big difference in speed. E.g. test.txt has 1 million lines:

export LC_ALL=en_US.UTF-8
time sort test.txt > /dev/null

real    0m23.471s
user    0m23.423s
sys    0m0.047s

export LC_ALL=C
time sort test.txt > /dev/null

real    0m0.959s
user    0m0.913s
sys    0m0.046s
ADD REPLYlink
0
Entering edit mode

Provided LC_ALL=C works correctly with scientific notation, this is a good notation :)

ADD REPLYlink
0
Entering edit mode

Thank you for the answer. Sorry for my unclear question. Please check my Edit.

ADD REPLYlink
0
Entering edit mode

I see. Now the problem is more complicated.

"contig996" "uti_contig7141" "2170" (addition of 819,619,486,246)

Is not correct though since those hits overlap: column 7 (query start) & 8 (query end).

ADD REPLYlink
0
Entering edit mode

Exactly!! I cannot even add. Some regions are already embedded in other regions. I just have to ignore them and just see how much of the query is covered.

ADD REPLYlink
1
Entering edit mode
12 months ago
edrezen • 720
France

Hello,

If by coverage you mean the size of the match in the query (column8-column7+1 in the tabular output), you can try the following:

cat blastresult | awk ' BEGIN{query=""; cover=0} {if (query!=$1) {if (query!="") {print align}; query=$1; cover=0;}; if ($8-$7>cover) {cover=$8-$7; align=$0;} } END{print align}' 

It should work as well with gawk if you don't have awk.

ADD COMMENTlink
0
Entering edit mode

Thank you. Apologies for my unclear question. Please check my EDIT.

ADD REPLYlink
1
Entering edit mode
13 months ago
WCIP | Glasgow | UK

What about this, using bedtools groupby:

sort -k1,1 -k2,2 blast.txt \
| groupBy -g 1,2 -c 4 -o sum \
| sort -k1,1 -k3,3nr \
| awk '$1 != x {print $0} {x = $1}'

And since it was mentioned above, set export LC_ALL=C

NB: This takes the sum of the length (column 4) regardless of possible overlap. Is it what you want?

ADD COMMENTlink
0
Entering edit mode

Thank you. But, the problem seems more complicated. Some regions are embedded in other aligned regions. I have to ignore those regions.

ADD REPLYlink
0
Entering edit mode

Check this post

ADD REPLYlink
0
Entering edit mode
5.9 years ago
Philippines

Thank You. but the problem is so very complicated.

Cheapest Car Rental in Cebu

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1