Collecting the 50 most frequent proteins from a tabular BLASTX result
7.5 years ago
Farbod ★ 3.4k

Dear Friends, Hi

I have run a blastx search against the NCBI nr database (using DIAMOND, keeping -max_target_seqs = 1) with outfmt 6.

I want to collect the 50 proteins that occur most frequently in my results.

Is there a command-line script or program for doing this task?

(I have tried cutting out the ID column, opening it in Microsoft Excel, and counting the duplicates, but my Windows computer is not very powerful, so opening such a file and running the duplicate count is very difficult.)

Thank you in advance

blast • 1.6k views

Perhaps this would help (see @Pierre's answer, or the Python scripts if that does not help): Blastp how to find and count duplicates?

Dear genomax2, Hi & thank you.

But I could not work out which Python script is the final, correct one.

Simple: cut -f 1 blast_out.tbl | sort | uniq -c | sort -k1gr | head -50
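For anyone reading along, here is a minimal sketch of the same pipeline wrapped in a function, with each stage commented. The function name `top_hits` and its column and count arguments are made up for illustration. One thing to check on your own file: in the default outfmt 6 layout, column 1 is the query ID and column 2 is the subject (protein) ID, so you may need column 2 rather than column 1, depending on what your file actually contains.

```shell
# top_hits FILE [COLUMN] [N]   (hypothetical helper, not part of any tool)
# Print the N most frequent values found in one tab-separated column,
# most frequent first, as "count value" lines.
top_hits() (
    file=$1; column=${2:-1}; n=${3:-50}
    cut -f "$column" "$file" |   # keep only the chosen ID column
        sort |                   # put identical IDs on adjacent lines
        uniq -c |                # collapse each run into "count ID"
        sort -k1,1nr |           # order by the count, largest first
        head -n "$n"             # keep the top N lines
)
```

Assuming the protein IDs are in column 2, `top_hits blast_out.tbl 2 50` would list the 50 most frequent ones. How IDs with equal counts are ordered relative to each other is left to sort.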

Dear Asef, Hi

It seems that it is magically working!

Thank you

No magic, just simple Unix command-line one-liners.

Dear Asef,

It seems that your script has two sort commands in it. Can we reduce it to just one?

~ Best

Probably not. You can start at the left and keep running the commands, each time adding one more stage of the pipeline, to see why not.

 cut -f 1 blast_out.tbl | less
 cut -f 1 blast_out.tbl | sort | less
 cut -f 1 blast_out.tbl | sort | uniq -c | less

You get the idea.
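To spell out the reason: uniq -c only merges identical lines that are adjacent, so without the first sort an input of A, B, A would be reported as three separate lines with a count of 1 each. The second sort then orders the counts. If you really want a single sort, one possible variant (a sketch, not the pipeline posted above) does the counting in an awk associative array, which does not care about input order. The function name `count_with_awk` and its arguments are hypothetical.

```shell
# count_with_awk FILE [COLUMN] [N]   (hypothetical helper)
# Same result as sort | uniq -c | sort, but with only one sort.
count_with_awk() (
    file=$1; column=${2:-1}; n=${3:-50}
    awk -F '\t' -v col="$column" '
        { count[$col]++ }                               # tally each ID as it appears
        END { for (id in count) print count[id], id }   # emit "count ID" pairs
    ' "$file" |
        sort -k1,1nr |   # the one remaining sort: by count, largest first
        head -n "$n"     # keep the top N lines
)
```

The trade-off is that awk keeps every distinct ID in memory, whereas sort can spill to temporary files on disk. For a single BLAST result file this is usually not a problem, but that is an assumption about your data size.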
