Parsing MEGAN blast2lca output
5.2 years ago
qinglong ▴ 10

Hi there,

I am seeking help from the community for parsing blast2lca output from MEGAN version 6:

blast2lca output (semicolon-separated; I have millions of genes in a txt file; 1001, 10010, 10011, etc. are gene IDs):

1001; ;g__Bacteroides; 100;s__Bacteroides caccae; 21;
10010; ;g__Clostridium; 100;s__Clostridium butyricum; 50;
10011; ;g__Clostridium; 100;s__Clostridium butyricum; 75;
...
...

Here is what I want (cut-off set to 50; stored in a tab-delimited file):

GeneID  Genus        Species
1001    Bacteroides
10010   Clostridium  Clostridium butyricum
10011   Clostridium  Clostridium butyricum
...
...

Could you please provide a command or simple script to do this?

Much appreciated!

Qinglong

lowest common ancestor taxonomic annotation MEGAN6 • 1.6k views

If the input is formatted exactly as in the original post, you can extract the columns like this (no cut-off is applied yet).

With test.txt as input, the command and its output are:

$ awk -v OFS="\t" -F '[;__]' 'NR==1 {print "GeneID", "Genus","Species"};{print $1,$5,$9}' test.txt       

GeneID  Genus   Species
1001    Bacteroides Bacteroides caccae
10010   Clostridium Clostridium butyricum
10011   Clostridium Clostridium butyricum

input:

$ cat test.txt 
1001; ;g__Bacteroides; 100;s__Bacteroides caccae; 21;
10010; ;g__Clostridium; 100;s__Clostridium butyricum; 50;
10011; ;g__Clostridium; 100;s__Clostridium butyricum; 75;
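
To see why `$5` and `$9` hold the names, it can help to print every field awk produces for one line. The sketch below is only an illustration: `show_fields` is a hypothetical helper name, and `[;_]` is used as the separator because it behaves the same as `[;__]` (repeating a character inside a bracket expression has no effect).

```shell
# Hypothetical helper: print every field awk sees when splitting on ';' or '_'.
# Reads the blast2lca lines from standard input.
show_fields() {
  awk -F '[;_]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

# Try it on the first line of test.txt.
printf '%s\n' '1001; ;g__Bacteroides; 100;s__Bacteroides caccae; 21;' | show_fields
```

Each `__` yields an empty field between its two underscores, so the genus lands in field 5, the species in field 9, and the species score in field 10 (with a leading space). Note that a taxon name that itself contains an underscore would be split across several fields with this separator.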

Thanks! But I also need to filter on the confidence score (cut-off: 50). Do you have a command for that?

$ awk -v OFS="\t" -F '[;__]' 'NR==1 {print "GeneID", "Genus", "Species", "Score"}; { if ($10+0 >= 50) print $1,$5,$9,$10+0 }' test.txt
GeneID  Genus   Species Score
10010   Clostridium Clostridium butyricum   50
10011   Clostridium Clostridium butyricum   75
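
The command above filters whole rows, whereas the table requested in the question keeps every gene and only leaves a rank blank when its score is under the cut-off. A minimal sketch of that behaviour follows. It assumes every line has the layout shown in test.txt (an ID followed by name/score pairs, with `g__` and `s__` prefixes); `lca_to_tsv` and `genes.tsv` are hypothetical names, not part of MEGAN.

```shell
# Hypothetical helper: lca_to_tsv FILE [CUTOFF]
# Assumes lines like: id; ;g__Genus; score;s__Species; score;
lca_to_tsv() {
  awk -F';' -v OFS='\t' -v cutoff="${2:-50}" '
    BEGIN { print "GeneID", "Genus", "Species" }
    NF > 1 {
      genus = ""; species = ""
      # After the ID, each rank name is followed by its confidence score.
      for (i = 2; i < NF; i++) {
        name = $i
        sub(/^[ \t]+/, "", name)            # drop leading blanks
        score = $(i + 1) + 0                # force a numeric comparison
        if (name ~ /^g__/ && score >= cutoff + 0) genus = substr(name, 4)
        if (name ~ /^s__/ && score >= cutoff + 0) species = substr(name, 4)
      }
      print $1, genus, species
    }' "$1"
}
```

Usage would look like `lca_to_tsv test.txt 50 > genes.tsv`. Splitting on `;` alone and stripping the three-character prefix with `substr` avoids breaking names that contain underscores.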