Scripting solution to generate a list of KEGG ORTHOLOGY (KO) terms from a tab-delimited annotation file
6.6 years ago
jvire1 ▴ 10

Does anyone know a basic scripting approach (perhaps awk or Python) for extracting KEGG Orthology (KO) terms from a tab-delimited annotation file?

The file in question has rows that look like this:

TRINITY_DN18877_c0_g1_i1    KEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995
TRINITY_DN6301_c0_g1_i1     KEGG:zma:103647201`KO:K10798
TRINITY_DN12892_c3_g5_i1    KEGG:zma:103643875
TRINITY_DN13158_c1_g2_i35   KEGG:vvi:100249085`KO:K02435

Ultimately, I need to extract the transcript ID from column one and the KO terms from column two, like this:

TRINITY_DN6301_c0_g1_i1     K10798

The end goal is to use the list with KEGG Mapper (http://www.kegg.jp/kegg/tool/map_pathway.html) to see what KEGG pathways are present and most abundant in my transcriptome assembly.

RNA-Seq • 2.1k views
6.6 years ago
awk '{n=split($2,a,/`/);for(i=1;i<=n;++i) if(substr(a[i],1,3)=="KO:") printf("%s %s\n",$1,substr(a[i],4));}' input.txt
TRINITY_DN18877_c0_g1_i1 K02995
TRINITY_DN6301_c0_g1_i1 K10798
TRINITY_DN13158_c1_g2_i35 K02435

Thank you! Worked like a charm.

-James

6.6 years ago
Sparrow_kop ▴ 260

In Python, assuming the columns are tab-delimited:

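# Assumes tab-separated columns, with terms in column two separated by backticks
with open('your_file', 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        for term in fields[1].split('`'):
            if term.startswith('KO:'):  # also works when KO: is not the last term
                print(fields[0] + '\t' + term[3:])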
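
Building on the snippets above, here is a minimal sketch that wraps the same logic in a function and also tallies how often each KO term occurs, which may be a useful sanity check before pasting the list into KEGG Mapper. The names `extract_ko_terms` and `SAMPLE_ROWS` are just illustrative, and the backtick separator in column two is an assumption taken from the sample rows in the question. It splits columns on any whitespace (like the awk answer), so it should cope with either tabs or spaces.

```python
from collections import Counter

# Sample rows copied from the question (columns joined by a tab here).
SAMPLE_ROWS = [
    "TRINITY_DN18877_c0_g1_i1\tKEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995",
    "TRINITY_DN6301_c0_g1_i1\tKEGG:zma:103647201`KO:K10798",
    "TRINITY_DN12892_c3_g5_i1\tKEGG:zma:103643875",
    "TRINITY_DN13158_c1_g2_i35\tKEGG:vvi:100249085`KO:K02435",
]


def extract_ko_terms(lines):
    """Yield (transcript_id, ko_term) pairs from annotation lines.

    Assumes two whitespace-separated columns, with the terms in column
    two joined by backticks, as in the sample rows from the question.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        transcript_id, annotations = fields[0], fields[1]
        for term in annotations.split("`"):
            if term.startswith("KO:"):
                yield transcript_id, term[len("KO:"):]


if __name__ == "__main__":
    pairs = list(extract_ko_terms(SAMPLE_ROWS))
    for transcript_id, ko in pairs:
        print(f"{transcript_id}\t{ko}")

    # Tally how many transcripts carry each KO term.
    ko_counts = Counter(ko for _, ko in pairs)
    print(ko_counts.most_common(3))
```

For a real file, the open file handle can be passed in directly, e.g. `with open('your_file') as f: pairs = list(extract_ko_terms(f))`. A transcript with more than one KO term produces one output line per term.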