Is there any script for retrieving GO entries from PFAM GO entries
2
0
Entering edit mode
9.8 years ago
vahapel ▴ 210

Dear All,

Recently, I did de novo RNA-seq experiment for a non-model fish species and assembled transcripts for functional annotations. Before GO annotation, I blasted my assembled transcripts (1e-10) to InterProScan 5 and got PFAM IDs (File 1). Then, I downloaded GO annotations for each PFAM entries (from here: http://www.geneontology.org/page/download-mappings) and got PFAM IDs with corresponding GO mappings (File 2). I want to compare these two files and save the PFAM IDs with GO annotations for my file 1 entries:

File 1: (my file for analyses, as you notice that PF00015 and PF00016 are not absent in my file)

PF00001
PF00002
PF00003
PF00004
PF00005
PF00006
PF00007
PF00008
PF00009
PF00010
PF00011
PF00012
PF00013
PF00014
PF00017
PF00018

File 2: (downloaded file from the database PFAM IDs with GO annotations)

PF00001    GO:0004930
PF00001    GO:0007186
PF00001    GO:0016021
PF00002    GO:0004930
PF00002    GO:0007186
PF00002    GO:0016021
PF00003    GO:0004930
PF00003    GO:0007186
PF00003    GO:0016021
PF00004    GO:0005524
PF00005    GO:0005524
PF00005    GO:0016887
PF00006    GO:0005524
PF00008    GO:0005515
PF00009    GO:0003924
PF00009    GO:0005525
PF00010    GO:0046983
PF00013    GO:0003723
PF00014    GO:0004867
PF00015    GO:0004871
PF00015    GO:0007165
PF00015    GO:0016020
PF00016    GO:0000287
PF00017    GO:0005515
PF00018    GO:0005515

My question is that is there any perl or phyton script for comparing these two files and generating an output including File 1 PFAM IDs with GO annotations.

Thank you all in advance

Best regards

-vahap-

rna-seq • 3.1k views
ADD COMMENT
1
Entering edit mode
9.8 years ago
pld 5.1k

Here's an approach:

import sys
import csv

def add_safe(dict, key, val):
    try:
        dict[key].append(val)

    except KeyError:
        dict[key] = [val]

def __main__():
    infi  = sys.argv[1]
    mapfi = sys.argv[2]
    outfi = sys.argv[3]

    with open(infi, 'rb') as fi:
        raw_pfams = [x[0] for x in list(csv.reader(fi, delimiter='\t'))]

    with open(mapfi, 'rb') as fi:
        raw_map = list(csv.reader(fi, delimiter='\t'))[:-1]

    id_map = dict()
    for id in raw_map:
        add_safe(id_map, id[0], id[1])

    with open(outfi, 'w') as fi:
        for id in raw_pfams:
            try:
                fi.write(id + '\t' + ';'.join(id_map[id]) + '\n')
            except KeyError:
                print 'ERROR: {} not found in reference map!'.format(id)

__main__()
ADD COMMENT
0
Entering edit mode

Dear joe.cornish826,

Many thanks for the prompt reply. But, unfortunately, I could not execute this script :( , because I am newbie ! Do you give me a bit additional info for how to execute it ? Many thanks again.

Best regards,

-vahap-

ADD REPLY
0
Entering edit mode

Usage is as follows, assuming you save the script as pfam-to-go.py

python pfam-to-go.py <file 1 from your post> <file 2 from your post> <desired output file>
ADD REPLY
1
Entering edit mode
9.8 years ago
SES 8.6k

You should be able to do this with a simple join:

join <(sort -nk1.3,1.7 file1) <(sort -nk1.3,1.7 file2)

The sort is just to be safe, and perhaps this command could even be shortened/improved. Note that you can get at these mappings pretty easy with HMMER2GO without developing custom scripts/wrappers. Assuming you have a file of sequences to search for domains, you can find matches with:

hmmer2go search -i genes_orfs.faa -d Pfam-A.hmm

and then you can map GO terms to your matches:

hmmer2go mapterms -i genes_orfs_Pfam-A.tblout -p pfam2go -o genes_orfs_Pfam-A_GO.tsv

I didn't provide much description about what is going on with those commands, but hopefully that helps (feel free to ask questions!).

ADD COMMENT

Login before adding your answer.

Traffic: 2758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6