Extracting Gene Ids From Protein Ids
12.3 years ago
Assa Yeroslaviz ★ 1.8k

Hi,

I have a tab-delimited table of protein IDs that looks like this:

45    FBpp0070037    
46    FBpp0070039;FBpp0070040    
47    FBpp0070041;FBpp0070042;FBpp0070043    
48    FBpp0070044;FBpp0110571    
...

For each of these protein IDs I would like to extract the gene ID (FBgn...) into a third column. The output table should look like this:

45    FBpp0070037                          FBgn001234  
46    FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345   
47    FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527   
48    FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183   
...

I was thinking of using biomaRt, but I could not find a way of automating it for all the protein IDs in a line.

I would appreciate your ideas.

Thanks A.

r conversion biomart • 3.2k views
12.3 years ago
Rm 8.3k

Flybase: FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)

input_file.txt:

FBpp0070037    
FBpp0070039;FBpp0070040    
FBpp0070041;FBpp0070042;FBpp0070043    
FBpp0070044;FBpp0110571

while read -r LINE; do
    fbpp="$(echo "$LINE" | cut -d';' -f1)"   # only the first protein ID on the line is looked up
    gene="$(grep -w "$fbpp" fbgn_fbtr_fbpp_fb_2011_10.tsv | awk 'BEGIN {OFS = FS = "\t"} {print $1}')"
    printf '%s\t%s\n' "$LINE" "$gene"
done < input_file.txt > out_fbpp2fbgn.txt

out_fbpp2fbgn.txt:

FBpp0070037     FBgn0010215
FBpp0070039;FBpp0070040 FBgn0052230
FBpp0070041;FBpp0070042;FBpp0070043     FBgn0000258
FBpp0070044;FBpp0110571 FBgn0053217

ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2011_10.tsv.gz


The file is good, but not exactly what I was looking for. Thanks.


Can you be more specific?


My problem is not getting the data from biomaRt, but getting it while keeping the structure of the table. If I run the column as one ID per line, it will then be difficult to bring the gene IDs back to their right protein IDs.
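
One way to keep the table structure is to flatten the IDs only for the lookup and then rebuild each row from the flat result. The following is a minimal sketch of that idea, not a biomaRt call: the function names `flatten_ids` and `regroup` are hypothetical, and the `protein_to_gene` dictionary stands in for whatever a batch query returns (its gene IDs are placeholders, not real FlyBase data).

```python
def flatten_ids(rows):
    """Collect the unique protein IDs from all rows, in first-seen order."""
    seen = {}
    for _, id_field in rows:
        for pid in id_field.split(";"):
            seen.setdefault(pid, None)
    return list(seen)


def regroup(rows, protein_to_gene, missing="NA"):
    """Rebuild one output row per input row from a flat protein -> gene mapping."""
    out = []
    for row_no, id_field in rows:
        genes = [protein_to_gene.get(pid, missing) for pid in id_field.split(";")]
        out.append((row_no, id_field, ";".join(genes)))
    return out


# Rows as (row number, ";"-joined protein IDs), as in the question's table
rows = [("45", "FBpp0070037"), ("46", "FBpp0070039;FBpp0070040")]

# Hypothetical lookup result; these gene IDs are placeholders, not real FlyBase data
protein_to_gene = {"FBpp0070037": "GENE_A", "FBpp0070039": "GENE_B"}

print(flatten_ids(rows))  # the flat list to send to the lookup service
for row in regroup(rows, protein_to_gene):
    print("\t".join(row))
```

Because each row keeps its original `;`-joined protein field, the gene IDs come back in the same order as the protein IDs, and an ID with no match is marked rather than silently dropped.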


See the edited answer, which now includes your request.

12.2 years ago
scapella ▴ 390

Hi,

I'd suggest using a Python script to do the parsing (see below). The code works according to what you have said, and it removes duplicate IDs on the same line.

Hope this can help you.

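import sys

inFile = sys.argv[1]  # tab-delimited table: row number, then ";"-separated IDs

with open(inFile) as handle:
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines

        ids = [singleID.strip() for singleID in fields[1].split(";")]
        # Gene IDs start with "FBgn"; the sets drop duplicate IDs on the same line
        genes = set(singleID for singleID in ids if singleID.startswith("FBgn"))
        others = set(ids) - genes

        print("%s\t%s\t%s" % (fields[0], ";".join(sorted(others)), ";".join(sorted(genes))))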
12.2 years ago

You can download the index file that Rm suggested. It has three columns, where the first column is the gene ID and the third column is the protein ID. Then use a Python script that reads the index file to translate your protein ID list. Something like this:

import sys

# Build a protein ID -> gene ID lookup from the index file
index = {}
with open(sys.argv[1]) as indexFile:
    for line in indexFile:
        if line.strip() and not line.startswith("#"):
            data = line.strip().split('\t')
            if len(data) > 2:
                index[data[2]] = data[0]

# Translate every ";"-separated protein ID, one output line per input line
with open(sys.argv[2]) as pidFile:
    for line in pidFile:
        data = line.strip().split(';')
        output = ';'.join(index[item] for item in data)
        print(line.strip() + "\t" + output)

Save as myScript.py. Use by: python myScript.py indexFile proteinIDFile
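
The table in the question also has a row number in its first column, and some protein IDs may be absent from the index. The following is a rough sketch of how the same lookup could handle both cases. The function names `load_index` and `add_gene_column` and the `"NA"` marker are my own choices, and the in-memory sample data is made up so the example runs without the FlyBase file.

```python
import io


def load_index(handle):
    """Map protein ID -> gene ID (gene ID in column 1, protein ID in column 3)."""
    index = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2 and cols[2]:
            index[cols[2]] = cols[0]
    return index


def add_gene_column(handle, index, missing="NA"):
    """Yield 'row<TAB>proteins<TAB>genes' for each 'row<TAB>proteins' input line."""
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        row_no, proteins = fields[0], fields[1]
        genes = [index.get(pid, missing) for pid in proteins.split(";")]
        yield "\t".join([row_no, proteins, ";".join(genes)])


# Tiny in-memory stand-ins for the two files; all IDs below are made up
index_text = "## comment line\nGENE_A\tTR_1\tPROT_1\nGENE_A\tTR_2\tPROT_2\nGENE_B\tTR_3\n"
table_text = "45\tPROT_1\t\n46\tPROT_1;PROT_2;PROT_9\t\n"

index = load_index(io.StringIO(index_text))
for out_line in add_gene_column(io.StringIO(table_text), index):
    print(out_line)
```

With real data, the `io.StringIO` objects would be replaced by opened files, for example `gzip.open(path, "rt")` for the compressed index. Using `index.get` with a marker avoids a `KeyError` when a protein ID is not in the index.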
