Question

Please help with connecting the part of the header in a fasta file with another text file

0

Entering edit mode

7.8 years ago

seta ★ 1.9k

Hi everybody,

Sorry if this question is not relevant to this forum, however, I don't know anywhere to ask it. I have a fasta file containing thousands of sequences that their headers are like below:

Gene.1019::c44525_g4_i4::g.1019::m.1019 Gene.1019::c44525_g4_i4::g.1019  ORF type:complete len:339 (+) c44525_g4_i4:48-1064(+)

and in the another text file, there are two columns with hundred rows like below: (In this text file, just the part, not total, of headers from the fasta file exist)

Gene.1019  negative

I want to have the text file with the below information:

Gene.1019::c44525_g4_i4 negative

I'm familiar with Linux, but such a task for me as a biologist isn't easy. Could you please advise me with your helpful commands or tools.

Thanks in advance

programming fasta • 2.6k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 7.8 years ago by seta ★ 1.9k

0

Entering edit mode

You can use any scripting language like python here. Step1 : Open fasta file Step2 : read fasta sequence and then store Gene.XXX as a key and fasta sequence as a value of that gene in a dictionary. Step3 : Open 2nd file and read it. Step4 : Open the dictionary and see any key of that fasta sequence match with any row of 1st column, if yes then grab 2nd column and print what you want. Even you can do without making dictionary by just reading two files same time.

ADD REPLY • link 7.8 years ago by Pallab Bhowmick ▴ 20

0

Entering edit mode

I would suggest using a dictionary, definitely most straightforward.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

While pseduocode is great @seta is looking for a solution that can be used right away :-)

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Do you have any experience with (Bio)Python? Did I understand correctly that you want to keep the part of the identifier up to the second "::" and the part before the first "::" needs to match the field in the second file?

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Sorry, I'm learning to work in this field and have no previous experience. In the example, I just have Gene.1019::c44525_g4_i4 as the identifier. Your straightforward help would be highly appreciated.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

0

Entering edit mode

Could you give me a few more examples? I'll write the script tomorrow morning (CEST).

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Sure. Here is an example of the fasta file:

>Gene.1019::c44525_g4_i4::g.1019::m.1019 Gene.1019::c44525_g4_i4::g.1019  ORF type:complete len:339 (+) c44525_g4_i4:48-1064(+)
MQRVFRTSKLLQYSFANRNSLPSSVTSPLLSNTNCVSTTSSLSSRFTGSKAPHERIPSLI
RNMPDSTSSLPLEKPVVLESNEVQHVELLNAVDDLHGGVTVDMEKPMDSKLFASALKSSL
SYWRKKGKKGVWIKLPIGFANLVEPAVQQGFRYHHAEPDYLMLVHWIPETPDTLPENASH
RVGVGAFVMNHKREVLVVQESSGKFRGTGVWKLPTGVVNEVDAEFVEVLAFRQSHQSFFR
>Gene.1022::c44533_g1_i1::g.1022::m.1022 Gene.1022::c44533_g1_i1::g.1022  ORF type:5prime_partial len:434 (+) c44533_g1_i1:2-1303(+)
IMEIPLCYRFDPTDEEIILHYLYNKTHGFPLSSSSAIIECDLYDETDAWKNFFKATRETT
LYFFTKLRKRSEKGSRVERATGSGTWRTQGDEHLYNGGDRRQHIGYKRSLTFIPKKTEEK
GKWVMTEYRLHDSLLRNKKSDLVVCRIKKKIIEEETMRNNNNSGNFVGGELIVSGFHQGM
NMGAPEVENNDYIGSNQNQERYCYPYDFGGNYNVCSENPMSGELKVPDGLQEVGMMEALE
>Gene.1128::c45331_g1_i3::g.1128::m.1128 Gene.1128::c45331_g1_i3::g.1128  ORF type:complete len:359 (+) c45331_g1_i3:289-1365(+)
MGFLQTEESKRMEIEELRRMLVACAGVNGRKDDEEFRGLRVRMDDDERRPDEVVCVTSGV
SFLGLAIVNRLLLRGYSVRIIVDNQEDIEKLREMESSGEMRTYNNNLSVVTAKLTEVEDV
VEAFEGCRGIFHTSAFVDPVGLSGYSKCMAEIEVKATESVMKACARTSSVKNCVLTSSLL

and an another text file is something like below:

Gene.1019   negative
Gene.1022   positive
Gene.1128   positive

I would like to have identifiers as following:

Gene.1019::c44525_g4_i4 negative
Gene.1022::c44533_g1_i1 positive
Gene.1128::c45331_g1_i3 positive

Thank you for your help in advance.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

2

Entering edit mode

7.8 years ago

5heikki 11k

Late to the party..

join -1 1 -2 1 -t $'\t' \
    <(awk 'BEGIN{FS="::";OFS="\t"}{if(/^>/){print substr($1,2),$2}}' file.fasta | sort -k1,1 -t $'\t') \
    <(sort -k1,1 -t $'\t' file.txt) \
    | awk 'BEGIN{OFS=FS="\t"}{print $1"::"$2,$3}' \
    > output

For big files one should add to the sort commands (if supported and enough cores): use up to 90% of RAM and 8 threads. This is for GNU sort. I'm not sure if BSD sort (e.g. in OS X) supports this. man sort to check.

-S 90% --parallel=8

awk learning curve isn't really that steep. I highly recommend it to anyone who has to deal with large text files..

ADD COMMENT • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

~~Sure this is correct? Not working for me with the small example above. No errors either.~~ @seta says it works so I am not going to worry.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Wow, looks fancy. Perhaps your solution is faster but I'd prefer python every day of the week over this :D then I don't need to be awake will coding.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks, but it's not that fancy. I wrote it as a one-liner while thinking through the problem. @genomax2, maybe you M-w C-y'd only to the pipe on the second line leaving the process substitution open?

ADD REPLY • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

Thank you, it worked well for me.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

0

Entering edit mode

7.8 years ago

WouterDeCoster 47k

This should do the trick:

Code removed, see other answer. Answer was incomplete.

Save e.g. as connectingidentifiers.py, and execute as python input.fa identifiersfile.txt > results.fa

Note:

Requires Biopython to be installed (pip install biopython)
Probably contains python (>=2.7) synthax (which can be changed if really can't be upgraded on your system)

Let me know if it doesn't work as it should!

Edit: this produces an error (explicitly) when the identifier is not present in the dictionary. But this wasn't the desired performance.

ADD COMMENT • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you, WouterDeCoster. I run it, but it gave me the error of "Aborted: Could not find Gene.4 in file of identifiers!". As I mentioned in my post, part of (not total) the headers of fasta file exist in the text file and it caused the error. Sorry, I tried to modify the code to solve the error, but I couldn't.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

0

Entering edit mode

Right, didn't catch that. I just wanted to make sure that unexpected results were gained by explicitly mentioning that error in the script. But okay :p I'll make sure it doesn't produce the error then and just writes the identifier. In addition, I've added excessive comments to my code to explain the lines.

code see other answer

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Add the correct code as a new answer (once @seta confirms that it works) so the old one can be deleted avoiding any confusion.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Will do as soon as confirmed.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

sorry, but it gave another error that "Traceback (most recent call last): File "connect.py", line 17, in <module> seq_record.id = id2 NameError: name 'id2' is not defined".

Unfortunately,although I searched on net, but I didn't understand what it means.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

1

Entering edit mode

Right, my bad. Didn't test my update. I made a rookie mistake here ;)

#Change identifiers of a fasta file (argument 1) based on additional file (argument 2)
#wdecoster

import sys
from Bio import SeqIO

with open(sys.argv[2]) as identifiersf:
    #Create a dictionary with key the first field of the file and value the second field, remove empty lines (dictionary comprehension)
    iddict = { line.split('\t')[0] : line.strip().split('\t')[1] for line in identifiersf.readlines() if not line == "" }

for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    id2 = '::'.join(seq_record.id.split('::')[:2]) #This is the part we want to keep
    seq_record.description = "" #The description of the fasta identifier is removed
    try: #use try-except because not all keys are necessarily present in the dictionary
        id1 = iddict[seq_record.id.split('::')[0]] #The key is the first part by splitting on '::'
        seq_record.id = id2 + ' ' + id1 #Add the two strings, separated by space
    except KeyError: #when not present, only the identifier is used
        seq_record.id = id2
    print(seq_record.format("fasta").strip()) #Avoid that newlines are printed by using .strip() on format("fasta")

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Finally, your last code worked well. Thank you for your help friend.

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

0

Entering edit mode

Okay, I've submitted this as a final answer to this thread. Good luck with the rest of your analysis and my apologies for the mistakes which kept you waiting!

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

Entering edit mode

OK, thank you and no problem. I accepted the answer

ADD REPLY • link 7.8 years ago by seta ★ 1.9k

score 2 · Accepted Answer · 2016-07-08

Here is then the final answer :-D

#Change identifiers of a fasta file (argument 1) based on additional file (argument 2)
#wdecoster

import sys
from Bio import SeqIO

with open(sys.argv[2]) as identifiersf:
    #Create a dictionary with key the first field of the file and value the second field, remove empty lines (dictionary comprehension)
    iddict = { line.split('\t')[0] : line.strip().split('\t')[1] for line in identifiersf.readlines() if not line == "" }

for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    id2 = '::'.join(seq_record.id.split('::')[:2]) #This is the part we want to keep
    seq_record.description = "" #The description of the fasta identifier is removed
    try: #use try-except because not all keys are necessarily present in the dictionary
        id1 = iddict[seq_record.id.split('::')[0]] #The key is the first part by splitting on '::'
        seq_record.id = id2 + ' ' + id1 #Add the two strings, separated by space
    except KeyError: #when not present, only the identifier is used
        seq_record.id = id2
    print(seq_record.format("fasta").strip()) #Avoid that newlines are printed by using .strip() on format("fasta")