Challenge: Convert GenBank to Fasta without bioperl, without emboss, or any other dependencies
2
0
Entering edit mode
8.5 years ago
Lee Katz ★ 3.1k

Here is a possibly difficult challenge. Or maybe not :)

How would I convert a GenBank file to a Fasta file with only the basic Linux tools at my disposal? These would be Perl 5.10, Bash, Awk, Python 2.6, etc

genbank fasta • 5.8k views
ADD COMMENT
2
Entering edit mode
ADD COMMENT
0
Entering edit mode

Thank you! I could not actually find this script with my google-fu!

ADD REPLY
0
Entering edit mode

Doesn't work though :-/. Looks like it was hard coded to work with small genbank files, not larger ones like chromosomes.

ADD REPLY
1
Entering edit mode
8.5 years ago
SES 8.6k

Challenge accepted. Here is the most minimal example I could come up with:

It runs with any Perl (I tested 5.6 and 5.20) and it's pretty fast too. Well, it is as fast as the sed script and much faster than bioperl (7 sec. for human chr 1 vs. >90 sec. with Bio::SeqIO). I should note that the sed script doesn't parse all GenBank files correctly (like long sequences), and it won't warn if there is no sequence for the record. I guess the lesson there is that it usually isn't worth saving a few seconds of run time if you have to debug later. For a one-off, I might do this but it was mostly for fun while I wait for a job to finish. For a pipeline that is important, and where you need more than just the sequence, I would probably use bio* something to handle all the corner cases.

Usage:

perl biostars63769.pl < sequence.gb > sequence.fas

And, this was against the rules, but I have to say, bioperl is tough to beat (though the above will obviously be way faster):

perl -MBio::SeqIO -e '$seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank'); $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta'); while ($seqobj = $seqin->next_seq) { $seqout->write_seq($seqobj) }' < sequence.gb > sequence.fas
ADD COMMENT
0
Entering edit mode

Looks good! But just visually, it looks like it doesn't handle multifasta files. Probably just a small fix?

ADD REPLY
0
Entering edit mode

Maybe this is a good fix?

ADD REPLY
0
Entering edit mode

Do you mean multiple sequences per genbank file/record? It looks like it would be easy to add, as you did, but this is one reason I would probably use a Bio* library in practice because messing with genbank files can get complicated given that there are so many variations. This change is not complicated but it is hard to write a simple parser that works on all genbank files.

ADD REPLY
0
Entering edit mode

Exactly. Most of my genbank files have multiple entries.

ADD REPLY

Login before adding your answer.

Traffic: 2671 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6