Extract sequence number "X" from multifasta file
4
1
Entering edit mode
9.4 years ago
user230613 ▴ 360

Hi there,
I know there are different ways to extract some "wanted" sequences from multifasta file. In general, having one or more IDs, you can retrieve the sequence from multifasta file using bioperl script or another kind of software as ones pointed out here (Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences). In my case I know that I want the sequence number 1, 5 and 10 from a multifasta file. Any idea how can I achieve this using perl/bioperl preferably?

Thanks in advance.
P.D: I'm a beginner in bioinformatics word, sorry if it is a very basic question.

fasta • 3.7k views
ADD COMMENT
3
Entering edit mode
9.4 years ago
SES 8.6k

Here is a BioPerl example since you asked:

I put this in a script so it is easier to read, but in practice, I would just do this at the command line with a one-liner (that's how I wrote it):

perl -MBio::SeqIO -M'List::MoreUtils qw(any)' -E '@ids = (1, 5, 10); $num = 0; $seqio = Bio::SeqIO->new(-fh => \*STDIN); while ($seq = $seqio->next_seq) { $num++; say join "\n", ">".$seq->id, $seq->seq if any { $_ == $num } @ids; }' < seqs.fas

That will save you from opening an editor. It's not as readable but it works the same. :)

ADD COMMENT
0
Entering edit mode

Thank you a lot SES. One last question, could you explain me a little bit the meaning of this last two lines?:

say join "\n", ">".$seq->id, $seq->seq
if any { $_ == $num } @ids;

And another thing, using "say" is possible to print to a file like with "print?

Thanks again.

ADD REPLY
1
Entering edit mode

say was added in perl version 5.10, it just adds a newline to your print, so it saves you some typing. If you have something as old as 5.10, you can enable it with use feature 'say'; or you can enable all new features at the command line with -E.

join should be self-explanatory, it just adds a newline after the sequence ID and the sequence itself. In Perl, statements are terminated by a semi-colon, so that last statement is one line. That style is called post-fix, with the conditional at the end, and this makes the code much more readable (esp. when put on separate lines) by removing unnecessary braces. That statement should be easy to read, it says, "print the record if any IDs match those in the list." And that sentence is almost working code! That is a good reason to use the functions from List::Util or List::MoreUtils, they are very expressive and they make the code easy to read (and they are fast, as they are all written in C).

Note that I imported the any function from the package List::MoreUtils. All of the functions in that package are now part of List::Utils, which is shipped with Perl (the lastest version includes these functions). I used the List::MoreUtils package so the code will work with any version of Perl, not just the latest. I think it will simplify things going forward now that all the functions are in one package and in core Perl.

ADD REPLY
1
Entering edit mode
9.4 years ago

There is a program in BBTools which will do that:

getreads.sh in=data.fasta out=filtered.fasta id=1,5,10
ADD COMMENT
1
Entering edit mode
9.4 years ago

I want the sequence number 1, 5 and 10 from a multifasta file

awk '/^>/ {N++;} {if(N==1 || N==5 ||N==10) print;}' in.fa
ADD COMMENT
0
Entering edit mode

I might be mistaken, but I think this only prints the lines containing the sequence identifiers, correct?

ADD REPLY
0
Entering edit mode

No, there is no 'next' statement in

{N++; }

so awk continues to scan the matching patterns. It would only print the headers with

{N++; next;}
ADD REPLY
1
Entering edit mode

Thank you Pierre. I'm still impressed by how concise most awk scripts can be.

ADD REPLY
1
Entering edit mode
9.4 years ago

I know you are looking for a Perl solution, but you could use my pyfaidx module for this. In the latest version I've added numeric indexing, so you can do:

ADD COMMENT

Login before adding your answer.

Traffic: 2132 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6