Extract Multiple Fasta Sequences From A File Containing Many Sequences In Lines And Rows
4
2
Entering edit mode
12.8 years ago
Chris ▴ 30

Hi, I have this problem. I have a fasta file containing the whole genome of a arabidopsis plant. I also have a list of genes I need to extract the sequences from the fasta file. Each line contains 2 or 3 or 4 sequences. I need to create a fasta file for each of the sequences in the each line. Any help? I visualize it for simplicity.

sequences I need to get fast for:

AT1G63760    AT1G05890    AT2G31510
AT3G30122    AT1G08410    AT2G27200
AT1G56390    AT1G09210    
AT1G18191    
AT3G32445    AT1G18500    AT1G74040
AT2G34760    AT1G20010    AT1G75780    AT1G74040
AT1G60480    AT1G23490    AT1G70490
AT4G10414    AT1G33230    AT4G10430
AT4G06748    AT1G52410    AT3G15950
AT2G06822    AT1G52930    AT3G15460
AT3G43820    AT1G62810    AT4G12290
AT2G11280    AT1G63640    AT5G41310
AT2G15120    AT2G15090    AT4G34250

>AT1G63760
ATGATGGATTCCGATGATGATATGCTCGATGCCCACGATATGGACTCGGTAGATTATGATTTTGACAGCGGCGGCACCGATGATGACAAC

>AT1G05890
GCAAAAGTCCATATATCTTTCTCTCATCTACTCGATTTCTGAATCGCGAACGCAGCGAGC
GATCCGGAATCGAGAGAGAGAGCTACCAATTTTTCCAACTTGTTCGGTTCCTTATAAAGC
TGTTTACTTTCCCTTGCGAATTCTCTCTCTCTCTACTATAAAGCAATACCCTTTTCTTTC
TTTTTTGGTAACCCCCAAACCCTACTTTGTCCAGCGAGAAAGGAAGAGGGGTTTCGTCTG

>AT2G31510
ATGGATTCTGAAGAAGACATGCTCGATGCGCACGATATGGAGTCTGGAGAGGATGATTTC
TACAGCGGTGGAACTGATGATTGTAATGATAGTGATGATGGTGAACCTGATTATGGGTTT
GTTGAGGAAGATGCTGATGATTCTGCTATGATCGCCTCTCATCGCTCTCAGGTGGGTTTT
TGTTTTTGTTTCTTAATTCATTTTGGTGTTGGTTGTTGAAGCTAATTGCTTTTAGAACTC

>AT1G18191
AAAATTGAAGACGAAGAGTTTAAGACTCTATCAGATCAGAAGGTCTCTATTTTCTCGAAACGTCGGCTTTAGAGAACCAAAACGTC

>AT1G56390
GCAATGGTGAAACTAGACTCTAAACTCATCTCTATGATTGTTTTCGGTATCGTGGTAATCGTCTCTGCTG

>AT1G09210
GCTAGCTCCTCTCCTCGCGGTATATATAAGCTCCAGGTCTTGTACATCTTCATCATCTGA
TCTCGGGGAAGCTCCGATCTGAGTTTTTTTTAGCAATGGCGAAAATGATTCCTAGCCTCG
TCTCTCTAATTCTTATCGGTCTTGTTGCGATCGCCTCCGCCGCAGTTATTTTCGAGGAGC
GCTTTGATGGTATCTAATTTCTACATCTCTATCTCTATACTCTATCTTCCTGATGATGCG

>AT2G34760
ATGCGTAAACAAAGTTTTAAGATAGTTATGATTCGTTTTTTGAGAGTCAATAACAAAAATTATCCGATC

>AT1G20010
AACAGAGACAATTGGTTATATTAGCTGTCACTCCCATCTTTCATATTCCTTCACCATCTC
TCTCTCTCTCGATCTTGTGAACCACTACACACACTAACACAATGAGAGAGATCCTTCACA
TTCAAGGTGGTCAATGCGGTAACCAGATTGGTTCCAAGTTCTGGGAAGTCATCTGCGACG
AGCACGGCATCGATTCCACCGGACGTTACAGTGGAGACACTGCAGATCTCCAGCTTGAAC
GTATCAATGTCTATTACAATGAAGCTTCAGGTGGAAGATACGTTCCTCGTGCTGTTCT

>AT1G75780
ATCTCCAGATCCCAAAATCTTCATCGATCATCATCATCATGAGAGAAATCCTCCACGTCC
AAGGCGGCCAATGCGGTAACCAAATCGGTTCCAAATTCTGGGAAGTTATCTGCGACGAAC
ACGGCGTTGATCCCACCGGACGTTACAACGGTGATTCCGCCGATCTTCAGCTCGAACGTA
TCAATGTTTATTACAATGAAGCTTCTGGTGGTCGTTACGTTCCTCGTGCTGTTCTCATGG

>AT1G74040
AAAGTAGTAACCAGAGACACTGTGCCGTCGCCCGTCGCCGCCGCCGCCACACTATCATCT
CTCTCAGGTTTTTGATTTTCCACGGCAATGGAGTCTTCGATTCTCAAAAGCCCTAATCTC
TCTTCACCATCGTTCGGTGTACCTTCAATTCCCGCCTTATCCTCCTCCTCCACCTCACCA
TTTTCATCTCTTCATCTCCGATCACAGAACCACCGTACCATCTCTCTTACCACCGCCGGA
AAATTCCGTGTCTCGTATTCTCTCTCCGCTTCTTCACCTCTACCACCTCATGCTCCTCGC
CGTCGTCCCAATTACATCCCTAACCGTATATCCGATCCCAATTACGTCAGAATCTT
sequence retrieval fasta • 8.8k views
ADD COMMENT
1
Entering edit mode

What have you tried so far? Have you looked at the related questions on the right?

ADD REPLY
0
Entering edit mode

Hi have script pick up a sequence per line. Not sure how to modify it so that I can have many sequences per line picked up and then saved in many files each file for 1 line of sequences.

ADD REPLY
0
Entering edit mode

!/usr/bin/perl -w

use strict;

my $idsfile = "gene1.txt"; my $seqfile = "TAIR9seq.fasta"; my %ids = ();

open (OUT,">selectionseqs.fasta");

open FILE, $idsfile; while([?]) { chomp; $ids{$_} += 1; } close FILE;

local $/ = "n>";

open FASTA, $seqfile; while ([?]) { chomp; my $seq = $_; my ($id) = $seq =~ /^>(S+)/; if (exists($ids{$id})) { $seq =~ s/^>.+n//; $seq =~ s/n//g; print "$seqn"; print "$seqn"; print OUT ">$idn$seqn"; } } close FASTA;

ADD REPLY
2
Entering edit mode
12.8 years ago
Neilfws 49k

If I understand the question correctly, you want to create a separate fasta file for each line in your list of IDs and the file will contain the number of sequences listed on the line.

You have 2 choices when trying to match an ID list to a fasta file of sequences.

  1. Get each ID, loop through the sequences, match ID to header and retrieve sequence
  2. Index the sequence file first, then loop through the IDs and retrieve each sequence

The first approach becomes very slow for many sequences, since you have to loop through all sequences for each ID. So it's generally better to index the sequence file.

Install Bioperl and index the fasta file, seqs.fa, using this command:

bp_index -dir . seqs.idx seqs.fa

Now you can open the file of IDs, ids.txt, split each line on space and use bp_fetch to pull each ID in turn from the index and write to a new fasta file. Something like this should work:

#!/usr/bin/perl -w

use strict;

open INFILE, "ids.txt";
while(<INFILE>) {
  chomp;
  my @ids = split(/\s+/);
  my $outfile = join("_", @ids).".fa";
  open OUTFILE, ">$outfile";
  foreach my $id(@ids) {
    print OUTFILE `bp_fetch -dir . seqs.idx:$id`;
  }
  close OUTFILE;
}
close INFILE

In this example, each new fasta file will be named according to the IDs on each line, separated by underscores; e.g. AT1G56390_AT1G09210.fa.

ADD COMMENT
0
Entering edit mode

Hi I tried your script and this is the error I get. Can't exec "bp_fetch": No such file or directory at script_test.pl line 16, [?] line 1. print() on closed filehandle OUTFILE at script_test.pl line 16, [?] line 1. I can not understand why as I am not so familiar with bioperl. Thanks again for your help

ADD REPLY
0
Entering edit mode

It means that the bp_fetch script is not found. When you install Bioperl, that script should be in your PATH at /usr/bin/bp_fetch. Did you actually install Bioperl?

ADD REPLY
1
Entering edit mode
12.8 years ago
Ido Tamir 5.2k

notseq from EMBOSS can read in a multi sequence file and a list of identifiers (wildcards) and write the non matching sequences to one file the matching ones to a different file (-junkout). You want the junkout file. You can specify a file with the identifiers to filter e.g. '@names.dat'.

this is the command that you need:

ruby -nae 'system("notseq -sequence fasta.txt -sformat1 fasta -osformat fasta -osname ignore.fasta -junkoutseq "+ $..to_s()+".matches.fasta -auto -exclude "+$F.join(",")) ' identifiers.txt

ADD COMMENT
0
Entering edit mode

I am not sure if this is what I want to do. I need to have for every line of IDs one fasta file so that I can make a multiple alignment later.

ADD REPLY
0
Entering edit mode

I added a ruby command that does this in a loop. Your genome goes into fasta.txt your identifiers as they are into identifiers.txt.

ADD REPLY
1
Entering edit mode
12.8 years ago
DG 7.3k

If you use Bioperl and SeqIO you can have your fasta file stored as a proper data structure (alternatively you can write your own fasta parser and store the sequences in a hash of your own).

Parsing the IDs is simple, process each line, on each line split on whitespace so you have an array of IDs. Then it is a simple matter of grabbing the sequences from the hash using the ID key and printing them to a file.

I prefer keeping my sequences in a hash data structure myself over Bio::SeqIO in most cases as I can easily pull out sequences by their ID given a list of IDs.

ADD COMMENT
0
Entering edit mode
12.8 years ago

di you try the Sequence Bulk Download and Analysis for Arabidopsis ?

http://www.arabidopsis.org/tools/bulk/sequences/index.jsp ?

ADD COMMENT
0
Entering edit mode

this will only give me a big fasta file with the ids I have in my list. So again will not do what I want.

ADD REPLY

Login before adding your answer.

Traffic: 1603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6