Question

Extracting Sequence From A 3Gb Fasta File

30

Entering edit mode

13.9 years ago

Divya ▴ 260

Hi,
How to extract fasta sequence from an huge 3gb fasta file by giving sequence id as input using perl, Thanks in advance.

sequence retrieval fasta perl • 38k views

ADD COMMENT • link updated 6.2 years ago by andplaia • 0 • written 13.9 years ago by Divya ▴ 260

1

Entering edit mode

I've modified your original question, as it was not very clear. You should put an example of your input file and an example of your output. Is your input file a fasta file?

ADD REPLY • link 13.9 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-05-25

If I read your question correctly, you have many sequences in a large file and you want to retrieve certain sequences by some ID.

One way to do this using Perl is first to index the file. If you install Bioperl and its accessory scripts, you can do this using bp_index.pl:

This example assumes that your fasta sequences are in myfile.fa in the current directory and you want to create the index file, myIndex, also in the current directory:

bp_index.pl -dir . -fmt fasta myIndex myfile.fa

You can then retrieve by ID using bp_fetch.pl. Assuming that you are in the same directory and you want the sequence with ID myID, something like:

bp_fetch.pl -dir . -fmt fasta myIndex:myID

It's been some time since I used these tools, so you should check the syntax and read up on them at the Bioperl website. In particular, it's possible to set up environment variables which point to a location with sequence indices, which can make life easier.

Ram · Answer 2 · 2010-10-20

14

Entering edit mode

13.5 years ago

Yahan ▴ 400

What about calling fastacmd through a system command?

First build the database using

formatdb -i pathtodb -p F -o T

the last option is important.

Next you can use

fastacmd -d pathtodatabase -s seqID1,seqID2,...

option -L start,stop allows to retrieve subsequence

integrating in a system() command makes it useable in perl code.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.5 years ago by Yahan ▴ 400

1

Entering edit mode

just my 2cents..

fastacmd -d pathtodatabase -i list_of_seq_ids_to_search_in_a_file.txt

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 12.9 years ago by Naga ▴ 450

0

Entering edit mode

For me it was the first place to search for it but it has limited functionality due to the 1st line format requirement. It have to be not simply fasta file. See here

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 12.3 years ago by boczniak767 ▴ 850

0

Entering edit mode

It seems that sequence header could not be in form:

GRMZM2G030710T01|PACid:19501430 When I changed it to GRMZM2G030710T01; PACid:19501430 this two commands works ok :-)

ADD REPLY • link 12.3 years ago by boczniak767 ▴ 850

Ram · Answer 3 · 2010-07-13

This Also works if you have BioPerl. You need to have the IDs that you want to extract them from fasta file in a separate file. You can use grep to generate that file.

#!/usr/bin/perl

#######################################################

#  Author: Hamid Ashrafi                              #
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with
# the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in
# the FASTA file.
#
#Requirement. Bio Perl
######################################################



use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library;
my %records;
open IDFILE, "NBS_LRR_IDs.txt" or die $!;
open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!;
#  name of the library file - (here it is hardcoded)
$fasta_library = 'NC_003070_all_chr.faa';

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";


# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {



      my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
      my $header = $database->header($id);
      #print "$header\n";
      print ">$header\n", $database->seq( $id ), "\n";
      print OUTPUT  ">$header\n", $database->seq( $id ), "\n";
}

exit;

Ram · Answer 4 · 2010-05-25

Actually, I use this perl script to extract some genes from genome, but I don't know it would be working in your case.

#!/usr/bin/perl

use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

#you can input you genes' name in the array or read it from other files.

my @genes_name=qw(FGSG_01562 FGSG_02646 FGSG_05604 FGSG_05896 FGSG_06871 FGSG_07409  FGSG_07952 FGSG_08962 FGSG_09852 FGSG_09945 FGSG_10286 FGSG_11792);

my $filename='fusarium_graminearum_3_proteins.fasta';

my $gb = Bio::SeqIO->new(-file   => "<$filename",
                              -format => "fasta");
my $fa = Bio::SeqIO->new(-file   => ">gball.fa",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) {

    #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"

    $fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);

}

Ram · Answer 5 · 2010-05-26

If you can't install BioPerl, this vanilla Perl should work quite fast if you expect only one exact ID match:

#!/usr/bin/perl

# usage:  extractSeqByID.pl SEQ123 < huge.fsa > my.fsa

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
        last;
    }
}

Ram · Answer 6 · 2012-09-12

6

Entering edit mode

11.6 years ago

AGS ▴ 250

Simplest and I'd wager the fastest method:

Use Jim Kent's utils.

Awesome Tools

Specifically, faSomeRecords:

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
 -exclude - output sequences not in the list file.

Or, if it really is only one sequence you are after, use faOneRecord

---> faOneRecord
faOneRecord - Extract a single record from a .FA file
usage:
 faOneRecord in.fa recordName

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 11.6 years ago by AGS ▴ 250

0

Entering edit mode

The faSomeRecord is truly amazing.. Don't require to install BIOpython and has worked so rapidly, that I initially believe that it did not work at all

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 8.2 years ago by Antonio R. Franco ★ 5.1k

Ram · Answer 7 · 2011-01-27

I needed to do the same thing, but I want to use a file with all the IDs in it, that is taken as an argument of the program (at the end I want to create automatically IDfiles and to a pipeline)

I am a perl beginner, so it might be not clean, but I modified the script above and in my hands it is working.

And, also, thank you for this script, H. Ashrafi!

#!/usr/bin/perl

#######################################################
# Author: Hamid Ashrafi                              
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in the FASTA file.
#
# Requirement. Bio Perl
#
# Modified by Aurelie K
#   -> put the names of files in arguments (and also use the STDOUT for the output file)
#   -> suppr index file generated during the script
######################################################

#usage = path_to/fasta_getseq.pl path_to/fasta_library path_to/IDFILE > log

if ($#ARGV != 1) {
 print "usage:  fasta_getseq.pl IDFILE fasta_library > outputfile\n";
 exit;
}

use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library = $ARGV[1];   #path to fastalibrary in the second argument
my %records;

open IDFILE, "<$ARGV[0]" or die $!; #first argument is the path of the file containing all the IDs you need to extract
open OUTPUT, <STDOUT>;

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";

# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {

  my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
  my $header = $database->header($id);
  #print "$header\n";
  print ">$header\n", $database->seq( $id ), "\n";
  print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}

#remove the index file that is useless for user
unlink "$fasta_library.index";

#close the filehandles
close IDFILE;
close OUTPUT;
exit;

Ram · Answer 8 · 2011-02-25

I use the following codes to extract fastas from a database and print each of my targets into different fastas.

#!/usr/bin/perl 
use Bio::Perl;
use Bio::SeqIO;
 use IO::String;
 use Bio::SearchIO;
 open TF,'File_contain_target_fasta_IDs.txt' 
 or die "Couldn't open this file: $!";   
 chomp (@genes_name=<TF>); 
 my $filename='Fasta_database.fasta';

      my $gb = Bio::SeqIO->new(-file   => "<$filename",
                                -format => "fasta");
     while($seq = $gb->next_seq) { 
        my @temp_array=grep {$seq->id eq $_}@genes_name;
        #this step is import for correct output
      foreach (@temp_array){
      #Only one element is given to $string, which is the key point for writing single sequence into one fasta file
         my $string=$_;
         #print $string,"\n";
         my $stringio = IO::String->new($string);
         #If the foreach function is not used here, the defaut $_, actually, here would be many $_ been transfered into $string, and this is resulted in wrong sequences written into one fasta file.
           my $out = Bio::SeqIO->new(-fh => $stringio,
                                     -format => 'fasta');
           # output goes into $string
           $out->write_seq($seq);
           # modify $string
           #$string =~ s|(>)(\w+)|$1$2|g;
           # print into STDOUT
           #print $string;  
           #subrountin for writing sequences into separated fasta file
           write_into_fasta($seq->id,$string);
                      }
        }

sub write_into_fasta {
    my $fh=$_[0];
    open $fh,">$fh.fasta" or die "Couldn't open this file: $!";
    print $fh $_[1],"n";
  close $fh;
}

score 1 · Answer 9 · 2011-05-29

in terms of performance, from all open-source options that I've tried, the best so far has been cdbfasta:
http://sourceforge.net/projects/cdbfasta/
You first create an index, and then put all your sequence ids in a text file, and use cdbyank to fetch them like this:

/path/to/cdbyank big_file.fasta.cidx < list_of_ids.txt -o my_ids.fasta

score 0 · Answer 10 · 2010-10-20

0

Entering edit mode

13.5 years ago

Manisha • 0

but if i have multiple sequences to be extracted then what should i do? if i put all sequence id in one text file and give it as input then the above program is not working

ADD COMMENT • link 13.5 years ago by Manisha • 0

2

Entering edit mode

please, ask a new question for this problem, citing this post.

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I edited my answer

ADD REPLY • link 12.9 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

I've written a tool (called xcerpt) which takes a list of read ID's, and then scans a fasta file to extract them. No indexing or other work necessary, but of course, every time you call it, it will scan the whole file until all requested seqences are found (3GB ~ half a minute). http://hackage.haskell.org/package/clustertools

ADD REPLY • link 12.9 years ago by Ketil 4.1k

Ram · Answer 11 · 2011-05-29

I have changed a few of cheng zhongshan's script.just added to read from a file where i expect the ids should by separated by new line. here it goes:

#!/usr/bin/perl -w
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

my $input = shift;
my $id_file = shift;
my $output = shift;
my @genes_name;

open(FILE,$id_file) or die "Can't open $id_file";
while (<FILE>) {
my $id =$_;
chomp($id);
push @genes_name, $id;
}

my $gb = Bio::SeqIO->new(-file   => "<$input",
                              -format => "fasta");

my $fa = Bio::SeqIO->new(-file   => ">$output",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) { 
 #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"
$fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);
}

Ram · Answer 12 · 2011-05-30

0

Entering edit mode

12.9 years ago

Ryan Thompson ★ 3.6k

It's not Perl, but pyfasta is really easy to use for this purpose. From the help text:

Usage: extract some sequences from a fasta file. e.g.:
               pyfasta extract --fasta some.fasta --header at2g26540 at3g45640

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 12.9 years ago by Ryan Thompson ★ 3.6k

Ram · Answer 13 · 2012-09-12

0

Entering edit mode

11.6 years ago

Ketil 4.1k

The easiest is probably to use 'agrep':

agrep -d '^>' regex

will extract all sequences matching the regular expression regex. If you have a rough idea of sequence length, and you don't have agrep installed, you can use -A option to grep to look for the header line and extract a fixed number of lines after it. Or you can use awk:

awk 'BEGIN { RS=">" } /regex/ { print }'

which is equivalent to the agrep command.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 11.6 years ago by Ketil 4.1k

0

Entering edit mode

This code, although interesting, is risky to use. First, it removes the '>' in front of the sequence name. Second, it matches anything between the record markers, not only the name. What if my sequence name is ATT and ATT is found in many of the sequences or, worst, one of the sequences is ATT? Third, unless you specify a beginning of line and end of line in your regex, it will also recover partial matches (like getting sequence222 and sequence22 when searching sequence2).

ADD REPLY • link 10.8 years ago by Eric Normandeau 11k

1

Entering edit mode

for i in `cat name.txt`;do awk 'BEGIN{RS=">"} /'$i'/{print ">" $0}' test.fas;done

This code can be used for multiple name searching.

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 10.4 years ago by mayingfei ▴ 10

score 0 · Answer 14 · 2018-01-31

Hi all, so I know this post is from a long time ago, but I,ve been searching for an easier way to do this, and in the original question he says he wants to use perl... but there's the, not that new, easel miniapps that comes with HMMER, first you need to index your fasta with

 esl-sfetch --Index seq_file.fasta

and then you simply do this command

 esl-sfetch seq_file.fasta "headeryouwanttoretrieve"

or from a list file with the headers of the sequences you want (with no ">")

 esl-sfetch -f seq_file.fasta myhits.list > myhits.fasta

you can see more of this in here:

https://cryptogenomicon.org/2011/05/03/extracting-hmmer-results-to-sequence-files-easel-miniapplications/