Extracting Sequence From A 3Gb Fasta File
14
30
Entering edit mode
13.9 years ago
Divya ▴ 260

Hi,
How to extract fasta sequence from an huge 3gb fasta file by giving sequence id as input using perl, Thanks in advance.

sequence retrieval fasta perl • 38k views
ADD COMMENT
1
Entering edit mode

I've modified your original question, as it was not very clear. You should put an example of your input file and an example of your output. Is your input file a fasta file?

ADD REPLY
27
Entering edit mode
13.9 years ago
Neilfws 49k

If I read your question correctly, you have many sequences in a large file and you want to retrieve certain sequences by some ID.

One way to do this using Perl is first to index the file. If you install Bioperl and its accessory scripts, you can do this using bp_index.pl:

This example assumes that your fasta sequences are in myfile.fa in the current directory and you want to create the index file, myIndex, also in the current directory:

bp_index.pl -dir . -fmt fasta myIndex myfile.fa

You can then retrieve by ID using bp_fetch.pl. Assuming that you are in the same directory and you want the sequence with ID myID, something like:

bp_fetch.pl -dir . -fmt fasta myIndex:myID

It's been some time since I used these tools, so you should check the syntax and read up on them at the Bioperl website. In particular, it's possible to set up environment variables which point to a location with sequence indices, which can make life easier.

ADD COMMENT
0
Entering edit mode

Hi Neilfws,

I used below cmd to extract RNA seq from fasta file using ID number from text file:

bp_index.pl -dir . -fmt fasta myIndex ./nr_latest.fasta
time bp_fetch.pl -dir . -fmt fasta myIndex:sequence.gi.txt

### Error: Sequence sequence.gi.txt in Database myIndex is not present.

Can you please give me any suggestion.

I would really appreciate your help.

Thanka in advance!

Naresh

ADD REPLY
14
Entering edit mode
13.5 years ago
Yahan ▴ 400

What about calling fastacmd through a system command?

First build the database using

formatdb -i pathtodb -p F -o T

the last option is important.

Next you can use

fastacmd -d pathtodatabase -s seqID1,seqID2,...

option -L start,stop allows to retrieve subsequence

integrating in a system() command makes it useable in perl code.

ADD COMMENT
1
Entering edit mode

just my 2cents..

fastacmd -d pathtodatabase -i list_of_seq_ids_to_search_in_a_file.txt
ADD REPLY
0
Entering edit mode

For me it was the first place to search for it but it has limited functionality due to the 1st line format requirement. It have to be not simply fasta file. See here

ADD REPLY
0
Entering edit mode

It seems that sequence header could not be in form:

GRMZM2G030710T01|PACid:19501430 When I changed it to GRMZM2G030710T01; PACid:19501430 this two commands works ok :-)

ADD REPLY
8
Entering edit mode
13.8 years ago

This Also works if you have BioPerl. You need to have the IDs that you want to extract them from fasta file in a separate file. You can use grep to generate that file.

#!/usr/bin/perl

#######################################################

#  Author: Hamid Ashrafi                              #
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with
# the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in
# the FASTA file.
#
#Requirement. Bio Perl
######################################################



use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library;
my %records;
open IDFILE, "NBS_LRR_IDs.txt" or die $!;
open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!;
#  name of the library file - (here it is hardcoded)
$fasta_library = 'NC_003070_all_chr.faa';

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";


# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {



      my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
      my $header = $database->header($id);
      #print "$header\n";
      print ">$header\n", $database->seq( $id ), "\n";
      print OUTPUT  ">$header\n", $database->seq( $id ), "\n";
}

exit;
ADD COMMENT
6
Entering edit mode
13.9 years ago

Actually, I use this perl script to extract some genes from genome, but I don't know it would be working in your case.

#!/usr/bin/perl

use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

#you can input you genes' name in the array or read it from other files.

my @genes_name=qw(FGSG_01562 FGSG_02646 FGSG_05604 FGSG_05896 FGSG_06871 FGSG_07409  FGSG_07952 FGSG_08962 FGSG_09852 FGSG_09945 FGSG_10286 FGSG_11792);

my $filename='fusarium_graminearum_3_proteins.fasta';

my $gb = Bio::SeqIO->new(-file   => "<$filename",
                              -format => "fasta");
my $fa = Bio::SeqIO->new(-file   => ">gball.fa",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) {

    #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"

    $fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);

}
ADD COMMENT
6
Entering edit mode
13.9 years ago
Hanif Khalak ★ 1.3k

If you can't install BioPerl, this vanilla Perl should work quite fast if you expect only one exact ID match:

#!/usr/bin/perl

# usage:  extractSeqByID.pl SEQ123 < huge.fsa > my.fsa

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
        last;
    }
}
ADD COMMENT
6
Entering edit mode
11.6 years ago
AGS ▴ 250

Simplest and I'd wager the fastest method:

Use Jim Kent's utils.

Awesome Tools

Specifically, faSomeRecords:

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
 -exclude - output sequences not in the list file.

Or, if it really is only one sequence you are after, use faOneRecord

---> faOneRecord
faOneRecord - Extract a single record from a .FA file
usage:
 faOneRecord in.fa recordName
ADD COMMENT
0
Entering edit mode

The faSomeRecord is truly amazing.. Don't require to install BIOpython and has worked so rapidly, that I initially believe that it did not work at all

ADD REPLY
2
Entering edit mode
13.2 years ago
4Urelie ▴ 20

I needed to do the same thing, but I want to use a file with all the IDs in it, that is taken as an argument of the program (at the end I want to create automatically IDfiles and to a pipeline)

I am a perl beginner, so it might be not clean, but I modified the script above and in my hands it is working.

And, also, thank you for this script, H. Ashrafi!

#!/usr/bin/perl

#######################################################
# Author: Hamid Ashrafi                              
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in the FASTA file.
#
# Requirement. Bio Perl
#
# Modified by Aurelie K
#   -> put the names of files in arguments (and also use the STDOUT for the output file)
#   -> suppr index file generated during the script
######################################################

#usage = path_to/fasta_getseq.pl path_to/fasta_library path_to/IDFILE > log

if ($#ARGV != 1) {
 print "usage:  fasta_getseq.pl IDFILE fasta_library > outputfile\n";
 exit;
}

use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library = $ARGV[1];   #path to fastalibrary in the second argument
my %records;

open IDFILE, "<$ARGV[0]" or die $!; #first argument is the path of the file containing all the IDs you need to extract
open OUTPUT, <STDOUT>;

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";

# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {

  my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
  my $header = $database->header($id);
  #print "$header\n";
  print ">$header\n", $database->seq( $id ), "\n";
  print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}

#remove the index file that is useless for user
unlink "$fasta_library.index";

#close the filehandles
close IDFILE;
close OUTPUT;
exit;
ADD COMMENT
0
Entering edit mode

Good!It worked very well!

ADD REPLY
1
Entering edit mode
13.2 years ago

I use the following codes to extract fastas from a database and print each of my targets into different fastas.

#!/usr/bin/perl 
use Bio::Perl;
use Bio::SeqIO;
 use IO::String;
 use Bio::SearchIO;
 open TF,'File_contain_target_fasta_IDs.txt' 
 or die "Couldn't open this file: $!";   
 chomp (@genes_name=<TF>); 
 my $filename='Fasta_database.fasta';

      my $gb = Bio::SeqIO->new(-file   => "<$filename",
                                -format => "fasta");
     while($seq = $gb->next_seq) { 
        my @temp_array=grep {$seq->id eq $_}@genes_name;
        #this step is import for correct output
      foreach (@temp_array){
      #Only one element is given to $string, which is the key point for writing single sequence into one fasta file
         my $string=$_;
         #print $string,"\n";
         my $stringio = IO::String->new($string);
         #If the foreach function is not used here, the defaut $_, actually, here would be many $_ been transfered into $string, and this is resulted in wrong sequences written into one fasta file.
           my $out = Bio::SeqIO->new(-fh => $stringio,
                                     -format => 'fasta');
           # output goes into $string
           $out->write_seq($seq);
           # modify $string
           #$string =~ s|(>)(\w+)|$1$2|g;
           # print into STDOUT
           #print $string;  
           #subrountin for writing sequences into separated fasta file
           write_into_fasta($seq->id,$string);
                      }
        }

sub write_into_fasta {
    my $fh=$_[0];
    open $fh,">$fh.fasta" or die "Couldn't open this file: $!";
    print $fh $_[1],"n";
  close $fh;
}
ADD COMMENT
1
Entering edit mode
12.9 years ago

in terms of performance, from all open-source options that I've tried, the best so far has been cdbfasta:
http://sourceforge.net/projects/cdbfasta/
You first create an index, and then put all your sequence ids in a text file, and use cdbyank to fetch them like this:

/path/to/cdbyank big_file.fasta.cidx < list_of_ids.txt -o my_ids.fasta

ADD COMMENT
0
Entering edit mode
13.5 years ago
Manisha • 0

but if i have multiple sequences to be extracted then what should i do? if i put all sequence id in one text file and give it as input then the above program is not working

ADD COMMENT
2
Entering edit mode

please, ask a new question for this problem, citing this post.

ADD REPLY
0
Entering edit mode

I edited my answer

ADD REPLY
0
Entering edit mode

I've written a tool (called xcerpt) which takes a list of read ID's, and then scans a fasta file to extract them. No indexing or other work necessary, but of course, every time you call it, it will scan the whole file until all requested seqences are found (3GB ~ half a minute). http://hackage.haskell.org/package/clustertools

ADD REPLY
0
Entering edit mode
12.9 years ago
Taslima • 0

I have changed a few of cheng zhongshan's script.just added to read from a file where i expect the ids should by separated by new line. here it goes:


#!/usr/bin/perl -w
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

my $input = shift;
my $id_file = shift;
my $output = shift;
my @genes_name;

open(FILE,$id_file) or die "Can't open $id_file";
while (<FILE>) {
my $id =$_;
chomp($id);
push @genes_name, $id;
}

my $gb = Bio::SeqIO->new(-file   => "<$input",
                              -format => "fasta");

my $fa = Bio::SeqIO->new(-file   => ">$output",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) { 
 #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"
$fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);
}
ADD COMMENT
0
Entering edit mode
12.9 years ago
Ryan Thompson ★ 3.6k

It's not Perl, but pyfasta is really easy to use for this purpose. From the help text:

Usage: extract some sequences from a fasta file. e.g.:
               pyfasta extract --fasta some.fasta --header at2g26540 at3g45640
ADD COMMENT
0
Entering edit mode
11.6 years ago
Ketil 4.1k

The easiest is probably to use 'agrep':

agrep -d '^>' regex

will extract all sequences matching the regular expression regex. If you have a rough idea of sequence length, and you don't have agrep installed, you can use -A option to grep to look for the header line and extract a fixed number of lines after it. Or you can use awk:

awk 'BEGIN { RS=">" } /regex/ { print }'

which is equivalent to the agrep command.

ADD COMMENT
0
Entering edit mode

This code, although interesting, is risky to use. First, it removes the '>' in front of the sequence name. Second, it matches anything between the record markers, not only the name. What if my sequence name is ATT and ATT is found in many of the sequences or, worst, one of the sequences is ATT? Third, unless you specify a beginning of line and end of line in your regex, it will also recover partial matches (like getting sequence222 and sequence22 when searching sequence2).

ADD REPLY
1
Entering edit mode
for i in `cat name.txt`;do awk 'BEGIN{RS=">"} /'$i'/{print ">" $0}' test.fas;done

This code can be used for multiple name searching.

ADD REPLY
0
Entering edit mode
6.2 years ago
andplaia • 0

Hi all, so I know this post is from a long time ago, but I,ve been searching for an easier way to do this, and in the original question he says he wants to use perl... but there's the, not that new, easel miniapps that comes with HMMER, first you need to index your fasta with

 esl-sfetch --Index seq_file.fasta

and then you simply do this command

 esl-sfetch seq_file.fasta "headeryouwanttoretrieve"

or from a list file with the headers of the sequences you want (with no ">")

 esl-sfetch -f seq_file.fasta myhits.list > myhits.fasta

you can see more of this in here:

https://cryptogenomicon.org/2011/05/03/extracting-hmmer-results-to-sequence-files-easel-miniapplications/

ADD COMMENT

Login before adding your answer.

Traffic: 2450 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6