Question

Extract Sequence From Fasta File Using Ids From A Separate File

3

Entering edit mode

10.7 years ago

IsmailM ▴ 110

Hi guys,

As shown below, I have two files a fasta file (containing a sequences and sequence IDs) and a text file containing a list of sequence ID.

I would like to extract the respective sequence for each of the sequence IDs in the text file. I have tried a number of times, but I simply cannot even open the fasta file - I'm a total beginner with regards to bioruby - if you could explain any answers in detail, I would be highly grateful.

EDIT: Could you give your answers using Ruby or BioRuby. Thanks

e.g. EXAMPLE FASTA FILE

>CATH_RAT
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
>CATH_MOUSE
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
>CATH_HUMAN
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNMKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
>CATH_MONKEY
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV

EXAMPLE .TXT FILE

   >CATH_MOUSE
   >CATH_HUMAN

Many Thanks

fasta • 29k views

ADD COMMENT • link updated 6.6 years ago by BPors ▴ 60 • written 10.7 years ago by IsmailM ▴ 110

28

Entering edit mode

10.7 years ago

matted 7.8k

Why does it need to be Ruby?

Here's a quick shell command using samtools (better than a buggy one-off Perl/Python/Ruby script in my opinion...):

cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA

ADD COMMENT • link 10.7 years ago by matted 7.8k

0

Entering edit mode

how should we extract the multiple sequence in fasta format by using the gene_id through samtools?

ADD REPLY • link 7.9 years ago by Bulbul Ahmed ▴ 20

0

Entering edit mode

@Bulbul Did you find the solution?

ADD REPLY • link 6.7 years ago by bioinfo8 ▴ 230

0

Entering edit mode

@matted Would you please elaborate?

ADD REPLY • link 6.7 years ago by bioinfo8 ▴ 230

4

Entering edit mode

10.5 years ago

AGS ▴ 250

The best tool for this, imo, is Jim Kent's faSomeRecords.

While the OP wanted a Ruby-based solution, I think listing this answer for future reference is worthwhile.

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

ADD COMMENT • link 10.5 years ago by AGS ▴ 250

0

Entering edit mode

Where can I find this software?

ADD REPLY • link 8.9 years ago by Endre Bakken Stovner ▴ 960

0

Entering edit mode

It's on UCSC

ADD REPLY • link 8.9 years ago by PoGibas 5.1k

0

Entering edit mode

http://hgdownload.cse.ucsc.edu/admin/exe/

ADD REPLY • link 7.1 years ago by Ian 6.0k

0

Entering edit mode

you can install it via anaconda: conda install ucsc-fasomerecords

ADD REPLY • link 7.0 years ago by Ömer An ▴ 260

1

Entering edit mode

10.7 years ago

Ashutosh Pandey 12k

This post is similar. You may have to tweak your code a little.

http://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file

ADD COMMENT • link 10.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Many thanks is it possible to do something similar in ruby???

It just that I don't really know any python

ADD REPLY • link 10.7 years ago by IsmailM ▴ 110

1

Entering edit mode

10.7 years ago

Fedor Gusev ▴ 210

Here is a script I use:

#!/usr/bin/env ruby

require 'bio'
require 'trollop'

opts = Trollop::options do
  opt :get, "Contigs to get", :type => :string, :required => true
end

get = File.readlines(opts[:get]).map { |x| x.chomp }

Bio::FlatFile.foreach(ARGF) do |entry|
  id = entry.definition.split.first # no idea why I did not use entry.entry_id
  puts entry if get.include?(id)
end

For your data you can run it like this:

ruby ./fasta-get-contigs -g <(cat example.txt | sed 's/^>//') example.fasta > result.fasta

Or you can remove leading > in the script.

But I second samtools faidx answer from @matted: from my experience, fasta parser in bioruby is too slow.

ADD COMMENT • link 10.7 years ago by Fedor Gusev ▴ 210

1

Entering edit mode

6.6 years ago

BPors ▴ 60

If you had only the sequences in the example.txt instead of the identifiers, how would you grab the identifiers and their sequences? ( in this case, the purpose was retrieving the sequences with identifiers, what I am asking is retrieving headers and sequencers with known sequences

ADD COMMENT • link 6.6 years ago by BPors ▴ 60

0

Entering edit mode

In that case you can use BBMap's filterbysequence.sh:

filterbysequence.sh in=a.fa ref=b.fa out=filtered.fa

However, the files have to be valid fasta or fastq.

ADD REPLY • link 6.6 years ago by Brian Bushnell 20k

0

Entering edit mode

10.7 years ago

always_learning ★ 1.1k

           #use/bin/perl 
           use Data::Dumper;
           %val=();
           open (IN1,"$ARGV[0]");
           open (IN2,"id.txt");
            while(<IN1>){
       chomp $_;
       @file= split(">",$_);
    $file[0]=~ s/\s*$//g;
    if ($file[0] ne ''){
    $seq = $file[0];
    }
    $file[1]=~ s/\s*$//g;
     if ($file[1] ne ''){
     $id = $file[1];
}
$val{$id} = $seq;
}
    while(<IN2>){
        $one = $_;
        $one =~ s/\s*$//g;
        print "$one\n";
        print "\n";
        #print Dumper $one;
        print  "$val{$one}\n";
            }

Run this perl file.pl fasta.txt

ADD COMMENT • link 10.7 years ago by always_learning ★ 1.1k

0

Entering edit mode

Another Perl solution:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use extractSeq.pl LIST FASTA OUT\n";

my $list = shift @ARGV;
my $fasta = shift @ARGV;
my $out = shift @ARGV;
my %select;
open L, "$list" or die;
while (<L>) {
    chomp;
    s/>//g;
    $select{$_} = 1;
}
close L;

$/ = "\n>";
open O, ">$out" or die;
open F, "$fasta" or die;
while (<F>) {
    s/>//g;
    my ($id) = split (/\n/, $_);
    print O ">$_" if (defined $select{$id});
}
close F;
close O;

ADD REPLY • link 10.7 years ago by JC 13k

Ram · Accepted Answer · 2015-07-22

This is what I ended up doing - by using 'pure' ruby.

require 'bio'
# create a hash of the Fasta file
def analyse_input_file(input_file)
  input_read = {}
  biofastafile = Bio::FlatFile.open(Bio::FastaFormat, input_file)
  biofastafile.each_entry do |entry|
    input_read[entry.entry_id] = entry.seq
  end
  input_read
end

# Iterate ID file and look up in hash
def iterate_id_file(id_file, input_read, output_file)
  ids = File.read(id_file)
  ids.each_line do |id|
    # remove the inital > and the end of line character.
    id = id.gsub!(/^>/, '').chomp
    seq = input_read[id]
    File.open(output_file, 'a+') do |f|
      f.puts ">#{id}"
      f.puts seq
    end
  end
end

# Set up
input_file = ARGV[0] # '/Volumes/Data/input_file.fa'
id_file = ARGV[1] # '/Volumes/Data/ids.txt'
output_file = "#{input_file}.out"

# Set up
input_read = analyse_input_file(input_file)
iterate_id_file(id_file, input_read, output_file)

Just a quick explanation on how the above works... (I know I would have found this really helpful, when I asked the question)

The main concept used here is the HASH (also known as a dictionary).

The basic arrangement of a Hash is as follows

Hash_name = { 'key_1' => 'value_1', 'key_2' => 'value_2'  }

This is also written as:

Hash_name = { key_1: 'value_1', key_2: 'value_2'}

The main benefit of Hashes is that you can quickly call a value by passing the respective key as follows

puts Hash_name[key_1]
# => value_1

puts Hash_name[key_2]
# => value_2

In the analyse_input_file method, we created a hash where the key was the sequence id (i.e. entry.entry_id) and the sequence itself was the value (i.e. entry.seq)

Then in the iterate_id_file, we simple passed the id to the HASH to get the corresponding sequence....

Simples...