Extract Fasta Sequences Sub Sets
9
5
Entering edit mode
10.5 years ago
prp291 ▴ 70

I have a FASTA file with several sequences, like this:

>AT1G01250.1 | Symbols: | Integrase-type DNA-binding superfamily protein | chr1:104731-105309 REVERSE LENGTH=192 MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL

>AT1G03800.1 | Symbols: ERF10, ATERF10 | ERF domain protein 10 | chr1:957261-957998 REVERSE LENGTH=245 MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP

I have another file like this

AT1G01250   45  102
AT1G03800   65  109

Now I want to extract the sequences from file using the coordinates given in file 2. For example, I want to extract the portion of >AT1G01250 from position 45 to position 102. Any Help will be greatly appreciated. I am a Windows user.

fasta • 15k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
7
Entering edit mode
10.5 years ago
  1. Install linux .
  2. a little perl script might be appropried ( load the fasta file (with bioperl) and the position file, cut with substring, write in a file )
ADD COMMENT
9
Entering edit mode

Serious question, why are people up voting this answer? No offense, but it does not come close to answering the question and is technically incorrect. This is especially confusing because there are multiple answers to the question. Is it because people think "install linux" is funny? I don't think that is very helpful and I wish we as a community would use votes more judiciously to promote responses that are correct.

ADD REPLY
0
Entering edit mode

The question is serious, and so is the answer. It answers a meta-question that is implicit from the OP. Namely, that person is in a research context where he/she will have to play with fasta files or other genetic-oriented formats and to some manipulations that require scripts or custom approaches based on ongoing investigation of the data. In such a context, the meta-question is: What tools should I use to explore such data sets in a flexible and powerful manner? The answer is to install and use Linux or another UNIX-compatible system (eg: MacOSX) and learn some programming skills (R, Python, Perl...). I do agree however that this is not an answer to the OP per se and we could encourage such suggestions to be posted as comments instead, where their value would still be high without distracting from the OP getting a usable answer.

ADD REPLY
3
Entering edit mode

Certainly, this response answers a question, but not the question, which you seem to agree with. That is not the issue though, and I don't have a problem with the answer itself. The issue is that there are actual working answers to the question and this response has the most up votes. I don't feel that it's too idealistic to think that the most helpful answers are the ones that should be promoted and rewarded.

ADD REPLY
0
Entering edit mode

You don't need linux to use Perl or Bioperl.

ADD REPLY
0
Entering edit mode

I know but for bioinformatics yes ;)

ADD REPLY
0
Entering edit mode

Yes I know that but I am looking for that PERL script If any one can help me with

ADD REPLY
2
Entering edit mode

Yes, but next week you will be looking for another Perl script, and then another, and another. I think you would do yourself a big favour if you take the time to learn Linux and either Python or Perl. If you think bioinformatics will be important in your project, make this priority number 1. In the long run, you will a) save time b) do better science c) explore more interesting questions and d) have a better career. ;)

ADD REPLY
6
Entering edit mode
10.5 years ago

using windows and your browser: if your set of sequences is not SOOO big, you can try to use the following HTML5 page (adapted from How to compute the nucleotide composition of large number of sequences (preferably with anonline tool))

ADD COMMENT
0
Entering edit mode

Thank you very much for your great help

ADD REPLY
4
Entering edit mode
10.5 years ago

The following will work in R, which you presumably already have installed if you're doing anything with bioinformatics. This assumes the fasta file is called "blah.fa" and the regions are stored in a file called "regions.txt". I should note that I changed the first column of the regions so that they match the fasta file (e.g., change AT1G01250 to AT1G01250.1), though you could do this in R or do the matching differently.

library(seqinr)

f <- read.fasta("blah.fa", seqtype="AA")
regs <- read.table("regions.txt", header=F)

getAAseqs <- function(x) {
    idx <- which(names(f) == x[1])
    paste(f[[idx]][x[2]:x[3]], collapse="")
}
subseqs <- apply(regs, 1, getAAseqs)

It's not pretty, but it'll work.

ADD COMMENT
3
Entering edit mode
10.5 years ago

Hi- Let's try this one. It's python so it should work on Windows with any line terminator

#!/usr/bin/env python

import sys
import re

FASTA= sys.argv[1]
BED= sys.argv[2]

fasta= open(FASTA, 'U')
fasta_dict= {}
for line in fasta:
    line= line.strip()
    if line == '':
        continue
    if line.startswith('>'):
        seqname= line.lstrip('>')
        seqname= re.sub('\..*', '', seqname)
        fasta_dict[seqname]= ''
    else:
        fasta_dict[seqname] += line
fasta.close()

bed= open(BED, 'U')
for line in bed:
    line= line.strip().split('\t')
    outname= line[0] + ':' + line[1] + '-' + line[2]
    print('>' + outname)
    s= int(line[1])
    e= int(line[2])
    print(fasta_dict[line[0]][s:e])
bed.close()
sys.exit()

Save it as getFasta.py and execute it as:

python getFasta.py in.fasta in.bed > out.fasta

Note that it strips from the names of the fasta sequences everything after the dot to comply with the example sequences. The fasta file is read in memory so not very clever but it's quick if the bed file has million of intervals to extract.

ADD COMMENT
0
Entering edit mode

I tried this solution also. It's working fine

ADD REPLY
1
Entering edit mode
10.5 years ago
arnstrm ★ 1.8k

This is the dirty way to do it in bash command line. First, convert fasta to tabular form (with just ID and Sequence)

cut -d " " -f 1 sequences.fa | tr -s "\n" "\t"| sed -s 's/>/\n/g' > sequences.tab

Then use the coordinates file to cut the desired portion of the sequence

while read id start end; do \
g=$(grep "$id" sequences.tab | cut -f 2 | cut -c $start-$end);\
echo ">$id";\
echo $g;\
done<coordinates.txt

This will output:

>AT1G01250
YHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPE
>AT1G03800
AAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP

I know this is not the best method, but I feel it is easier doing it with bash.

ADD COMMENT
3
Entering edit mode

I wish I could give this a vote for effort, but there is nothing "easier" in your answer. The OP is using Windows, so you're off-base, and you're taking advantage of features (newlines where OP will have carriage returns, no whitespace in the sequence names) that are not guaranteed. In addition, you'll lose all line wrapping from the original file. You mention that you know this is not the best method, so I hope you can take this as constructive criticism and don't get too discouraged :)

ADD REPLY
0
Entering edit mode

Sorry, I missed the "windows" part! But yes, this solution works (if you're in UNIX environment AND using above sequence(s)). I said it "easier" because it only has two one-liners.

ADD REPLY
0
Entering edit mode
10.5 years ago
always_learning ★ 1.1k

Dear , Here is the Perl solution for this and can run it on Windows easily.

#user/bin/perl 

open (IN1,"$ARGV[0]");
open (IN2, "$ARGV[1]");
while(<IN1>){
if( /(^>.[\d|\w].*\.).*(=.*$i)/){
$fir = $1;
$last = $2;
$fir =~ s/[\>|\.]//g;
$last =~ s/[=|\d|\s]//g;
$val{$fir} = $last;
             }
                                   }
 while(<IN2>){
 @bed = split("\t", $_);
  $string = substr ($val{$bed[0]}, $bed[1], $bed[2]-$bed[1]);
   print "$string\n";
                }

Run this code as perl code-name.pl test.fasta bed.txt.

ADD COMMENT
1
Entering edit mode

This code won't even compile and it needs a bit of effort to be a working script. There are many problems, starting with the first line. I recommend testing things before posting, and coding in a modern style so your script will work on other computers (and with recent versions of Perl). If you use modern pragmas (lexical variables, enable strictures and warnings, perform tests on arguments, etc.) you will save a lot of time debugging.

ADD REPLY
0
Entering edit mode

Yes !! Thanks for recommendation. But this code is all working on Unix Machine :) :) !! Yes I didn't write it as most optimum way and didn't add warning and all but reason was to give a workable code only !! Thanks again !!

ADD REPLY
0
Entering edit mode

Interesting that you say it works. What version of Perl are you using, out of curiosity? I don't see how this could even compile (unless your Perl is fairly old).

ADD REPLY
0
Entering edit mode

Its perl v5.10.1 :)

ADD REPLY
0
Entering edit mode

Thanks for the response. With 5.14+ strictures and warnings are enabled by default so this type of code will just blow up with a bunch of warnings. With 5.10 it will compile, you are correct. Though, this is a bad thing because it will fail silently, which is something to avoid.

ADD REPLY
0
Entering edit mode

Yes, I am graciously accepting that I should add the warnings and exception handler but I didn't because I gave a workable code only !! !! :) Thanks again

ADD REPLY
0
Entering edit mode
8.6 years ago

Hi, i've noticed that the post is old, but anyway here you have a solution using BioPerl module Bio::DB::Fasta. It works using a fasta file with all your sequences and a txt file with the exact id (tab) start (tab) stop.

#!/usr/bin/perl -w

use Bio::DB::Fasta;

#Usage: extract_substring.pl file.fasta coordinates.txt (where: id, start, stop) > out.fasta

my $fasta = $ARGV[0];
my $query = $ARGV[1];
my ($id,$start,$stop);

my $db = Bio::DB::Fasta -> new($fasta);

open (IN1, $query);
  while (<IN1>) {
    ($id,$start,$stop) = split "\t";
    my $subseq = $db->subseq($id,$start,$stop);
    print ">", $id, "_", $start, "_", $stop; 
    print $subseq, "\n";
  }
close IN1;
ADD COMMENT
0
Entering edit mode
8.6 years ago

I am a Windows user.

You might try using pyfaidx: it works well on Windows.

ADD COMMENT
0
Entering edit mode
8.1 years ago

1 install samtools

2 create the index matching your genome of interest (if genome of course)

samtools faidx genome.fa

3 use a derived command of the type

samtools faidx genome.fa chr2:10000-12000

to extract the region it goes much faster than perl as it has a hash to retrieve data from the fasta file

;-)

ADD COMMENT
0
Entering edit mode

Note that the OP said "I am a Windows user". AFAIK samtools does not play nicely on Windows. However, pyfaidx works great.

pip install pyfaidx
faidx genome.fa --bed regions.bed

This seems to be what the OP wanted.

ADD REPLY

Login before adding your answer.

Traffic: 2249 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6