How to extract a sequence from a multi-fasta file based on its position?
5
0
Entering edit mode
4.9 years ago
Kumar ▴ 120

Hi all,

I have 6 sequences in a fasta file as shown below,

>seq1
ATGAT
>gen1
TAGTA
>org2
TATAA
>seq7
ATGCA

I would like to extract a sequence based on its position, for example, I need to extract 3rd position fasta sequence,

 >org2
TATAA

I can use samtools faidx command for fasta header/id based sequence extraction. But, here I need to extract based on its position. Please help me to do the same.

Thanks in advance

fasta perl python shell-script • 3.0k views
ADD COMMENT
1
Entering edit mode

with sed and grep (assuming that fasta is linearized) with OP fasta:

$ sed -n '1~2p' test.fa | sed -n 3p | grep -A 1 -f - test.fa
>org2
TATAA
ADD REPLY
2
Entering edit mode
4.9 years ago
AK ★ 2.2k

Hi Dineshkumar K,

Try:

samtools faidx test.fa "$(sed -n 3p <(cut -f1 test.fa.fai))"

Or both:

seqkit grep -p "$(sed -n 3p <(grep '^>' test.fa | sed 's/^>//g'))" test.fa
seqkit fx2tab test.fa | awk 'NR=="3"{print}' | seqkit tab2fx
ADD COMMENT
0
Entering edit mode

Thank you SMK, It works fine.

ADD REPLY
2
Entering edit mode
4.9 years ago
h.mon 35k

Pure awk solution

cat miRNA_corn.fa \
  | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \
              {printf("%s",$0);} \
         END {printf("\n");}' \
  | awk '{if ( NR==3 ) print $0;}' \
  | awk '{print $1"\n"$2}'

Steps consist of

  1. linearize fasta file, separating header and sequence by tabs

  2. print specific line number

  3. convert the sequences from tabular to fasta

ADD COMMENT
1
Entering edit mode
4.9 years ago
ATpoint 81k

Using a mix of bash and samtools:

function ExtractByPosition {

  FASTA=$1
  POSITION=$2

  if [[ ! -e ${FASTA}.fai ]]; then samtools faidx $FASTA; fi

  grep '>' $FASTA \
    | head -n $POSITION \
    | tail -n 1 \
    | awk '{gsub(">","");print}' \
    | xargs samtools faidx $FASTA

}

ExtractByPosition your.fasta 3
ADD COMMENT
1
Entering edit mode
4.9 years ago
michau ▴ 60

python with Biopython

from Bio import AlignIO

N = 3           # position of seq you want to retreive
with open('yourALN.fa' 'r') as infile:
    aln = AlignIO.read(infile, 'fasta')
    print(aln[N-1])
ADD COMMENT
1
Entering edit mode

Just a comment to point out that you don't need to use open() on files in biopython, that's handled internally by the IO methods. OP also isn't using an aligned format (or at least hasn't said they are, and there example data doesn't suggest this), so SeqIO is probably the right choice.

So, this code would probably be better as:

from Bio import SeqIO

n = 3
for i, rec in enumerate(SeqIO.parse('test.fa', 'fasta')):
    if i == n:
        print(">{}\n{}".format(rec.description, rec.seq))
ADD REPLY
1
Entering edit mode
4.9 years ago
Joe 21k

Just for sadism completeness, here's a handy one-line version for extracting any position from a fasta, based on my comment.

python3 -c 'import sys;from Bio import SeqIO; [print(f">{rec.description}\n{rec.seq}") for i, rec in enumerate(SeqIO.parse(sys.argv[1],"fasta")) if i == int(sys.argv[2])-1 ];' test.fa 1

Where n is the sequence you want.

Note, use of fstrings and print-in-list-comprehension will require Python >= 3.6

ADD COMMENT

Login before adding your answer.

Traffic: 1564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6