Question

Extracting the coordinates for a specific base from multifasta

0

Entering edit mode

5.4 years ago

Louis Kok ▴ 30

Hi, I would like to extract coordinates for a specific base, let say all "A" base from multi-fasta. The way I I used to extract from a single sequence (without header) is as below:

grep -aob "A"  input.seq | sed 's/:/ /g' | awk '{print $1+1}'

Is there an easier way to deal with large and multi fasta? Thanks.

sequence fasta • 1.5k views

ADD COMMENT • link updated 5.4 years ago by Joe 21k • written 5.4 years ago by Louis Kok ▴ 30

score 3 · Answer 1 · 2018-11-28

linearize, read each pair of name/sequence; print the name ; echo the sequence, convert each base to a new line with grep -o ., get the line number with nl, grep the lines ending with 'A'.

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
while IFS=$'\t' read A B; do echo $A && echo $B | grep -o . | nl | grep 'A$' ;done

>seq1
     5  A
    12  A
    14  A
    18  A
    23  A
    27  A
    30  A
    32  A
    44  A
>seq2
     7  A
     8  A
    10  A
    14  A
    18  A
    19  A
    23  A
    26  A
    28  A
    34  A
    40  A
    43  A

score 3 · Answer 2 · 2018-11-28

using seqkit:

$ seqkit locate -ip "a" test.fa 

seqID   patternName pattern strand  start   end matched
seq1    a   a   +   1   1   a
seq1    a   a   +   7   7   a
seq1    a   a   +   11  11  a
seq1    a   a   -   10  10  a
seq1    a   a   -   4   4   a
seq2    a   a   +   1   1   a
seq2    a   a   +   7   7   a
seq2    a   a   +   11  11  a
seq2    a   a   -   10  10  a
seq2    a   a   -   4   4   a

$ cat test.fa 
>seq1
agctggagctacc
>seq2
agctggagctacc

score 1 · Answer 3 · 2018-11-28

A Python option:

# Define a test sequence
>>> string = 'ATAGCTACGTACGTACGTACGTACGTACGA'

# A function for finding all substrings
>>> find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# Gives the following indices for 'A' characters
>>> find_all(string, 'A')
[0, 2, 6, 10, 14, 18, 22, 26, 29]

# To apply it to all sequences in a multifasta:
>>> from Bio import SeqIO
>>> seqs = list(SeqIO.parse('seqs.fa','fasta'))
>>> [find_all(str(s.seq), 'A') for s in seqs]
[[80, 85, 92, 97, 98, 101, 103, 119, 121, 122, 125, 127, 129, 130, 148],    # First seq
 [86, 98, 103, 104, 107, 125, 127, 128, 131, 133, 135, 136, 154],           # Second seq, etc.
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 121, 123, 124, 142], 
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 120, 121, 123, 124, 142]]

(Input data used)

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC