Question

How to find start and end position of sequence from consensus fasta file?

0

Entering edit mode

19 months ago

沛煒 • 0

Hello

I have performed consensus function in samtools and got the consensus fasta sequence from BAM file, and missing positions were replaced with 'N'.

I would like to find my target sequence's start and end position. My sequence is like

>chr1
NNNNAGTATANNNTATGNNNNN

all the heads and ends are N, and my target sequence is in the middle, while some N could be found in the target sequence, I only want to find the start and end of the position where the sequence is not 'N'.

For example, in my case, the start and end position is chr1:5-17 (including three N in the middle of the target sequence).

I have about 1,000 sequences like my example in the same fasta file, Does anyone know how to find the start and end position for each sequence?

Thanks a lot

fasta samtools • 918 views

ADD COMMENT • link 18 months ago by 沛煒 • 0

0

Entering edit mode

Maybe map the sequences with some aligner like gmap, and then convert the output bam file to bed so that you get the mapping coordinates of your sequence?

ADD REPLY • link 19 months ago by iraun 6.2k

score 1 · Answer 1 · 2022-10-08

It's assumed here that your fasta file is linear (no linebreaks in sequences). Ugly, but should work..

cat file
>chr1
NNNNAGTATANNNTATGNNNNN
>chr2
NNNNNNNNAGTATANNNTATGNNNNN

paste -d $'\t' \
    <(paste -d $'\t' - - < file) \
    <(paste -d $'\t' \
        <(paste -d $'\t' - - < file | awk 'BEGIN{FS="\t"}{print $2}' | awk 'BEGIN{FS="A|G|C|T"}{print length($1)}') \
        <(paste -d $'\t' - - < file | awk 'BEGIN{FS="\t"}{print $2}' | rev | awk 'BEGIN{FS="A|G|C|T"}{print length($1)}')) \
    | awk 'BEGIN{FS="\t"}{print substr($1,2)":"$3+1"-"length($2)-$4}'
chr1:5-17
chr2:9-21