How to find start and end position of sequence from consensus fasta file?
1
0
Entering edit mode
19 months ago
沛煒 • 0

Hello

I have performed consensus function in samtools and got the consensus fasta sequence from BAM file, and missing positions were replaced with 'N'.

I would like to find my target sequence's start and end position. My sequence is like

>chr1
NNNNAGTATANNNTATGNNNNN

all the heads and ends are N, and my target sequence is in the middle, while some N could be found in the target sequence, I only want to find the start and end of the position where the sequence is not 'N'.

For example, in my case, the start and end position is chr1:5-17 (including three N in the middle of the target sequence).

I have about 1,000 sequences like my example in the same fasta file, Does anyone know how to find the start and end position for each sequence?

Thanks a lot

fasta samtools • 918 views
ADD COMMENT
0
Entering edit mode

Maybe map the sequences with some aligner like gmap, and then convert the output bam file to bed so that you get the mapping coordinates of your sequence?

ADD REPLY
1
Entering edit mode
18 months ago
5heikki 11k

It's assumed here that your fasta file is linear (no linebreaks in sequences). Ugly, but should work..

cat file
>chr1
NNNNAGTATANNNTATGNNNNN
>chr2
NNNNNNNNAGTATANNNTATGNNNNN

paste -d $'\t' \
    <(paste -d $'\t' - - < file) \
    <(paste -d $'\t' \
        <(paste -d $'\t' - - < file | awk 'BEGIN{FS="\t"}{print $2}' | awk 'BEGIN{FS="A|G|C|T"}{print length($1)}') \
        <(paste -d $'\t' - - < file | awk 'BEGIN{FS="\t"}{print $2}' | rev | awk 'BEGIN{FS="A|G|C|T"}{print length($1)}')) \
    | awk 'BEGIN{FS="\t"}{print substr($1,2)":"$3+1"-"length($2)-$4}'
chr1:5-17
chr2:9-21
ADD COMMENT
0
Entering edit mode

That works! Thanks for your help

ADD REPLY

Login before adding your answer.

Traffic: 2736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6