How to replace missing characters with "?" at the beginning and at the end of the sequence while not touching true gaps in the middle?
8.3 years ago
sturcoal • 0

Suppose I've got the sequence:

ATGCGTTATTGCATGTAGCA--------ATGGCATTACGATCCA-----CCAGGTAC

where the "-" characters represent true indels.

And after alignment it looks like:

----ATGCGTTATTGCATGTAGCA--------ATGGCATTACGATCCA-----CCAGGTAC------------

I've loaded it into R using read.FASTA, and I can now extract any sequence as a vector.

My question is:

How to replace each "-" character with "?" at the beginning and at the end of the sequence while not touching true indels in the middle using R?

Thank you!

R sequence indels DNAbin • 1.8k views
8.3 years ago

You will need to use regular expressions. The difficulty in your question comes from the varying length of the replacement text ("????...") for the patterns ("----...").

Anyway, there are one-liners to do that in a terminal using sed or awk (see here), but it is also possible in R with the following patterns:

^-+ to get the "-" characters at the start of each sequence.

-+$ to get the "-" characters at the end of each sequence.

You should probably use the regexpr() function first to get the length of each match (it is returned in the "match.length" attribute). Then use the gsub() function to replace each match with as many "?" characters as the precalculated match length.
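As a minimal sketch of this idea, assuming each sequence is held as a single character string (a vector of single characters can be collapsed first with paste(x, collapse = "")): the helper name mask_terminal_gaps is hypothetical, and it swaps gsub() for substring() plus strrep(), because the replacement passed to gsub() is not vectorised over sequences. It uses regexpr() with the patterns ^-+ and -+$ to measure the terminal runs, then rebuilds each string with the same number of "?" characters.

```r
# Hypothetical helper (not part of ape or base R): replace the leading and
# trailing runs of "-" with the same number of "?", leaving internal gaps alone.
mask_terminal_gaps <- function(x) {
  # Length of the leading run of "-"; regexpr() reports -1 when there is no match
  lead <- pmax(attr(regexpr("^-+", x), "match.length"), 0)
  x <- paste0(strrep("?", lead), substring(x, lead + 1))

  # Length of the trailing run of "-", measured after the leading run is masked
  trail <- pmax(attr(regexpr("-+$", x), "match.length"), 0)
  paste0(substring(x, 1, nchar(x) - trail), strrep("?", trail))
}

# Example built from the sequence in the question, padded with terminal gaps
core    <- "ATGCGTTATTGCATGTAGCA--------ATGGCATTACGATCCA-----CCAGGTAC"
aligned <- paste0(strrep("-", 4), core, strrep("-", 12))
masked  <- mask_terminal_gaps(aligned)
print(masked)
```

The function is vectorised, so it can be applied to a character vector holding all aligned sequences at once. strrep() requires R 3.3.0 or later.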
