Remove repetitive sequence of variable length from reads
7.6 years ago
rm16 • 0

I am working with a FASTA file in which each read contains a repetitive sequence of variable length at the 5' end. For instance, in the file below:

>seq1
CCCCAAAACCCCAAAACCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
>seq3
CCCCAAAACCCCAAAATATGTTGCTACTAG

I would like to remove the repetitive sequence of C's and A's from the 5' end of each read, but whatever solution I use should take into account that there may be any number of repetitive units, including a repetitive C block without a subsequent A block (see "seq2" above).

If this can be done in the Mac OSX command line, that would be optimal. I am also interested in software packages that may be able to accomplish this. Thank you for any help you can offer.

fasta sequencing osx • 1.8k views
7.6 years ago
igor 13k

I think fastx_clipper can do this. I am not sure about how it treats repeats. If it only removes the first one, I suppose you could just run it multiple times. Docs: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

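Of course, there is always the sed solution, but then you need to be careful about the formatting of your FASTA file (sequences may span multiple lines, for example). Note that the repeat unit has to be grouped with \( \) rather than square brackets, which would define a character class: sed 's/^\(CCCAAA\)*//' file.fasta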

While sed -i is useful, it will overwrite the original data, which may not be what the OP wants. I advise against using -i unless you are sure that your command is the right one and you no longer need the original file.

In addition, this sed will not remove CCCCCAAAAACCCCCAAAAA completely...

I suppose the -i is arguable, but I removed it. It was really meant as a suggestion. I assume people don't just copy and paste random commands from the internet, but that's a big assumption.

And fixed the pattern. I forgot the ^ was there.

I was able to figure it out using extended regular expressions and just running a couple of different scripts to make sure I removed every repetitive instance:

sed -E 's/^CCCCAAAA*//g' file.fasta

Thanks a lot, everyone.

The * would only apply to the previous character (A), not the entire string.

Note: -E works on BSD sed; on GNU sed the traditional equivalent is -r, although recent GNU sed versions also accept -E.
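
For anyone who would rather not depend on sed dialects or single-line FASTA formatting, here is a minimal Python sketch of the same idea. It assumes the repeat looks like the example reads in the question: zero or more complete CCCCAAAA units, optionally followed by one CCCC block. The names read_fasta, trim_repeat and REPEAT_5PRIME are made up for this illustration, and the pattern will need adjusting if your real repeat unit differs. A read whose genuine sequence happens to start with the same motif would be over-trimmed, so spot-check the output.

```python
import re

# Assumption (from the example reads in the question): the 5' repeat is
# zero or more complete CCCCAAAA units, optionally followed by one CCCC block.
# The (?:...) group makes * apply to the whole unit, not just the last base.
REPEAT_5PRIME = re.compile(r"^(?:CCCCAAAA)*(?:CCCC)?")


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_repeat(seq):
    """Remove the assumed repeat from the 5' end; other reads are returned unchanged."""
    return REPEAT_5PRIME.sub("", seq, count=1)
```

To use it, loop over the records and write to a new file instead of editing in place, for example: open "file.fasta", call read_fasta on the file handle, and print each header followed by trim_repeat(sequence). With the three reads from the question this leaves GATGATCATGGATC, GATGGCATCATTCA and TATGTTGCTACTAG.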
