I am working with a FASTA file in which each read contains a repetitive sequence of variable length at the 5' end. For instance, in the file below:
>seq1
CCCCAAAACCCCAAAACCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
>seq3
CCCCAAAACCCCAAAATATGTTGCTACTAG
I would like to remove the repetitive sequence of C's and A's from the 5' end of each read, but whatever solution I use should take into account that there may be any number of repetitive units, including a repetitive C block without a subsequent A block (see "seq2" above).
If this can be done in the Mac OSX command line, that would be optimal. I am also interested in software packages that may be able to accomplish this. Thank you for any help you can offer.
While `sed -i` is useful, it will destroy the original data, which may not be what the OP wants. I advise against using `-i` unless you are sure that your command is the right one and you no longer need the original file. In addition, this sed command will not remove CCCCCAAAAACCCCCAAAAA completely...
I suppose the `-i` is arguable, but I removed it. It was really meant as a suggestion. I assume people don't just copy and paste random commands from the internet, but that's a big assumption. I also fixed the pattern; I forgot the `^` was there.
I was able to figure it out using extended regular expressions and just running a couple of different scripts to make sure I removed every repetitive instance:
Thanks a lot, everyone.
The `*` would only apply to the previous character (A), not the entire string.
Note: `-E` works on BSD sed, but it would be `-r` on GNU sed.
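For anyone landing here later, the points above (anchor the pattern, group the whole repeat unit so the quantifier applies to it, and don't overwrite the input) can be put together roughly as follows. This is only a sketch, not the command the OP ended up with: it assumes each sequence sits on a single line and that the repeat unit is exactly CCCCAAAA, as in the example reads, and `strip_repeats` is a made-up helper name. It uses `-E`, which BSD sed on macOS supports and which recent GNU sed versions also accept.

```shell
# Sketch only: assumes one sequence per line and a repeat unit of CCCCAAAA,
# as in the example reads. strip_repeats is a hypothetical helper name.
strip_repeats() {
  # Leave header lines (starting with ">") untouched. On sequence lines,
  # remove zero or more CCCCAAAA units anchored at the 5' end, followed by
  # an optional CCCC block that has no A block after it.
  sed -E '/^>/!s/^(CCCCAAAA)*(CCCC)?//' "$@"
}

# Demo on the three reads from the question.
printf '%s\n' \
  '>seq1' 'CCCCAAAACCCCAAAACCCCGATGATCATGGATC' \
  '>seq2' 'CCCCAAAACCCCGATGGCATCATTCA' \
  '>seq3' 'CCCCAAAACCCCAAAATATGTTGCTACTAG' |
  strip_repeats
```

On a real file, something like `strip_repeats reads.fa > reads.trimmed.fa` (file names are placeholders) writes the result to a new file and leaves the original alone. If the repeat blocks can vary in length, as in the CCCCCAAAAACCCCCAAAAA case mentioned above, the pattern would need to be loosened accordingly.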