awk to clip off telomeres
4.1 years ago
rm16 • 0

I want to use the Mac OS X terminal to clip repetitive sequences from the beginning of each sequence in a FASTA file. For example, I would like to make the following file:

>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG


become:

>seq1
ATGATCATGGATC
>seq2
ATGGCATCATTCA
>seq3
ATGTTGCTACTAG


by clipping off the CCCCAAAACCCC at the beginning of each sequence. Is there a way I can do this in the OSX terminal?

Why don't you use something like fastx? http://hannonlab.cshl.edu/fastx_toolkit/index.html

That seems pretty handy. Thank you for the tip. I'll try it out.

21 months ago
This seems like more of a sed query than an awk one. Specific solution to your query below:

sed -e 's/^CCCCAAAACCCC//g' inputfile.txt > outputfile.txt


(This removes the sequence of CCCCAAAACCCC from the beginning of any line).
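Since the question asked about awk, the same substitution could be sketched in awk as well. This is only an illustration: the file names input.fa and clipped.fa are placeholders, and it assumes each sequence sits on a single line, as in the example from the question.

```shell
# Build a small example FASTA file (input.fa is a placeholder name).
printf '>seq1\nCCCCAAAACCCCATGATCATGGATC\n>seq2\nCCCCAAAACCCCATGGCATCATTCA\n>seq3\nCCCCAAAACCCCATGTTGCTACTAG\n' > input.fa

# Header lines (starting with ">") are printed unchanged.
# On every other line, sub() deletes the motif only when it is at the start of the line.
awk '/^>/ { print; next } { sub(/^CCCCAAAACCCC/, ""); print }' input.fa > clipped.fa

cat clipped.fa
```

If the sequences are wrapped over several lines, any continuation line that happens to start with the motif would be clipped too, so unwrapping the file first would be safer.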

The general solution (which I imagine is what you really want) is harder, because you need to set criteria for what the clipped sequence can comprise. For example, if you just wanted to strip the first 12 nucleotides from each sequence, the following would work:

sed -e 's/^[ACGT]\{12\}//' inputfile.txt > outputfile.txt


...but I imagine that is too simplistic?

Either way, you should check out sed substitutions and regular expressions.
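As one possible middle ground between the two commands above, an extended regular expression can match a variable number of repeat units. The sketch below assumes the telomere is made of CCCCAAAA units optionally followed by a final CCCC; that repeat structure is a guess based on the example in the question, and the file names are placeholders.

```shell
# Example input with one repeat unit, two repeat units, and no telomere at all
# (telomeres.fa is a placeholder name).
printf '>a\nCCCCAAAACCCCATGATCATGGATC\n>b\nCCCCAAAACCCCAAAACCCCATGGCATCATTCA\n>c\nATGTTGCTACTAG\n' > telomeres.fa

# -E enables extended regular expressions (works with both BSD and GNU sed).
# /^>/! restricts the substitution to lines that are NOT headers.
# (CCCCAAAA)+ matches one or more repeat units; (CCCC)? matches an optional trailing half unit.
sed -E '/^>/!s/^(CCCCAAAA)+(CCCC)?//' telomeres.fa > trimmed.fa

cat trimmed.fa
```

Note that this would over-clip a sequence whose real content begins with CCCC directly after the repeats, so the pattern needs adjusting to the organism's actual telomere repeat.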

That sounds like a good starting place. I think I can adapt this. Thanks a lot!

22 months ago
Farbod ♦ 3.3k
Hi,

Have you tried BioAwk?

https://github.com/lh3/bioawk

18 months ago
You can also try SeqKit, using its subseq command.

To extract the subsequence from the 13th base to the last base (-1):

seqkit subseq -r 13:-1 seq.fa > out.fa
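If installing SeqKit is not an option, a similar position-based trim can be approximated with the awk that ships with the OS X terminal. This is a rough sketch, assuming single-line sequences; seq.fa and out.fa here are just example files created by the snippet itself.

```shell
# Recreate the example from the question (seq.fa is a placeholder name).
printf '>seq1\nCCCCAAAACCCCATGATCATGGATC\n>seq2\nCCCCAAAACCCCATGGCATCATTCA\n>seq3\nCCCCAAAACCCCATGTTGCTACTAG\n' > seq.fa

# Print headers unchanged; for sequence lines, keep everything from the 13th character onward.
awk '/^>/ { print; next } { print substr($0, 13) }' seq.fa > out.fa

cat out.fa
```

Unlike the pattern-based commands, this removes the first 12 bases regardless of what they are.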