awk to clip off telomeres
0
4.1 years ago
rm16 • 0

I want to use the Mac OS X terminal to clip a repetitive sequence from the beginning of each sequence in a FASTA file. For example, I would like to make the following file:

>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG


become:

>seq1
ATGATCATGGATC
>seq2
ATGGCATCATTCA
>seq3
ATGTTGCTACTAG


by clipping off the CCCCAAAACCCC at the beginning of each sequence. Is there a way I can do this in the OSX terminal?

1

Why don't you use something like fastx? http://hannonlab.cshl.edu/fastx_toolkit/index.html

0

That seems pretty handy. Thank you for the tip. I'll try it out.

1
21 months ago
European Union

This seems like more of a sed query than an awk one. Specific solution to your query below:

sed -e 's/^CCCCAAAACCCC//g' inputfile.txt > outputfile.txt


(This removes the sequence of CCCCAAAACCCC from the beginning of any line).
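Since the title asks for awk, here is a rough awk equivalent. It is only a sketch: it assumes each sequence sits on a single line, and the file names and the `prefix` variable are placeholders of mine, not anything from the question. Header lines (starting with `>`) are passed through untouched, and the prefix is removed only when a sequence line actually starts with it.

```shell
# Build a small test file matching the example in the question
cat > seqs.fa <<'EOF'
>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG
EOF

# Print headers unchanged; on sequence lines, drop the prefix
# only if the line begins with it (index() returns 1 in that case)
awk -v prefix="CCCCAAAACCCC" '
    /^>/ { print; next }
    index($0, prefix) == 1 { $0 = substr($0, length(prefix) + 1) }
    { print }
' seqs.fa > clipped.fa

cat clipped.fa
```

Passing the prefix in with `-v` means a different sequence can be clipped without editing the awk program itself.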

The general solution (which I imagine is what you really want) is harder because you need to set criteria for what the sequence can comprise. For example, if you just wanted to strip the first 12 nucleotides from your sequences, the following would work:

sed -e 's/^[ACGT]\{12\}//' inputfile.txt > outputfile.txt


...but I imagine that is too simplistic?
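One step up from a fixed length would be to strip a variable number of repeat units. The sketch below assumes the telomere is built from the unit CCCCAAAA, optionally ending in a partial CCCC. That unit is my guess from the example in the question, not something the question states, and the file names are placeholders. It uses extended regular expressions (`sed -E`) and skips header lines.

```shell
# Test file: one repeat unit, two repeat units, and no telomere at all
cat > telo.fa <<'EOF'
>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCAAAACCCCATGGCATCATTCA
>seq3
ATGTTGCTACTAG
EOF

# On non-header lines, remove one or more CCCCAAAA units
# followed by an optional partial unit CCCC (assumed repeat structure)
sed -E '/^>/!s/^(CCCCAAAA)+(CCCC)?//' telo.fa > telo_clipped.fa

cat telo_clipped.fa
```

Sequences that do not start with the repeat are left unchanged, so this is safer than removing a fixed number of bases from every line.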

Either way, you should check out sed substitutions and regular expressions.

0

That sounds like a good starting place. I think I can adapt this. Thanks a lot!

1
22 months ago
Farbod ♦ 3.3k
Toronto

Hi,

Have you tried BioAwk?

https://github.com/lh3/bioawk

0
18 months ago
China

You can also try SeqKit, using its subseq command.

To take the subsequence from the 13th base to the last base (-1):

seqkit subseq -r 13:-1 seq.fa > out.fa
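If SeqKit is not installed, roughly the same positional clipping can be sketched in plain awk. Unlike a line-by-line substitution, this version first joins sequences that are wrapped over several lines, so the 13th base is counted per record rather than per line. The file names and the helper function `emit` are placeholders of mine, and the output is written one sequence per line rather than re-wrapped.

```shell
# Test file with sequences wrapped at 10 bases per line
cat > wrapped.fa <<'EOF'
>seq1
CCCCAAAACC
CCATGATCAT
GGATC
>seq2
CCCCAAAACC
CCATGGCATC
ATTCA
EOF

# Collect each record's lines into one string, then print
# the header followed by the sequence from base 13 onward
awk '
    function emit() { if (name != "") { print name; print substr(seq, 13) } }
    /^>/ { emit(); name = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
' wrapped.fa > out.fa

cat out.fa
```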