awk to clip off telomeres
7.6 years ago
rm16 • 0

I want to use the Mac OS X terminal to clip repetitive sequences from the beginning of each sequence in a FASTA file. For example, I would like to make the following file:

>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG

become:

>seq1
ATGATCATGGATC
>seq2
ATGGCATCATTCA
>seq3
ATGTTGCTACTAG

by clipping off the CCCCAAAACCCC at the beginning of each sequence. Is there a way I can do this in the OSX terminal?

osx terminal awk telomere fasta • 2.4k views

Why don't you use something like fastx? http://hannonlab.cshl.edu/fastx_toolkit/index.html


That seems pretty handy. Thank you for the tip. I'll try it out.

7.6 years ago

This seems like more of a sed query than an awk one. Specific solution to your query below:

sed -e 's/^CCCCAAAACCCC//g' inputfile.txt > outputfile.txt

(This removes CCCCAAAACCCC from the beginning of any line that starts with it. Header lines begin with > and so are left untouched.)
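Since the question asked for awk, here is a rough awk equivalent of the sed one-liner above. It is only a sketch: the temporary file and the variable names (tmp_fa, clipped, first) are made up for illustration, and it assumes the prefix should only be clipped from the first sequence line after each header, which also covers FASTA files whose sequences are wrapped over several lines.

```shell
# Hypothetical sample input, written to a temp file so the sketch is self-contained.
tmp_fa=$(mktemp)
printf '>seq1\nCCCCAAAACCCCATGATCATGGATC\n>seq2\nCCCCAAAACCCCATGGCATCATTCA\n' > "$tmp_fa"

# Headers are printed unchanged and set a flag.
# The first sequence line after a header has the prefix removed (if present).
# Every other line is printed as-is.
clipped=$(awk '
  /^>/  { print; first = 1; next }
  first { sub(/^CCCCAAAACCCC/, ""); first = 0 }
        { print }
' "$tmp_fa")

printf '%s\n' "$clipped"
rm -f "$tmp_fa"
```

With a real file you would drop the temp-file lines and redirect the awk output, e.g. awk '...' inputfile.txt > outputfile.txt.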

The general solution (which I imagine is what you really want) is harder because you need to set criteria for what the sequence to be removed can comprise. For example, if you just wanted to strip the first 12 nucleotides from each of your sequences, the following would work:

sed -e 's/^[ACGT]\{12\}//' inputfile.txt > outputfile.txt

...but I imagine that is too simplistic?
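If the number of repeat units varies between sequences, one possible criterion is sketched below. This is an assumption on my part, not something taken from your data: it treats the telomere as one or more whole CCCCAAAA units followed by up to four more C's, and it will also eat genuine C's that happen to follow the repeat directly. The temp file and variable names are hypothetical.

```shell
# Hypothetical input: two repeat units, one repeat unit, and no telomere at all.
tmp_fa=$(mktemp)
printf '>a\nCCCCAAAACCCCAAAACCCCATGAAA\n>b\nCCCCAAAACCCCATGGCA\n>c\nATGCCC\n' > "$tmp_fa"

# -E enables extended regular expressions (supported by both macOS and GNU sed).
# /^>/! applies the substitution only to lines that are not headers.
# Assumed telomere pattern: (CCCCAAAA)+ followed by zero to four C's.
stripped=$(sed -E '/^>/!s/^(CCCCAAAA)+C{0,4}//' "$tmp_fa")

printf '%s\n' "$stripped"
rm -f "$tmp_fa"
```

Sequences that do not start with at least one full CCCCAAAA unit are left alone, so it is worth checking a few records by eye before trusting the pattern on a whole genome.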

Either way, you should check out sed substitutions and regular expressions.


That sounds like a good starting place. I think I can adapt this. Thanks a lot!

7.6 years ago
Farbod ★ 3.4k

Hi,

Have you tried BioAwk?

https://github.com/lh3/bioawk

7.6 years ago

You can also try SeqKit and its subseq command.

To keep everything from the 13th base to the last base (-1):

seqkit subseq -r 13:-1 seq.fa > out.fa