Question

awk to clip off telomeres

0

7.6 years ago

rm16 • 0

I want to use the Mac OS X terminal to clip repetitive sequences from the beginning of each sequence in a FASTA file. For example, I would like to make the following file:

>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG

become:

>seq1
ATGATCATGGATC
>seq2
ATGGCATCATTCA
>seq3
ATGTTGCTACTAG

by clipping off the CCCCAAAACCCC at the beginning of each sequence. Is there a way I can do this in the OSX terminal?

osx terminal awk telomere fasta • 2.4k views

ADD COMMENT • link updated 7.6 years ago by Farbod ★ 3.4k • written 7.6 years ago by rm16 • 0

1

Why don't you use something like fastx? http://hannonlab.cshl.edu/fastx_toolkit/index.html

ADD REPLY • link 7.6 years ago by Benn 8.3k

0

That seems pretty handy. Thank you for the tip. I'll try it out.

ADD REPLY • link 7.6 years ago by rm16 • 0

GouthamAtla · Answer 1 · 2016-09-02

1

7.6 years ago

coleman_jonathan ▴ 470

This seems like more of a sed query than an awk one. Specific solution to your query below:

sed -e 's/^CCCCAAAACCCC//g' inputfile.txt > outputfile.txt

(This removes the sequence of CCCCAAAACCCC from the beginning of any line).
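Since the question asked about awk, here is a minimal awk sketch of the same literal clip. It leaves header lines untouched and only clips the first sequence line after each header, so wrapped (multi-line) records are not clipped on their continuation lines. The function name clip_prefix is hypothetical, and the default motif is just the one from the question.

```shell
# clip_prefix is a hypothetical helper name; it reads FASTA on stdin.
# The motif defaults to the sequence given in the question.
clip_prefix() {
  awk -v motif="${1:-CCCCAAAACCCC}" '
    /^>/ { print; first = 1; next }        # header: print as-is, flag the next line
    first && index($0, motif) == 1 {       # motif starts at position 1 of the first sequence line
      $0 = substr($0, length(motif) + 1)   # keep everything after the motif
    }
    { first = 0; print }                   # print the (possibly clipped) sequence line
  '
}

# Small demonstration on an inline record
printf '>seq1\nCCCCAAAACCCCATGATCATGGATC\n' | clip_prefix
```

On a real file this would be run as: clip_prefix CCCCAAAACCCC < inputfile.txt > outputfile.txt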

The general solution (which I imagine is what you really want) is harder, because you need to set criteria for what the clipped sequence can contain. For example, if you just wanted to strip the first 12 nucleotides from each sequence, the following would work:

sed -e 's/^[ACGT]\{12\}//' inputfile.txt > outputfile.txt

...but I imagine that is too simplistic?

Either way, you should check out sed substitutions and regular expressions.
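As one possible criterion, the sketch below assumes the telomere is any number of full CCCCAAAA units followed by a final CCCC, which matches the example in the question but is only an assumption about the real data. It also assumes each sequence sits on a single line, because the pattern is applied to every non-header line. The function name clip_c4a4 is hypothetical.

```shell
# clip_c4a4 is a hypothetical helper name; it reads FASTA on stdin.
# Assumed criterion: zero or more CCCCAAAA units, then CCCC, at the line start.
clip_c4a4() {
  awk '
    /^>/ { print; next }                      # header lines pass through unchanged
    { sub(/^(CCCCAAAA)*CCCC/, ""); print }    # strip the leading repeat run, if any
  '
}

# Small demonstration: two full repeat units before the terminal CCCC
printf '>seq1\nCCCCAAAACCCCAAAACCCCATGATC\n' | clip_c4a4
```

Note that this pattern also removes a bare leading CCCC (zero repeat units), so the criterion would need tightening if genuine sequence can start that way.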

ADD COMMENT • link updated 7.6 years ago by GouthamAtla 12k • written 7.6 years ago by coleman_jonathan ▴ 470

0

That sounds like a good starting place. I think I can adapt this. Thanks a lot!

ADD REPLY • link 7.6 years ago by rm16 • 0

score 1 · Answer 2 · 2016-09-04

1

7.6 years ago

Farbod ★ 3.4k

Hi,

Have you tried BioAwk?

https://github.com/lh3/bioawk

ADD COMMENT • link 7.6 years ago by Farbod ★ 3.4k

score 0 · Answer 3 · 2016-09-03

0

7.6 years ago

shenwei356 8.4k

You can also try SeqKit, using its subseq command.

Take the subsequence from the 13th base to the last base (-1):

seqkit subseq -r 13:-1 seq.fa > out.fa
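If SeqKit is not installed, a similar position-based clip can be sketched in plain awk. This is only a rough equivalent: it assumes every sequence is on a single line, whereas SeqKit works on whole records. The function name clip_first_n is hypothetical.

```shell
# clip_first_n is a hypothetical helper name; it reads FASTA on stdin.
# It drops the first n bases (default 12) from every non-header line.
clip_first_n() {
  awk -v n="${1:-12}" '
    /^>/ { print; next }           # header lines pass through unchanged
    { print substr($0, n + 1) }    # keep base n+1 through the end of the line
  '
}

# Small demonstration: keep base 13 onwards
printf '>seq1\nCCCCAAAACCCCATGATCATGGATC\n' | clip_first_n 12
```

On a real file this would be run as: clip_first_n 12 < seq.fa > out.fa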

ADD COMMENT • link 7.6 years ago by shenwei356 8.4k