Question

How can I implement RY-coding to every third nucleotide in an alignment using command line?

0

Entering edit mode

5.8 years ago

lzhou • 0

I have a FASTA alignment of 13 genes from 200 different species. How would I go about changing every third nucleotide to either R or Y using command line?

Thanks!

alignment • 2.1k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by lzhou • 0

0

Entering edit mode

Does that have to follow any rule or is it supposed to randomly set each third base to R or Y? And most importantly, what are you trying to achieve?

ADD REPLY • link 5.8 years ago by cschu181 ★ 2.8k

0

Entering edit mode

I am looking to change all purines to R and pyrimidines to Y.

ADD REPLY • link 5.8 years ago by lzhou • 0

1

Entering edit mode

If you want to change only every third base:

awk '/^>/ { print $0; next} { $1=gensub("([ACGT][ACGT])[AG]", "\\1R", "g", $1); $1=gensub("([ACGT][ACGT])[CT]", "\\1Y", "g", $1); print $1;}' your_fasta_file

If you want to change all purines/pyrimidines to R/Y:

awk '/^>/ { print $0; next} { print gensub("[AG]", "R", "g", gensub("[CT]", "Y", "g", $1)); }' your_fasta_file

ADD REPLY • link 5.8 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Hi, I tried running your command and I got the error message "awk: line 2: function gensub never defined". Is there something else I need to do to the command/or my fasta file?

ADD REPLY • link 5.8 years ago by lzhou • 0

0

Entering edit mode

Maybe you need GNU awk. Are you on a Mac by chance? Then you'd need to install the GNU tools.

ADD REPLY • link 5.8 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Try following with gnu-sed: For replacing every third base with appropriate symbol:

$ sed -e '/^>/! s/\(..\)[A,G]/\1R/Ig;s/\(..\)[T,C]/\1Y/Ig' test.fa

For replacing all the bases with appropriate symbol:

$ sed '/^>/! s/[A,G]/R/Ig;s/[T,C]/Y/Ig' test.fa

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

It appears that it is replacing nucleotides, but not every 3rd base (or it appears random to me). For example, ATG CTT AAC became ATR YTT RAY. I think the species names in the fasta sequences might be causing this problem, I see Y's and R's in the names. Can I get it to start and stop searching and replacing after each line break (there is a line break after the species name, and one at the end of the sequence)? Thank you for your help!

ADD REPLY • link 5.8 years ago by lzhou • 0

1

Entering edit mode

It is not random. It is working as per pattern. But pattern is incorrect or it must be constrained. Will post once i am done with it. Thanks. @OP

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

Thank you for your help, much appreciated!

ADD REPLY • link 5.8 years ago by lzhou • 0

0

Entering edit mode

Fasta files must be flattened (seqkit seq -w 0 input fasta for flattening fasta)

ADD REPLY • link 5.8 years ago by cpad0112 21k

1

Entering edit mode

I am trying to find the best partitioning scheme for the data set.

How replacing purines to R and pyrimidines to Y will help find the best partitioning scheme? And what do you mean by "best partitioning scheme"?

ADD REPLY • link 5.8 years ago by h.mon 35k

score 2 · Answer 1 · 2018-07-17

2

Entering edit mode

5.8 years ago

h.mon 35k

I found the reason why to replace purines by R and pyrimidines by Y, and a sweetly named script to do so (RYplace.py), here:

https://github.com/ballesterus/Utensils

ADD COMMENT • link 5.8 years ago by h.mon 35k