How can I implement RY-coding to every third nucleotide in an alignment using command line?
1
0
Entering edit mode
5.8 years ago
lzhou • 0

I have a FASTA alignment of 13 genes from 200 different species. How would I go about changing every third nucleotide to either R or Y using command line?

Thanks!

alignment • 2.1k views
ADD COMMENT
0
Entering edit mode

Does that have to follow any rule or is it supposed to randomly set each third base to R or Y? And most importantly, what are you trying to achieve?

ADD REPLY
0
Entering edit mode

I am looking to change all purines to R and pyrimidines to Y.

ADD REPLY
1
Entering edit mode

If you want to change only every third base:

awk '/^>/ { print $0; next} { $1=gensub("([ACGT][ACGT])[AG]", "\\1R", "g", $1); $1=gensub("([ACGT][ACGT])[CT]", "\\1Y", "g", $1); print $1;}' your_fasta_file

If you want to change all purines/pyrimidines to R/Y:

awk '/^>/ { print $0; next} { print gensub("[AG]", "R", "g", gensub("[CT]", "Y", "g", $1)); }' your_fasta_file
ADD REPLY
0
Entering edit mode

Hi, I tried running your command and I got the error message "awk: line 2: function gensub never defined". Is there something else I need to do to the command/or my fasta file?

ADD REPLY
0
Entering edit mode

Maybe you need GNU awk. Are you on a Mac by chance? Then you'd need to install the GNU tools.

ADD REPLY
0
Entering edit mode

Try following with gnu-sed: For replacing every third base with appropriate symbol:

$ sed -e '/^>/! s/\(..\)[A,G]/\1R/Ig;s/\(..\)[T,C]/\1Y/Ig' test.fa

For replacing all the bases with appropriate symbol:

$ sed '/^>/! s/[A,G]/R/Ig;s/[T,C]/Y/Ig' test.fa
ADD REPLY
0
Entering edit mode

It appears that it is replacing nucleotides, but not every 3rd base (or it appears random to me). For example, ATG CTT AAC became ATR YTT RAY. I think the species names in the fasta sequences might be causing this problem, I see Y's and R's in the names. Can I get it to start and stop searching and replacing after each line break (there is a line break after the species name, and one at the end of the sequence)? Thank you for your help!

ADD REPLY
1
Entering edit mode

It is not random. It is working as per pattern. But pattern is incorrect or it must be constrained. Will post once i am done with it. Thanks. @OP

ADD REPLY
0
Entering edit mode

Thank you for your help, much appreciated!

ADD REPLY
0
Entering edit mode

Fasta files must be flattened (seqkit seq -w 0 input fasta for flattening fasta)

ADD REPLY
1
Entering edit mode

I am trying to find the best partitioning scheme for the data set.

How replacing purines to R and pyrimidines to Y will help find the best partitioning scheme? And what do you mean by "best partitioning scheme"?

ADD REPLY
2
Entering edit mode
5.8 years ago
h.mon 35k

I found the reason why to replace purines by R and pyrimidines by Y, and a sweetly named script to do so (RYplace.py), here:

https://github.com/ballesterus/Utensils

ADD COMMENT

Login before adding your answer.

Traffic: 1861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6