2.9 years ago
@st.ph.n10041
This python code should work if you FASTA is single-line. If not you can linearize it.
#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
with open(sys.argv[2], 'w') as out:
for line in f:
if line.startswith(">"):
header = line.strip()
seq = next(f).strip()
if 'Y' in seq:
out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
else:
out.write(header + '\n' + seq)
Save as convert.py
, run as python convert.py input.fasta output.fasta
EDIT
IUPAC_ambiguity_codes.txt
M A C
R A G
W A T
S C G
Y C T
K G T
V A C G
H A C T
D A G T
B C G T
N G A T C
#!/usr/bin/env python
import sys
from collections import defaultdict
with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
codes = defaultdict(list)
for line in amb:
for line in f:
codes[line.strip().split('\t')[0]] = line.strip().split('\t')
with open(sys.argv[1], 'r') as f:
with open(sys.argv[2], 'w') as out:
for line in f:
if line.startswith(">"):
header = line.strip()
seq = next(f).strip()
for c in codes:
if c in seq:
for n in range(len(codes[c])):
out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))
Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.
I am reasonably sure OP wants this to work for all valid IUPAC codes.
Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?
You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.
It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.