Biostar Beta. Not for public use.
Any ideas on converting IUPAC ambiguity codes in a fasta file
0
Entering edit mode
18 months ago
SaltedPork • 100

Hi, I have fasta files that contain ambiguity codes. And I would like to convert the sequences so that it's just ACGT and create a new entry for each possibility. i.e. Y = C or T

>sampleID
AGCTAGYACG


into

>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG


I was thinking about writing a Perl script that does that using the transliterate function (tr). Before I embark on this, has anyone already written or used some code/software that does this?

2
2
Entering edit mode
20 months ago
st.ph.n ♦ 2.5k

This python code should work if you FASTA is single-line. If not you can linearize it.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
with open(sys.argv[2], 'w') as out:
for line in f:
if line.startswith(">"):
seq = next(f).strip()
if 'Y' in seq:
out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
else:


Save as convert.py, run as python convert.py input.fasta output.fasta

EDIT

IUPAC_ambiguity_codes.txt

M       A       C
R       A       G
W       A       T
S       C       G
Y       C       T
K       G       T
V       A       C       G
H       A       C       T
D       A       G       T
B       C       G       T
N       G       A       T       C

#!/usr/bin/env python
import sys
from collections import defaultdict

with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
codes = defaultdict(list)
for line in amb:
for line in f:
codes[line.strip().split('\t')[0]] = line.strip().split('\t')

with open(sys.argv[1], 'r') as f:
with open(sys.argv[2], 'w') as out:
for line in f:
if line.startswith(">"):
seq = next(f).strip()
for c in codes:
if c in seq:
for n in range(len(codes[c])):
out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))


Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.

1
Entering edit mode

I am reasonably sure OP wants this to work for all valid IUPAC codes.

0
Entering edit mode

Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?

1
Entering edit mode

You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.

1
Entering edit mode

It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.

2
Entering edit mode
18 months ago
France/Nantes/Institut du Thorax - INSE…
$gcc biostars282490.c$ ./a.out AGCTAGYACG
AGCTAGCACG
AGCTAGTACG