Biostar Beta. Not for public use.
Any ideas on converting IUPAC ambiguity codes in a fasta file
0
Entering edit mode
2.9 years ago
SaltedPork • 100
@SaltedPork32398

Hi, I have fasta files that contain ambiguity codes. And I would like to convert the sequences so that it's just ACGT and create a new entry for each possibility. i.e. Y = C or T

>sampleID
AGCTAGYACG

into

>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG

I was thinking about writing a Perl script that does that using the transliterate function (tr). Before I embark on this, has anyone already written or used some code/software that does this?

fasta iupac regex perl bash • 869 views
ADD COMMENTlink
2
Entering edit mode
2.9 years ago
st.ph.n ♦ 2.5k
@st.ph.n10041

This python code should work if you FASTA is single-line. If not you can linearize it.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                if 'Y' in seq:
                    out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
                    out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
                else:
                    out.write(header + '\n' + seq)

Save as convert.py, run as python convert.py input.fasta output.fasta

EDIT

IUPAC_ambiguity_codes.txt

M       A       C
R       A       G
W       A       T
S       C       G
Y       C       T
K       G       T
V       A       C       G
H       A       C       T
D       A       G       T
B       C       G       T
N       G       A       T       C
  
#!/usr/bin/env python
import sys
from collections import defaultdict

with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
                for line in f:
                        codes[line.strip().split('\t')[0]] = line.strip().split('\t')

with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                for c in codes:
                        if c in seq:
                                for n in range(len(codes[c])):
                                        out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))

Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.

ADD COMMENTlink
1
Entering edit mode

I am reasonably sure OP wants this to work for all valid IUPAC codes.

ADD REPLYlink
0
Entering edit mode

Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?

ADD REPLYlink
1
Entering edit mode

You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.

ADD REPLYlink
1
Entering edit mode

It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.

ADD REPLYlink
2
Entering edit mode
2.9 years ago
@Pierre Lindenbaum30

see all posible sequences from consensus

$ gcc biostars282490.c
$ ./a.out AGCTAGYACG
AGCTAGCACG
AGCTAGTACG
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3