Biostar Beta. Not for public use.
Any ideas on converting IUPAC ambiguity codes in a fasta file
0
Entering edit mode
18 months ago
SaltedPork • 100

Hi, I have fasta files that contain ambiguity codes. And I would like to convert the sequences so that it's just ACGT and create a new entry for each possibility. i.e. Y = C or T

>sampleID
AGCTAGYACG

into

>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG

I was thinking about writing a Perl script that does that using the transliterate function (tr). Before I embark on this, has anyone already written or used some code/software that does this?

ADD COMMENTlink
2
2
Entering edit mode
20 months ago
st.ph.n ♦ 2.5k
Philadelphia, PA

This python code should work if you FASTA is single-line. If not you can linearize it.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                if 'Y' in seq:
                    out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
                    out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
                else:
                    out.write(header + '\n' + seq)

Save as convert.py, run as python convert.py input.fasta output.fasta

EDIT

IUPAC_ambiguity_codes.txt

M       A       C
R       A       G
W       A       T
S       C       G
Y       C       T
K       G       T
V       A       C       G
H       A       C       T
D       A       G       T
B       C       G       T
N       G       A       T       C
  
#!/usr/bin/env python
import sys
from collections import defaultdict

with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
                for line in f:
                        codes[line.strip().split('\t')[0]] = line.strip().split('\t')

with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                for c in codes:
                        if c in seq:
                                for n in range(len(codes[c])):
                                        out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))

Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.

ADD COMMENTlink
1
Entering edit mode

I am reasonably sure OP wants this to work for all valid IUPAC codes.

ADD REPLYlink
0
Entering edit mode

Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?

ADD REPLYlink
1
Entering edit mode

You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.

ADD REPLYlink
1
Entering edit mode

It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.

ADD REPLYlink
2
Entering edit mode
18 months ago
France/Nantes/Institut du Thorax - INSE…

see all posible sequences from consensus

$ gcc biostars282490.c
$ ./a.out AGCTAGYACG
AGCTAGCACG
AGCTAGTACG
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1