Biopython: using SeqIO.write() to write a dictionary object to a FASTA file
5.9 years ago

Hello,

I am trying to write a dictionary object to a FASTA file, but I cannot get it to work, either with the library (Biopython) or without it.

I tried converting my dictionary to a list using "dict.items()" and then writing it with SeqIO, and the error is:

"AttributeError: 'tuple' object has no attribute 'id'"

I would appreciate any kind of help. Thanks in advance!

WM

biopython sequence • 14k views
5.9 years ago
Eric Lim ★ 2.1k

The error is clear. dict.items() gives you plain (key, value) tuples, and a tuple has no id attribute, which SeqIO.write needs on every record it writes. Typically, you'd give it SeqRecord objects, each of which wraps a Seq object and carries attributes like id and description. You can definitely turn your dictionary into a list of SeqRecord objects, and everything else should work.
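As a minimal sketch of that conversion, assuming the dictionary maps sequence names to plain sequence strings (the dictionary name and contents here are made up for illustration), each pair can be wrapped in a SeqRecord before writing. The output goes to an in-memory handle only to keep the example self-contained; a filename works the same way.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical plain dictionary: sequence name -> sequence string
seqs = {"myseq1": "acgt", "myseq2": "gctc"}

# Wrap each key/value pair in a SeqRecord so SeqIO.write finds .id and .seq.
# description="" keeps the default placeholder description out of the header.
records = [
    SeqRecord(Seq(sequence), id=name, description="")
    for name, sequence in seqs.items()
]

# In-memory handle for the sketch; pass a filename such as "seq.fa" instead
handle = StringIO()
count = SeqIO.write(records, handle, "fasta")  # returns the number of records written
fasta_text = handle.getvalue()
print(fasta_text)
```

Passing dict.items() directly fails because each item is a tuple; the list comprehension is the step that turns those tuples into records.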

There are quite a few other ways to convert a dictionary to the many formats SeqIO supports. The easiest (requiring the least programming experience) is to simply write your dictionary to a tab-delimited file and use SeqIO.convert.

See below for an example.

from Bio import SeqIO

a = {'myseq1':'acgt', 'myseq2': 'gctc'}

# try writing your own code to turn this dictionary into a tab-delimited file (seq.tab), i.e.
# myseq1  acgt
# myseq2  gctc

SeqIO.convert('seq.tab', 'tab', 'seq.fa', 'fasta')
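For anyone who wants to check their own attempt, here is one possible way to do the tab-delimited step end to end. It is only a sketch: the temporary directory is an assumption made to keep the example self-contained, and you would normally use your own paths.

```python
import os
import tempfile

from Bio import SeqIO

a = {"myseq1": "acgt", "myseq2": "gctc"}

# A temporary directory keeps the sketch self-contained; use real paths in practice
workdir = tempfile.mkdtemp()
tab_path = os.path.join(workdir, "seq.tab")
fasta_path = os.path.join(workdir, "seq.fa")

# One record per line: identifier, a tab, then the sequence
with open(tab_path, "w") as handle:
    for name, sequence in a.items():
        handle.write(f"{name}\t{sequence}\n")

# SeqIO.convert returns the number of records it converted
converted = SeqIO.convert(tab_path, "tab", fasta_path, "fasta")

with open(fasta_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```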
5.9 years ago
Joe 21k

All you need to do is tell it to write the dictionary values in the SeqIO.write() call, i.e.:

Input file (seqs.fa):

>RandomSequence_3c0u91QYYaQ6aKHbB3SnPOAeQhQnk8xn
ATGACGACGTCTGCACCTCTTCAGCGAGGGTATGACCACGTTGGTCAGCCGGACCGAGCC
AATCGAGCTTGGTGGAACAA
>RandomSequence_CjetvXyAxJ5P1lQVcArbgNTvHpJHRvvv
CGATAGCAGCACACGGCGGGCCACCCATCATAGACTCCGGCGTTCAGGGCCGTATCAATT
GAGTCGAAGCTGAAACGTCA
>RandomSequence_vUqogYyedda55EPajRhHdQNHncrPmzc5
TTGAACTGGTGGACTATGCCGCCGAGGACGCCCGCGTAGAAATACCGTTCAACCTTTGCA
TCAATAAGAGTCAAATGTTA

Run through the following code:

from Bio import SeqIO

record_dict = SeqIO.to_dict(SeqIO.parse('seqs.fa', 'fasta'))

with open('output_fasta.fa', 'w') as handle:
    SeqIO.write(record_dict.values(), handle, 'fasta')

Yields this output file (output_fasta.fa): - spoiler alert, it's the same as the input file (duh!) :)

>RandomSequence_3c0u91QYYaQ6aKHbB3SnPOAeQhQnk8xn
ATGACGACGTCTGCACCTCTTCAGCGAGGGTATGACCACGTTGGTCAGCCGGACCGAGCC
AATCGAGCTTGGTGGAACAA
>RandomSequence_CjetvXyAxJ5P1lQVcArbgNTvHpJHRvvv
CGATAGCAGCACACGGCGGGCCACCCATCATAGACTCCGGCGTTCAGGGCCGTATCAATT
GAGTCGAAGCTGAAACGTCA
>RandomSequence_vUqogYyedda55EPajRhHdQNHncrPmzc5
TTGAACTGGTGGACTATGCCGCCGAGGACGCCCGCGTAGAAATACCGTTCAACCTTTGCA
TCAATAAGAGTCAAATGTTA

This assumes the OP obtained the dictionary directly from SeqIO, in which case the dictionary's values() are already nicely formatted SeqRecord objects. The AttributeError suggests the dictionary likely comes from outside the Biopython ecosystem. Still, your example is extremely educational: by looking at the output of print(record_dict), someone can learn how to construct SeqRecord objects from any key-value data structure.
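To make that concrete, a small sketch of inspecting what SeqIO.to_dict produces is below. The two-record FASTA text is invented for the example and is parsed from memory rather than from a file.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content, parsed from an in-memory handle instead of a file
example_fasta = ">seqA some description\nACGTACGT\n>seqB\nGGCC\n"
record_dict = SeqIO.to_dict(SeqIO.parse(StringIO(example_fasta), "fasta"))

# Each value is a SeqRecord; these are the attributes SeqIO.write relies on
for key, record in record_dict.items():
    print(key, type(record).__name__, record.id, repr(record.description), str(record.seq))
```

The keys are the record ids, and the values already carry id, description and seq, which is exactly what a plain string-to-string dictionary lacks.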

5.9 years ago

Dear Eric Lim and jrj.healey,

Thank you for your answers. However, the problem seems to be a bit more complicated: I have been trying for 2 days just to get FASTA output from my file. I'm new to Python, but this single problem has taken a lot longer to solve than writing the rest of the code. I'm sorry if I was not specific and clear enough. To start, a piece of my code is below:

for seq_record in SeqIO.parse("/home/june/Desktop/snp/ssr/transcriptome.fasta", "fasta"):
    if seq_record.id in str(dicti.keys()).strip(">"):
        dicti[">" + seq_record.id] = dicti[">" + seq_record.id] + str(seq_record.seq)

In the end I get my output in the format of:

Transcript345: Motif: GC Total length: 20 CAAGGTCAGGCCTTCTTTATGCATGATAAGCACTGTGAGGACCCAGGGCAGCTTCAGTGATCATCAGGTGAGTTTAAGGTGGGGGGGGGGGGGCT

However I need it in the FASTA format.

And my dictionary is in the format of:

'>Transcript345': 'Motif: GC Total length: 20 CAAGGTCAGGCCTTCTTTATGCATG... ' #transcript id as the key and the rest as the value

Please ask me questions if I am still not clear enough.

Thanks in advance! WM


You should add a reply to your original question instead of replying as an answer.

And no, I'm not sure I follow what exactly you're trying to do, and I'm quite certain str(dicti.keys()).strip(">") isn't doing what you think it's doing.
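A quick sketch of why that expression is suspect, using a made-up two-entry dictionary in the same shape as yours (the values are placeholders, not your data):

```python
# Hypothetical dictionary shaped like the one in the post; values are placeholders
dicti = {">Transcript345": "placeholder", ">Transcript12": "placeholder"}

# str() on the keys view gives one long string, and strip(">") only removes
# ">" characters from the two ends of that string, where there are none
keys_as_text = str(dicti.keys()).strip(">")
print(keys_as_text)

# "in" on a string is a substring test, so partial ids match by accident
print("Transcript3" in keys_as_text)

# One alternative: compare against a set of ids with the leading ">" removed
known_ids = {key.lstrip(">") for key in dicti}
print("Transcript3" in known_ids, "Transcript345" in known_ids)
```

With the set, a record id only matches when it is exactly equal to one of the keys.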

Your code hints that you're trying to output the subset of transcriptome.fasta whose record ids match your dictionary keys, but it would probably be more helpful if you could post an example of your dictionary, a line or two of your input, and your desired output.


This should not be an answer.

Your post formatting is a little screwed up, but if the format is what I think it is, how is it not in FASTA format?

If you've got too many > characters, which it looks like you might have, just remove them from the string concatenation you're doing...

Please post more of the code and your input so we can see the issue more clearly.
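In the meantime, here is a rough sketch of writing such a dictionary as FASTA in plain Python. It rests on assumptions that may not hold for the real data: every key is the transcript id with a leading ">", and every value is the annotation text followed by the sequence as the last space-separated token. The function name dict_to_fasta and the example entry are made up for illustration.

```python
# Hypothetical entry in the shape described in the post (sequence shortened and invented)
dicti = {">Transcript345": "Motif: GC Total length: 20 ACGTACGTAC"}


def dict_to_fasta(records, width=60):
    """Build FASTA text, assuming the sequence is the last space-separated token."""
    lines = []
    for key, value in records.items():
        # Split once from the right: annotation on the left, sequence on the right
        annotation, _, sequence = value.strip().rpartition(" ")
        # Ensure exactly one ">" at the start of the header line
        header = ">" + key.lstrip(">")
        if annotation:
            header += " " + annotation
        lines.append(header)
        # Wrap the sequence at a fixed line width
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


fasta_text = dict_to_fasta(dicti)
print(fasta_text)
```

Keeping the annotation in the header line after the id means the sequence lines contain only bases, which is what FASTA parsers expect.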
