How to deal with spaces in sequence names with Biopython?
8.9 years ago
grayapply2009 ▴ 280

I have a fasta file formatted as follows:

>UPF0471 protein C1orf63 homolog

some sequence

>WD repeat-containing protein 43

some sequence

>transmembrane protein 41A

some sequence

When I print record.id or build dictionaries, Biopython does not handle the spaces in the sequence names: it keeps only the first word. How can I get Biopython to treat the whole name as the identifier rather than just the first word?

sequence space name biopython • 3.0k views

Replace the spaces with "_" or "-"?


You'll find most tools will take the same attitude to spaces and FASTA identifiers, so good idea!
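A minimal sketch of the underscore approach suggested above, assuming Biopython is installed. The sequences are made-up placeholders and the in-memory handles stand in for real files; a real script would pass file names instead.

```python
from io import StringIO

from Bio import SeqIO

# Placeholder data modelled on the headers in the question.
fasta_text = """>UPF0471 protein C1orf63 homolog
MKTAYIAKQR
>WD repeat-containing protein 43
MSDNELLQAA
"""

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for rec in records:
    # Use the full header, with spaces replaced, as the identifier.
    rec.id = rec.description.replace(" ", "_")
    # Clear the description so the header is not written twice.
    rec.description = ""

out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
renamed_headers = [
    line for line in out_handle.getvalue().splitlines() if line.startswith(">")
]
print(renamed_headers)
```

After this rewrite, record.id carries the whole name, so downstream tools that split headers on whitespace still see the full identifier.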

8.9 years ago

You can get the whole header by using record.description; record.id only holds the text up to the first whitespace.
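To illustrate the difference, here is a short sketch assuming Biopython is installed. The sequences are invented placeholders, and the FASTA text is parsed from a string rather than a file so the example is self-contained.

```python
from io import StringIO

from Bio import SeqIO

# Placeholder data modelled on the headers in the question.
fasta_text = """>UPF0471 protein C1orf63 homolog
MKTAYIAKQR
>WD repeat-containing protein 43
MSDNELLQAA
>transmembrane protein 41A
MLLPVVAGLL
"""

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# record.id stops at the first whitespace; record.description is the full header.
ids = [rec.id for rec in records]
descriptions = [rec.description for rec in records]

for rec in records:
    print(rec.id, "|", rec.description)
```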

8.9 years ago
Peter 6.0k

To answer your second question (how to make a dictionary with SeqIO.to_dict that uses the full descriptions, spaces included, as keys): you need to pass a key_function, as help(SeqIO.to_dict) tries to explain, e.g.

my_dict = SeqIO.to_dict(sequences, key_function=lambda rec: rec.description)
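A fuller, self-contained sketch of the same call, assuming Biopython is installed. The sequences are invented placeholders and the records are parsed from a string instead of a file.

```python
from io import StringIO

from Bio import SeqIO

# Placeholder data modelled on the headers in the question.
fasta_text = """>UPF0471 protein C1orf63 homolog
MKTAYIAKQR
>WD repeat-containing protein 43
MSDNELLQAA
>transmembrane protein 41A
MLLPVVAGLL
"""

sequences = SeqIO.parse(StringIO(fasta_text), "fasta")

# Key each record by its full header line instead of the default record.id.
my_dict = SeqIO.to_dict(sequences, key_function=lambda rec: rec.description)

wd_record = my_dict["WD repeat-containing protein 43"]
print(wd_record.seq)
```

Note that the keys must be unique: SeqIO.to_dict raises a ValueError if two records produce the same key.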

8.9 years ago
grayapply2009 ▴ 280

Then how do I make dictionaries with SeqIO.to_dict?


This isn't an answer - it is a new question, or an addendum to your old question?
