Count the length of a sequence and sort it into a multifasta file
1
0
Entering edit mode
3.2 years ago
diego1530 ▴ 80

Biostars, I'm here to request support for the following situation.

In general, to make it easier because I'm a beginner in programming, I have written this script based on Biopython

from Bio import SeqIO

#Get the lengths and ids, and sort on length

len_and_ids = sorted((len(rec), rec.id)
for rec in SeqIO.parse("mycobt.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("mycobt.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")

This code is working correctly and as a result it sorts the sequences from smallest to largest, including the nucleotides in the following way:

>Myco_seq3
ATCCAA

>Myco_seq4
GCCTCATCTC

>Myco_seq2
ATCGGTCATCCGATAA

>Myco_seq1
TCCAGCCTTCCTCATGGCTCCCA

Instead, what I really want is that the result only shows me the ID of the fasta with its respective length in number from smallest to largest, as in the following example:

> Myco_seq3 | 6

>Myco_seq4 | 10

>Myco_seq2 | 16

>Myco_seq1 | 23

I've looked through the forums, but only see ones where people filter by specific length. Could someone help me here? I would highly appreciate your contribution and support.

Biopython Sequence Count Length • 833 views
ADD COMMENT
0
Entering edit mode
3.2 years ago
Mensur Dlakic ★ 27k

Try replacing the last line in your script with these lines:

out_fasta = open("sorted.fasta", "w")
for rec in records:
    out_fasta.write('>%s | %d\n' % ( rec.id, len(rec)))
out_fasta.close()
ADD COMMENT

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6