Question

Count the length of a sequence and sort it into a multifasta file

0

Entering edit mode

3.2 years ago

diego1530 ▴ 80

Biostars, I'm here to request support for the following situation.

In general, to make it easier because I'm a beginner in programming, I have written this script based on Biopython

from Bio import SeqIO

#Get the lengths and ids, and sort on length

len_and_ids = sorted((len(rec), rec.id)
for rec in SeqIO.parse("mycobt.fasta","fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids #free this memory
record_index = SeqIO.index("mycobt.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")

This code is working correctly and as a result it sorts the sequences from smallest to largest, including the nucleotides in the following way:

>Myco_seq3
ATCCAA

>Myco_seq4
GCCTCATCTC

>Myco_seq2
ATCGGTCATCCGATAA

>Myco_seq1
TCCAGCCTTCCTCATGGCTCCCA

Instead, what I really want is that the result only shows me the ID of the fasta with its respective length in number from smallest to largest, as in the following example:

> Myco_seq3 | 6

>Myco_seq4 | 10

>Myco_seq2 | 16

>Myco_seq1 | 23

I've looked through the forums, but only see ones where people filter by specific length. Could someone help me here? I would highly appreciate your contribution and support.

Biopython Sequence Count Length • 833 views

ADD COMMENT • link updated 3.2 years ago by Mensur Dlakic ★ 27k • written 3.2 years ago by diego1530 ▴ 80

score 0 · Answer 1 · 2021-02-08

0

Entering edit mode

3.2 years ago

Mensur Dlakic ★ 27k

Try replacing the last line in your script with these lines:

out_fasta = open("sorted.fasta", "w")
for rec in records:
    out_fasta.write('>%s | %d\n' % ( rec.id, len(rec)))
out_fasta.close()

ADD COMMENT • link 3.2 years ago by Mensur Dlakic ★ 27k