Question

How to count FASTA sequences efficiently using (or not) Biopython

2

6.2 years ago

juan.crescente ▴ 110

This is not a very memory-friendly way of counting sequences in a multi-FASTA file. Any ideas to improve it?

from Bio import SeqIO
generator = SeqIO.parse("test_fasta.fasta", "fasta")
# Builds a list holding every record length, so memory grows with the file
sizes = [len(rec) for rec in generator]

I'm avoiding tools like grep because I want this to be more portable.

biopython python fasta • 18k views

ADD COMMENT • link updated 5 months ago by Mark ★ 1.5k • written 6.2 years ago by juan.crescente ▴ 110

0

How to cont fasta

Count/concatenate/check length?

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Count, as in the description; it could also have been guessed from the example. PS: title updated.

ADD REPLY • link 6.2 years ago by juan.crescente ▴ 110

0

bioawk? github.com/lh3/bioawk

ADD REPLY • link 6.2 years ago by Ram 43k

5

6.2 years ago

tiago211287 ★ 1.4k

Count the number of sequences in 1 fasta file:

grep -c ">" file.fasta

Count the number of sequences in several fasta files:

find /home/folder/to/file/ -name "*.fasta" | parallel grep -Hc ">" {}

Get the length of each sequence line in a FASTA file (this equals the sequence length only when every sequence sits on a single line):

grep -v ">" 180110.fasta | awk '{print length}'
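
Since the question asks for something portable, here is a minimal Python sketch of the same idea that also sums wrapped (multi-line) sequences per record. The function name `fasta_lengths` is only an illustration, not part of any library, and the sketch assumes a plain-text FASTA file whose headers start with ">".

```python
def fasta_lengths(path):
    """Yield (header, length) for each record in a FASTA file, reading line by line."""
    header = None
    length = 0
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if header is not None:
                    yield header, length
                header = line[1:]
                length = 0
            elif header is not None:
                # Sequence lines of one record may be wrapped over several lines
                length += len(line.strip())
        # Emit the last record once the file ends
        if header is not None:
            yield header, length
```

Because it is a generator, only one line is held in memory at a time; something like `for name, n in fasta_lengths("file.fasta"): print(name, n)` would print one length per record.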

ADD COMMENT • link 6.2 years ago by tiago211287 ★ 1.4k

1

6.2 years ago

yhoogstrate ▴ 140

You could take a look at this Python library: https://github.com/mdshw5/pyfaidx. It uses .fai index files and can generate one if it is missing. It is also compatible with gzipped FASTA files.

ADD COMMENT • link 6.2 years ago by yhoogstrate ▴ 140

1

2.3 years ago

Carlos Caicedo ▴ 210

Using grep in Python

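import subprocess
# run() captures the count that grep prints; call() would only return the exit status
n = int(subprocess.run(['grep', '-c', '>', 'test_fasta.fasta'], capture_output=True, text=True).stdout)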

ADD COMMENT • link 2.3 years ago by Carlos Caicedo ▴ 210

0

2.3 years ago

onestop_data ▴ 330

I think using pysam is the fastest way to read a FASTA/FASTQ file - especially large files. You can use it to read the FASTA file and count the number of sequences - https://onestopdataanalysis.com/read-fasta-file-python/

ADD COMMENT • link 2.3 years ago by onestop_data ▴ 330

0

5 months ago

Mark ★ 1.5k

According to this SO post: https://stackoverflow.com/questions/393053/length-of-generator-output

This is the fastest:

len(list(generator))

This assumes that each FASTA sequence is one record, so the number of records equals the number of sequences. Note that this method consumes your generator and holds every record in memory while the list is built.
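
If memory matters more than raw speed, a generator can also be counted without building a list. The sketch below is only an illustration: `count_items` and `fake_records` are made-up names, and `fake_records` is a stand-in for a real record generator such as the one returned by `SeqIO.parse`. It may be somewhat slower than `len(list(generator))`, but it keeps only one item alive at a time.

```python
def count_items(iterable):
    """Count the items of any iterable without storing them."""
    return sum(1 for _ in iterable)


def fake_records(n):
    # Hypothetical stand-in for a record generator such as
    # SeqIO.parse("test_fasta.fasta", "fasta")
    for i in range(n):
        yield f"record_{i}"


records = fake_records(5)
print(count_items(records))  # 5
print(count_items(records))  # 0, because the generator is already exhausted
```

The second call shows that counting still consumes the generator, so re-create it if the records are needed afterwards.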

ADD COMMENT • link 5 months ago by Mark ★ 1.5k

score 6 · Accepted Answer · 2018-01-24

6

6.2 years ago

Andrzej Zielezinski 11k

Standard Python will be faster than Biopython:

n = 0
with open("test_fasta.fasta") as fh:
    for line in fh:
        if line.startswith(">"):
            n += 1

or shorter and possibly faster:

num = len([1 for line in open("test_fasta.fasta") if line.startswith(">")])
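
A variant of the one-liner that closes the file and avoids the temporary list could look like the sketch below. The function name `count_fasta_records` and the ".gz" handling are assumptions added for illustration; they are not part of the original answer.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    # Assumption: a name ending in ".gz" means the file is gzip-compressed
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The generator expression feeds sum() one line at a time
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records("test_fasta.fasta")` should give the same number as the loop above, with memory use independent of file size.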

ADD COMMENT • link 6.2 years ago by Andrzej Zielezinski 11k