Tool for calculating sequence statistic values (e.g. N50) for multiple bins
7.2 years ago
rororo ▴ 10

Does anyone know of a script that calculates sequence statistics such as N50, number of contigs, contig lengths, etc. for multiple bins? So far, the GenomeTools seqstat script only calculates these for one bin at a time.

binning Assembly statistics • 22k views

Perfect! format=5 did the trick for me, thanks!

You're welcome! Just note that I use "N50" to describe a number of contigs (since Number starts with N) and "L50" to describe a length in bp (since Length starts with L). Some programs reverse that nomenclature for reasons that are opaque to me.
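
To make the two naming conventions concrete, here is a minimal Python sketch. The function name `half_assembly_point` and the toy contig lengths are made up for illustration; this is not code from BBMap or any other tool. It computes the two quantities once and then prints them under each labelling.

```python
def half_assembly_point(lengths):
    """Return (contig_length, contig_count) at the 50% point of an assembly.

    contig_length: length of the contig that brings the running total
                   (longest contigs first) to at least half the assembly size
    contig_count:  number of contigs needed to reach that point
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


# Toy contig lengths in bp (hypothetical data).
length_bp, n_contigs = half_assembly_point([10, 8, 6, 4, 2])

# Convention used by e.g. assembly-stats: N50 is a length, L50 is a count.
print(f"length-first naming: N50={length_bp} bp, L50={n_contigs} contigs")
# Convention described in the comment above: N50 is a count, L50 is a length.
print(f"count-first naming:  N50={n_contigs} contigs, L50={length_bp} bp")
```

The numbers are identical either way; only the labels swap, so it is worth checking which convention a tool uses before comparing its output with another tool's.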

7.2 years ago

The BBMap package has a tool, "stats.sh", for calculating these statistics on an individual FASTA file. It also has another tool, "statswrapper.sh", that calculates the statistics for multiple FASTA files and outputs this information (assembly size, N50, L50, number of contigs, GC%, etc.) as one tab-delimited line per file. I actually wrote it for comparing different assemblies of the same data, but it works for this purpose as well.

Hi Brian, just a comment. I used the wrapper you mentioned above and got the following output:

Main genome scaffold N/L50:             3/804334
Main genome contig N/L50:                3/804334

It appears the N and L values should be the other way around: N/L50: 804334/3

I think this is still an issue; it confused me for a while.

Hi! Could you give an example of its usage? For instance, if I have several .fna files, how can I run it from the bash command line? I have already installed the BBMap package on my local computer/conda environment.

@biohacker_tobe I do not know if you still need the answer, but on a Linux terminal the stats.sh command looks like this:

./stats.sh in=<file>

Here in= is the path to the FASTA file.
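
For the multi-file case asked about above, one option that does not depend on any particular tool's flags is a small standalone script. The following is only a sketch: the function names, the column names, and the script name `bin_stats.py` are all made up here, and it labels N50 as a length and L50 as a count (the assembly-stats convention, i.e. the reverse of BBMap's). It prints one tab-delimited line per FASTA file.

```python
import sys

# Column names are this sketch's own choice, not any tool's output format.
COLUMNS = ["file", "contigs", "total_bp", "largest_bp",
           "N50_bp", "L50_contigs", "gc_pct"]


def read_fasta_lengths(path):
    """Return (list of sequence lengths, number of G/C bases) for a FASTA file."""
    lengths, gc, current = [], 0, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header closes the previous record and opens a new one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                seq = line.upper()
                current += len(seq)
                gc += seq.count("G") + seq.count("C")
    if current is not None:
        lengths.append(current)
    return lengths, gc


def summarize(path):
    """Compute basic assembly statistics for one FASTA file."""
    lengths, gc = read_fasta_lengths(path)
    total = sum(lengths)
    n50_bp, l50_contigs, running = 0, 0, 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:  # reached half of the assembly size
            n50_bp, l50_contigs = length, count
            break
    return {
        "file": path,
        "contigs": len(lengths),
        "total_bp": total,
        "largest_bp": max(lengths, default=0),
        "N50_bp": n50_bp,
        "L50_contigs": l50_contigs,
        "gc_pct": round(100.0 * gc / total, 2) if total else 0.0,
    }


def main(paths):
    """Print a header line, then one tab-delimited line per input file."""
    print("\t".join(COLUMNS))
    for path in paths:
        row = summarize(path)
        print("\t".join(str(row[col]) for col in COLUMNS))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as bin_stats.py (a hypothetical name), it could be run as `python bin_stats.py bins/*.fna > bin_stats.tsv`, letting the shell expand the wildcard to one argument per bin.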

7.2 years ago

https://github.com/sanger-pathogens/assembly-stats

Example

$ assembly-stats Pf3D7_v3.fasta
stats for Pf3D7_v3.fasta
sum = 23328019, n = 16, ave = 1458001.19, largest = 3291936
N50 = 1687656, n = 5
N60 = 1472805, n = 7
N70 = 1445207, n = 8
N80 = 1343557, n = 10
N90 = 1067971, n = 12
N100 = 5967, n = 16
N_count = 0
Gaps = 0