Question

Sequence Length Distribution From A Fastq File

16

Entering edit mode

10.9 years ago

deepthithomaskannan ▴ 380

Hi all,

Any script available for doing sequence length disrtibution of fastq file?

Thanks, Deepthi

sequence length fastq • 84k views

ADD COMMENT • link updated 5.0 years ago by LuckyLuck • 0 • written 10.9 years ago by deepthithomaskannan ▴ 380

0

Entering edit mode

Hello,

I am trying to do something similar, were I want to determine both the read length and how many reads are in my fastq file. Here is my code:

gunzip -c SRR1060507_1.fastq.gz|awk 'NR%4==2{printlength($0)}'|uniq -c

But I keep getting the following error:

awk: cmd. line:1: (FILENAME=- FNR=2) fatal: function `printlength' not defined

I'm not sure what I've done incorrectly? I also tried Frederic's code above and although I got that to run, its not exactly the output I'm seeking, I should be returning something like 2420797 100

Any help would be super appreciated!

ADD REPLY • link 6.2 years ago by oars ▴ 200

2

Entering edit mode

Between print and length should be at least one space. If you want to compute the number of reads simultaneously, use something like Frèdèric's approach

awk 'NR%4 == 2 {lengths[length($0)]++ ; counter++} END {for (l in lengths) {print l, lengths[l]}; print "total reads: " counter}' file.fastq

ADD REPLY • link 6.2 years ago by michael.ante ★ 3.8k

0

Entering edit mode

Thanks! This was a winner.

ADD REPLY • link 6.2 years ago by oars ▴ 200

2

Entering edit mode

10.9 years ago

Manu Prestat 4.1k

You can do this using fastx toolkit, especially, you might be interested in this program:

fastx_nucleotide_distribution_graph.sh

and maybe:

fastx_quality_stats
fastq_quality_boxplot_graph.sh

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 10.9 years ago by Manu Prestat 4.1k

1

Entering edit mode

i think this does not provide length distribution information

ADD REPLY • link 8.4 years ago by Malcolm.Cook ★ 1.5k

1

Entering edit mode

10.9 years ago

Ashutosh Pandey 12k

Use this post as a help Fastq Quality Read and Score Length Check and apply a few basic unix commands like sort and uniq to get the frequency.

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 10.9 years ago by Ashutosh Pandey 12k

Ram · Accepted Answer · 2013-05-22

38

Entering edit mode

10.9 years ago

Frédéric Mahé ★ 3.2k

Here is a solution using awk:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' file.fastq

It reads like this: every second line in every group of 4 lines (the sequence line), measure the length of the sequence and increment the array cell corresponding to that length. When all lines have been read, loop over the array to print its content. In awk, arrays and array cells are initialized when they are called for the first time, no need to initialize them before.

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 10.9 years ago by Frédéric Mahé ★ 3.2k

1

Entering edit mode

Thanks for teaching me the NR%4 line grouping method: pretty useful indeed when dealing with fastq files.

ADD REPLY • link updated 5.0 years ago by Ram 43k • written 10.9 years ago by Manu Prestat 4.1k

Ram · Accepted Answer · 2013-08-07

19

Entering edit mode

10.7 years ago

bastienllamas ▴ 190

Another awk solution, very similar to Frédéric's:

cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > read_length.txt

And to quickly obtain a graph in R:

reads<-read.csv(file="read_length.txt", sep="", header=FALSE)
plot (reads$V2,reads$V1,type="l",xlab="read length",ylab="occurences",col="blue")

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 10.7 years ago by bastienllamas ▴ 190

4

Entering edit mode

no need to use cat

ADD REPLY • link 10.7 years ago by PoGibas 5.1k

Ram · Accepted Answer · 2013-05-22

11

Entering edit mode

10.9 years ago

Jordan ★ 1.3k

You can use FASTQC. It gives you all the details you need, including the quality, overrepresented sequences, enriched k-mers etc.

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 10.9 years ago by Jordan ★ 1.3k

0

Entering edit mode

Can you use the FASTQC on nanopore reads?

ADD REPLY • link 4.3 years ago by lovegenetics ▴ 70

0

Entering edit mode

You could do that but you may want to use NanoPlot instead. It will generate stats that are specific for nanopore sequencing and thus more useful.

ADD REPLY • link 4.3 years ago by GenoMax 141k

score 10 · Accepted Answer · 2017-04-24

10

Entering edit mode

7.0 years ago

Brian Bushnell 20k

Another possibility, from BBMap:

readlength.sh in=reads.fq out=histogram.txt

The default is 10bp bins with a max of 80kbp, but those can be configured (run the shellscript with no arguments for details). It's very fast, and handles fastq/fasta/sam/bam; raw/gzip/bzip2.

ADD COMMENT • link 7.0 years ago by Brian Bushnell 20k

0

Entering edit mode

Really useful! Only if I want to parse the output (e.g. pipe it into grep) I have to first suppress the stderr message: Processed XY reads. Time: Z seconds. Might it be more convenient to have this info also in stdout?

ADD REPLY • link 4.6 years ago by al-ash ▴ 200

score 2 · Accepted Answer · 2017-04-24

Save it as fastq_lenght.py and run it :)

#!/usr/bin/env python
#-*- coding: UTF-8 -*-
import sys
######################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python fastq_lenght.py file.fastq 
result: .txt file same name as input name plus "_lenghts.txt" 
------------------------------------------------------------------------------------
'''
######################################################################################

if len(sys.argv) != 2:
    print syntax
    sys.exit()

######################################################################################

fastq_file = open(sys.argv[1], 'r')
prefix = sys.argv[1].split('.')[0]
outfile = open(prefix + '_' + 'lenghts.txt', 'w')
seq = ''
for line in fastq_file:
    line = line.rstrip('\n')
    if line.startswith('@'):
        if seq:
            outfile.write(name + '\t' + str(len(seq)) + '\n')
            seq = ""
        name = line 
    else:
        seq = line 
outfile.write(name + '\t' + str(len(seq)) + '\n')
fastq_file.close()
outfile.close()

print '\n' + '\t' + 'File: ' + prefix + '_' + 'lenghts.txt has been created...'