Hi all,
Any script available for doing sequence length disrtibution of fastq file?
Thanks, Deepthi
Hi all,
Any script available for doing sequence length disrtibution of fastq file?
Thanks, Deepthi
Here is a solution using awk
:
awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' file.fastq
It reads like this: every second line in every group of 4 lines (the sequence line), measure the length of the sequence and increment the array cell corresponding to that length. When all lines have been read, loop over the array to print its content. In awk
, arrays and array cells are initialized when they are called for the first time, no need to initialize them before.
Another awk solution, very similar to Frédéric's:
cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > read_length.txt
And to quickly obtain a graph in R:
reads<-read.csv(file="read_length.txt", sep="", header=FALSE)
plot (reads$V2,reads$V1,type="l",xlab="read length",ylab="occurences",col="blue")
You can use FASTQC. It gives you all the details you need, including the quality, overrepresented sequences, enriched k-mers etc.
Another possibility, from BBMap:
readlength.sh in=reads.fq out=histogram.txt
The default is 10bp bins with a max of 80kbp, but those can be configured (run the shellscript with no arguments for details). It's very fast, and handles fastq/fasta/sam/bam; raw/gzip/bzip2.
Save it as fastq_lenght.py and run it :)
#!/usr/bin/env python
#-*- coding: UTF-8 -*-
import sys
######################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python fastq_lenght.py file.fastq
result: .txt file same name as input name plus "_lenghts.txt"
------------------------------------------------------------------------------------
'''
######################################################################################
if len(sys.argv) != 2:
print syntax
sys.exit()
######################################################################################
fastq_file = open(sys.argv[1], 'r')
prefix = sys.argv[1].split('.')[0]
outfile = open(prefix + '_' + 'lenghts.txt', 'w')
seq = ''
for line in fastq_file:
line = line.rstrip('\n')
if line.startswith('@'):
if seq:
outfile.write(name + '\t' + str(len(seq)) + '\n')
seq = ""
name = line
else:
seq = line
outfile.write(name + '\t' + str(len(seq)) + '\n')
fastq_file.close()
outfile.close()
print '\n' + '\t' + 'File: ' + prefix + '_' + 'lenghts.txt has been created...'
You can do this using fastx toolkit, especially, you might be interested in this program:
fastx_nucleotide_distribution_graph.sh
and maybe:
fastx_quality_stats
fastq_quality_boxplot_graph.sh
Use this post as a help Fastq Quality Read and Score Length Check and apply a few basic unix commands like sort and uniq to get the frequency.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I'm surprised anyone answered this, given that you present no evidence of having looked for a solution yourself.
Maybe just let the community decide on whether a question is worthy of answering or not (unless of course it is completely unsuitable for Biostar). To most of us who have been at this a while his question is trivial, but to him... Lets not make people feel bad about asking questions.
I don't care if questions seem "trivial". I do care about questions that begin "is there any script" - it simply implies that they have not even tried a Web search before asking.
Good point: I'm as a consequence surprised by myself...
Embrace the non-Googlers of Biostars
Hello,
I am trying to do something similar, were I want to determine both the read length and how many reads are in my fastq file. Here is my code:
But I keep getting the following error:
I'm not sure what I've done incorrectly? I also tried Frederic's code above and although I got that to run, its not exactly the output I'm seeking, I should be returning something like 2420797 100
Any help would be super appreciated!
Between print and length should be at least one space. If you want to compute the number of reads simultaneously, use something like Frèdèric's approach
Thanks! This was a winner.