Qc And Coverage From Paired-End Reads
5
1
Entering edit mode
13.0 years ago

I have paired-end exome data (X_R1.fastq and X_R2.fastq) for several samples. I would like to run quality check and get coverage for these data.

I am wondering whether I should run fastx_quality_stats on the individual fastq files directly or I need to merge individual reads to a single file before doing QC.

What tool do you use to merge paired end reads ? I am not sure if the solution in this question is ideal for all my data.

What is the best practice in QC, to perform QC on individual reads or merged reads ?

How can I get coverage from paired end reads ?

next-gen sequencing coverage • 9.4k views
ADD COMMENT
5
Entering edit mode
13.0 years ago
brentp 24k

If you have annotated genome, you can calculate exon coverage with either HTSeqlink text which has a command-line script that looks like:

htseq-count your.sam exons.gff  > feature_counts.txt

you can also use bedtools to do the same, see the example there that looks like this:

samtools view -bf 0x2 reads.bam | bamToBed -i stdin | coverageBed -a stdin -b exons.bed > exons.bed.proper.coverage

to calculate the coverage of aligned and paired reads over exons.

For merging, unless your aligner requires it, that's not necessary. You can do QC on the ends separately (and see large differences). For trimming and then discarding reads/pairs, you do want to make sure you discard from both or neither ends as nearly all aligners assume the fastq ands are in the same order.

ADD COMMENT
2
Entering edit mode

Depends on the organism. if it's available at UCSC, http://genome.ucsc.edu/cgi-bin/hgTables?command=start, then choose your organism and version and knownGenes (or whichever) then choose output format BED. Then in the next page: Exons plus 0 bases.

ADD REPLY
2
Entering edit mode

Note that coverageBed now takea BAM directly. samtools view -uf 0x2 reads.bam | coverageBed -abam stdin -b exons.bed

I need to update the usage examples...

ADD REPLY
0
Entering edit mode

Thanks a lot for the details Brent, this is what I am looking for. How do I get that exons.bed ?

ADD REPLY
0
Entering edit mode

Many thanks ! I am working on humans. So it's there, am downloading now.

ADD REPLY
0
Entering edit mode

I am trying the samtools command and getting a warning and the program exits "It looks as though you have less than 3 columns at line: 1. Are you sure your files are tab-delimited?" Have you come across something like this before ?

ADD REPLY
0
Entering edit mode

which part is creating that error? the samtools view? your data is in bam format, not sam?

ADD REPLY
0
Entering edit mode

Oops. I think i used a wrong input file. It's running now. Thanks.

ADD REPLY
0
Entering edit mode

Thanks a lot Aaron !

ADD REPLY
3
Entering edit mode
13.0 years ago

Brent had a blog post on filtering paired end reads, could be a good start:

http://hackmap.blogspot.com/2010/09/filtering-paired-end-reads-high.html

As for the coverage question. One approach is to reconstitute the fragments from the read pairs and use these to compute the coverage.

ADD COMMENT
0
Entering edit mode

Thanks Istavan !

ADD REPLY
0
Entering edit mode
ADD REPLY
3
Entering edit mode
9.1 years ago
Czh3 ▴ 190

filter paired end reads, you may try thishttps://github.com/Czh3/NGSTools/blob/master/qualityControl.py.

usage: qualityControl.py [-h] -1 FASTQ1 [-2 FASTQ2] [-q MINQUALITY]
                         [-p MINPERCENT] [-a AVERAGEQUALITY] [-f {33,64}] -o1
                         OUTFILE1 [-o2 OUTFILE2]

Illumina Single-end or Pair-end sequencing quality control
tools.<zhangchao3@hotmail.com>

optional arguments:
  -h, --help            show this help message and exit
  -1 FASTQ1, --fastq1 FASTQ1
                        The input fastq1 file.
  -2 FASTQ2, --fastq2 FASTQ2
                        The input fastq2 file.
  -q MINQUALITY, --minQuality MINQUALITY
                        Minimum quality score to keep.
  -p MINPERCENT, --minPercent MINPERCENT
                        Minimum percent of bases that must have [-q] quality.
  -a AVERAGEQUALITY, --averageQuality AVERAGEQUALITY
                        The average quality of SE/PE reads less than the
                        averageQuality will be descard.
  -f {33,64}, --phredQualtiy {33,64}
                        phred quality score
  -o1 OUTFILE1, --outFile1 OUTFILE1
                        fastq1 output file.
  -o2 OUTFILE2, --outFile2 OUTFILE2
                        fastq2 output file.
ADD COMMENT
1
Entering edit mode
13.0 years ago

Heng Li has recently added new functionality to samtools (0.1.15) that will compute coverage based on a BED file ("-L" option). I believe it uses a BAM index file to seek to the appropriate regions via a disk seek. Depending on your disks, exons are probably near the point (N=~250,000 for knownGene transcripts) where it would be faster to use BEDTools coverageBed. My guess is that the speed plots for the two approaches intersect when the BED file has between 1e5 and 1e6 features.

Either way, samtools will use less memory if that is a concern. I don't think it tells you anything about the "shape" of the coverage, whereas coverageBed does.

This is a long way of telling you that you have lots of good options.

ADD COMMENT
1
Entering edit mode

For -L samtools does the bedtools way: index BED and read through alignments. Nonetheless, the released version is buggy. If we want to use the BAM index, we may use awk+xargs, which will be very slow given >100,000 exones. The best approach is to combine the two together: jump with the BAM index when two BED intervals are far apart and read through otherwise. To store and index BED, samtools uses 8 bytes per interval. I do not worry about the memory, though it may not perform well for some BEDs.

ADD REPLY
1
Entering edit mode

Thanks for clarifying the algorithm, Heng.

ADD REPLY
0
Entering edit mode

Thanks Aaron, Heng !

ADD REPLY
1
Entering edit mode
8.7 years ago
madk00k ▴ 360

Qualimap computes coverage and many other QC issues from BAM file. Here's the link:

http://qualimap.bioinfo.cipf.es/

ADD COMMENT

Login before adding your answer.

Traffic: 3138 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6