Question

MiSeq reagent kit version 3 : bad reads qualities

2

Entering edit mode

7.0 years ago

sacha ★ 2.4k

we did a 16S RNA analysis using MiSeq kit version 3. This kit makes paired-end reads with a size of 300pb. By merging both reads, we can cover 535 pb of region V3-V5.

As you can see in the following plot, qualities of reads are really bad after the nucleotid 250. Is it normal ? Perhaps the strategy prefers reads lenghs to the quality ? Did you get something similiar with the same kit ?

V3-V5 region fastqt quality plot After merging with Flash

miseq illumina quality phred fastqt • 5.6k views

ADD COMMENT • link updated 7.0 years ago by h.mon 35k • written 7.0 years ago by sacha ★ 2.4k

2

Entering edit mode

This is a "normal" (par for the course) observation with v.3 2x300 MiSeq kits. You will find many threads about this on SeqAnswers: http://seqanswers.com/forums/showthread.php?t=59558 , http://seqanswers.com/forums/showthread.php?t=73926

ADD REPLY • link 7.0 years ago by GenoMax 141k

1

Entering edit mode

Thanks for posting the nice diagram of ribosomal variable regions!

If the q-scores can be trusted, this is actually not too bad. I mean, it's bad, but not unusable; I've seen far worse from 2x300 kits. You will get better quality from 2x250bp kits, but then you can't get the whole region. I'm not an expert on 16S so I can't offer advice on the best strategy or region, but overall this data looks promising.

I recommend that people sequence a known mock community comprised of known organisms with complete references to evaluate true positive/false positive/true negative/false negative rates of a methodology before employing it for real research. This would be less necessary if the quality scores were accurate, but they are not, and Illumina shows little interest in changing that (they've generally become less accurate over time).

You might be able to ascertain the true quality scores by assembling with a kmer longer than the length of the conserved regions, and mapping to the assembly. I;m not sure how well it will work, but it's worth a try. Something like this, using BBMap:

bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq qtrim2 
tadpole.sh in=merged.fq out=contigs.fa k=93
bbmap.sh in1=r1.fq in2=r2.fq mhist=mhist.txt qahist=qahist.txt pairedonly maxindel=100

If assembly works, that will tell how accurate your bases actually are. I have not tried it before, though, with amplicon data. Also, "k=93" might need to be increased to cover the longest conserved region, presumably the gap between V4 and V5 based on the diagram. If everything is identical there, it would need to be increased; if they're just mostly similar, it won't. You can try assembling with different kmer lengths and see which one gives the best results; ideally, you want lots of >500bp contigs.

You could also try mapping to Silva, if these are mostly already-sequenced organisms...

ADD REPLY • link 7.0 years ago by Brian Bushnell 20k

0

Entering edit mode

I think Illumina sequencing is not famous for long read lengths. I am afraid 300 bp is too much for MiSeq protocol. You will be better off with smaller reads (100-150?)

ADD REPLY • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

The kit was purchased from illumina ! https://www.illumina.com/products/by-type/sequencing-kits/cluster-gen-sequencing-reagents/miseq-reagent-kit-v3.html

ADD REPLY • link 7.0 years ago by sacha ★ 2.4k

0

Entering edit mode

Yes they sell it, but I hear from users that it doesn't mean that the quality stays good at the end. Check out the posts @genomax2 pointed out.

ADD REPLY • link 7.0 years ago by Benn 8.3k

0

Entering edit mode

Typical values for PE-300 runs. But the lowest median Q scores for the merged reads are still >20 (99% accurate). That's sufficient for most applications - is it not for yours?

ADD REPLY • link 7.0 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

But I lost 40% of my reads by merging and cleaning with Q>10 .

ADD REPLY • link 7.0 years ago by sacha ★ 2.4k

0

Entering edit mode

How did you lose reads by merging? Long as you have a good reference to align to having sub-optimal Q scores may not be a big deal.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

I merged my paired-end data with Flash . https://ccb.jhu.edu/software/FLASH/

  flash sample_F.fastq sample_R.fastq  > merged.fastq

merged.fastq contains 500pb read length . But only 60% reads are preserved.

ADD REPLY • link 7.0 years ago by sacha ★ 2.4k

0

Entering edit mode

Losing 40% is surprising. The first graph indicates that the vast majority of Q<10 bases are in the overlap region (236-300). Merging those should produce Q>10 values for most of those bases. I haven't used Flash; how does it calculate Q scores of overlapping bases?

ADD REPLY • link 7.0 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

Flash try to align the overloap between the forward and the reverse. If the alignement score is too low, the read is ignored. Here is the output of flash trying to merge a pairs of fastq. Do you have any suggestion for another tools ? I already tested vsearch and illumina-utils. But both had worst results.

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     ../raw/1001_1.fastq.gz
[FLASH]     ../raw/1001_2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./1001.extendedFrags.fastq
[FLASH]     ./1001.notCombined_1.fastq
[FLASH]     ./1001.notCombined_2.fastq
[FLASH]     ./1001.hist
[FLASH]     ./1001.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      64
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 64 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 75000 read pairs
[FLASH] Processed 100000 read pairs
[FLASH] Processed 125000 read pairs
[FLASH] Processed 150000 read pairs
[FLASH] Processed 175000 read pairs
[FLASH] Processed 200000 read pairs
[FLASH] Processed 225000 read pairs
[FLASH] Processed 250000 read pairs
[FLASH] Processed 273419 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      273419
[FLASH]     Combined pairs:   185744
[FLASH]     Uncombined pairs: 87675
[FLASH]     Percent combined: 67.93%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 2.957 seconds elapsed

ADD REPLY • link 7.0 years ago by sacha ★ 2.4k

1

Entering edit mode

I've gotten good results using BBMerge from the BBMap package (guide available here).

ADD REPLY • link 7.0 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

I updated my merged plot. It was after low qualities filtering. Now the plot shows merged just after Flash

ADD REPLY • link 7.0 years ago by sacha ★ 2.4k

score 2 · Accepted Answer · 2017-04-25

2

Entering edit mode

7.0 years ago

h.mon 35k

As pointed in the comments, Illumina's problems with MiSeq v3 kit and 2x300bp are known and old, see for example here and here. Informally, (at least some) Illumina technicians will tell you to stay with 2x250bp, I am not sure if someone at Illumina said so on record.

ADD COMMENT • link 7.0 years ago by h.mon 35k