Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

1
2
Entering edit mode
4.3 years ago
sacha ★ 2.1k

we did a 16S RNA analysis using MiSeq kit version 3. This kit makes paired-end reads with a size of 300pb. By merging both reads, we can cover 535 pb of region V3-V5.

As you can see in the following plot, qualities of reads are really bad after the nucleotid 250. Is it normal ? Perhaps the strategy prefers reads lenghs to the quality ? Did you get something similiar with the same kit ?

After merging with Flash

miseq illumina quality phred fastqt • 3.6k views
2
Entering edit mode

1
Entering edit mode

Thanks for posting the nice diagram of ribosomal variable regions!

If the q-scores can be trusted, this is actually not too bad. I mean, it's bad, but not unusable; I've seen far worse from 2x300 kits. You will get better quality from 2x250bp kits, but then you can't get the whole region. I'm not an expert on 16S so I can't offer advice on the best strategy or region, but overall this data looks promising.

I recommend that people sequence a known mock community comprised of known organisms with complete references to evaluate true positive/false positive/true negative/false negative rates of a methodology before employing it for real research. This would be less necessary if the quality scores were accurate, but they are not, and Illumina shows little interest in changing that (they've generally become less accurate over time).

You might be able to ascertain the true quality scores by assembling with a kmer longer than the length of the conserved regions, and mapping to the assembly. I;m not sure how well it will work, but it's worth a try. Something like this, using BBMap:

bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq qtrim2
bbmap.sh in1=r1.fq in2=r2.fq mhist=mhist.txt qahist=qahist.txt pairedonly maxindel=100


If assembly works, that will tell how accurate your bases actually are. I have not tried it before, though, with amplicon data. Also, "k=93" might need to be increased to cover the longest conserved region, presumably the gap between V4 and V5 based on the diagram. If everything is identical there, it would need to be increased; if they're just mostly similar, it won't. You can try assembling with different kmer lengths and see which one gives the best results; ideally, you want lots of >500bp contigs.

You could also try mapping to Silva, if these are mostly already-sequenced organisms...

0
Entering edit mode

I think Illumina sequencing is not famous for long read lengths. I am afraid 300 bp is too much for MiSeq protocol. You will be better off with smaller reads (100-150?)

0
Entering edit mode
0
Entering edit mode

Yes they sell it, but I hear from users that it doesn't mean that the quality stays good at the end. Check out the posts @genomax2 pointed out.

0
Entering edit mode

Typical values for PE-300 runs. But the lowest median Q scores for the merged reads are still >20 (99% accurate). That's sufficient for most applications - is it not for yours?

0
Entering edit mode

But I lost 40% of my reads by merging and cleaning with Q>10 .

0
Entering edit mode

How did you lose reads by merging? Long as you have a good reference to align to having sub-optimal Q scores may not be a big deal.

0
Entering edit mode

I merged my paired-end data with Flash . https://ccb.jhu.edu/software/FLASH/

  flash sample_F.fastq sample_R.fastq  > merged.fastq


merged.fastq contains 500pb read length . But only 60% reads are preserved.

0
Entering edit mode

Losing 40% is surprising. The first graph indicates that the vast majority of Q<10 bases are in the overlap region (236-300). Merging those should produce Q>10 values for most of those bases. I haven't used Flash; how does it calculate Q scores of overlapping bases?

0
Entering edit mode

Flash try to align the overloap between the forward and the reverse. If the alignement score is too low, the read is ignored. Here is the output of flash trying to merge a pairs of fastq. Do you have any suggestion for another tools ? I already tested vsearch and illumina-utils. But both had worst results.

[FLASH] Starting FLASH v1.2.11
[FLASH]
[FLASH] Input files:
[FLASH]     ../raw/1001_1.fastq.gz
[FLASH]     ../raw/1001_2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./1001.extendedFrags.fastq
[FLASH]     ./1001.notCombined_1.fastq
[FLASH]     ./1001.notCombined_2.fastq
[FLASH]     ./1001.hist
[FLASH]     ./1001.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH]
[FLASH]     Total pairs:      273419
[FLASH]     Combined pairs:   185744
[FLASH]     Uncombined pairs: 87675
[FLASH]     Percent combined: 67.93%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 2.957 seconds elapsed

1
Entering edit mode

I've gotten good results using BBMerge from the BBMap package (guide available here).

0
Entering edit mode

I updated my merged plot. It was after low qualities filtering. Now the plot shows merged just after Flash

2
Entering edit mode
4.3 years ago
h.mon 32k

As pointed in the comments, Illumina's problems with MiSeq v3 kit and 2x300bp are known and old, see for example here and here. Informally, (at least some) Illumina technicians will tell you to stay with 2x250bp, I am not sure if someone at Illumina said so on record.