Confused about the quality control strategy for 16S paired-end sequences from the Illumina MiSeq platform
9.2 years ago
hua.peng1314 ▴ 100

I am new to metagenomics and I am confused about the quality control strategy for 16S paired-end sequences from the Illumina MiSeq platform. No doubt the quality control strategy can affect the downstream analysis. I find that a 50 bp sliding window is commonly used, but for paired-end reads I used the FLASH software to assemble the contigs, and I am not sure which QC strategy to apply. With sliding-window trimming, reads are truncated at the end of the last window before the average quality score falls below the threshold, even if downstream windows would rise back above the threshold. Unfortunately, about half of my reads were truncated too short. So I used the FASTX strategy with -p 60 -q 20, and few reads were trimmed. Is that not strict enough? Any suggestions? Thanks.

quality-control 16s
9.2 years ago

Use FLASH first and do quality trimming after that. FLASH uses quality scores when merging reads, and the joined read will have better quality after merging.
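
As a rough sketch (the read file names, the -o prefix, and the -M maximum overlap are just placeholders to adapt to your amplicon):

flash -M 300 -o merged R1.fastq R2.fastq    # merged pairs end up in merged.extendedFrags.fastq
# ...then run the quality trimmer of your choice on merged.extendedFrags.fastq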


Thanks for the reply. I did just as you say. What confuses me is which QC strategy should be applied after FLASH.

The 50 bp sliding window and FASTX with -p 60 -q 20 both seem unsuitable.


I see. Sorry, I didn't get that from your post. I had that problem with FASTX before and I'm not sure what causes it. It seems other people have the same problem: https://biostar.usegalaxy.org/p/7715/ In the end I used a quality trimmer built into Pipeline Pilot, but I haven't found a free analog for it. Have you tried other trimmers, like Trimmomatic?
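
For example, a single-end Trimmomatic run on the merged reads could look roughly like this (the jar path, window size, and thresholds are only placeholders):

java -jar trimmomatic-0.36.jar SE -phred33 out.extendedFrags.fastq out.trimmed.fastq SLIDINGWINDOW:4:15 MINLEN:200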

9.2 years ago

This is a good reason to not use arbitrary-sized sliding windows. BBDuk's quality-trimming gives optimal output for a given quality threshold and does not rely on specific window sizes; rather, trimq=X guarantees that the result will be the largest subsequence with average quality of at least X such that extending in either direction would add a subsequence with average quality below X.
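
For example (file names are placeholders; qtrim=rl trims both ends and trimq sets the threshold described above):

bbduk.sh in=out.extendedFrags.fastq out=out.trimmed.fastq qtrim=rl trimq=15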

I do not recommend fastx anyway as it defaults to the wrong quality encoding and is incapable of processing paired reads together, even aside from the fact that it is slow and uses non-optimal algorithms. Trimmomatic also relies on windows, and is also slow, so I don't recommend that either (though at least it processes pairs together).

If you merge reads with BBMerge, though, I do not recommend trimming first; it performs trimming internally only if needed. Trimming first can reduce merge rate by eliminating the overlapping parts of reads.
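
For example, merging the untrimmed pairs directly (file names are placeholders):

bbmerge.sh in1=R1.fastq in2=R2.fastq out=merged.fastq outu1=unmerged_R1.fastq outu2=unmerged_R2.fastq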


BBDuk is really excellent work. But since reads have lower-quality bases at the end, after merging the lowest-quality region would sit in the middle of the merged read, and any kind of sliding window would probably cut the read in the middle. So merging would be less useful. Is that right?

I find that many people use only the R1 reads, so the information contained in the R2 reads is lost.

That is why I prefer running FASTX on the whole merged reads, but just as Marina and Brian said, it also has some problems.

I wonder if there is a better choice.


20 is an extremely high threshold for quality trimming; too high for most purposes. And after merging, the overlapping bases will have their quality scores increased anyway if they match, to reflect the fact that 2 independent observations were made of the same base. Also, trimming tools in general don't trim middle bases and break sequences apart - they trim at the ends. For sliding windows, they generally start at one end, trim until the average inside the window is above some value, then stop.

Are you sure you need to do trimming? What are you doing with the data after you have trimmed and/or merged it?


Thanks for your patience and kindness. I added an answer rather than a reply because I can't click the "ADD REPLY" button. Maybe I'm missing something obvious.

Trimming is not strictly necessary for me; I just want to do whatever is needed so that the OTUs are generated correctly.

After merging with FLASH (default parameters), I used the split_libraries script in QIIME with the recommended parameter -q 19 and found that about half of the reads were truncated to shorter than 200 bp.
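
The call was roughly like this, assuming split_libraries_fastq.py (the mapping file and sample id are placeholders, since my data were already demultiplexed):

split_libraries_fastq.py -i out.extendedFrags.fastq -o slout/ -m map.txt --sample_ids sample1 --barcode_type 'not-barcoded' -q 19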

So I switched to FASTX with -p 60 -q 20, and few reads were filtered out. After chimera filtering and OTU clustering with the default pick_otus function in QIIME, I ended up with 1,121,840 OTUs, from about 4,188,862 reads before chimera filtering. Is that too many?

Moreover, I used unique.seqs (mothur) for dereplication and got 4,019,141 unique reads. I have dealt with some 454 data in mothur before, but I am not sure whether this is normal since the data come from a different platform.
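
That dereplication was just the standard mothur one-liner (the input name here is a placeholder for my chimera-filtered FASTA):

mothur "#unique.seqs(fasta=out.pick.fasta)"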

So I wonder whether my merging and QC steps are right, and I am asking for help here.


Well, in my opinion, the quality of your reads is very important for OTU picking. You don't want to pick a wrong OTU, or miss one, because of the quality of your data.

1,121,840 OTUs is quite a lot, but I don't know the nature of your data.

Are you combining the QIIME and mothur pipelines or comparing them?

4,019,141 unique reads out of 4,188,862 does look suspicious. I was getting a much lower number of unique reads with 16S MiSeq data. But again, I have no idea about the nature of your data.


My samples were collected from potting soil. I also think it looks really suspicious.

I am combining the QIIME and mothur pipelines because mothur struggles to handle this many reads.

Here is what I did:

flash -M 300 R1.fastq R2.fastq  # merge read pairs; merged reads go to out.extendedFrags.fastq
less out.extendedFrags.fastq  # inspect the merged reads
fastx/fastx_barcode_splitter.pl --bcfile forward_primer --bol --mismatches 2 --prefix p1. --suffix .fastq < out.extendedFrags.fastq  # the data were already split by barcode; this only checks the forward primer (the script reads from stdin)
fastx/fastq_quality_filter -i out.extendedFrags.fastq -q 20 -p 60 -l 200 -o out.fastq -Q 33  # drop reads where fewer than 60% of bases have Q>=20, keeping reads of at least 200 bp
fastx/fastq_to_fasta -i out.fastq -o out.fasta  # convert to FASTA for mothur
mothur "#chimera.uchime(fasta=out.fasta,reference=gold.fa,processors=20)"  # flag chimeras against the gold reference
mothur "#remove.seqs(fasta=out.fasta,accnos=meta.unique.uchime.accnos)"  # remove flagged sequences (the accnos file name should match the chimera.uchime output for out.fasta)
pick_otus.py -i seqs.fna -o picked_otus_default  # de novo OTU picking in QIIME (seqs.fna should be the chimera-filtered FASTA from the previous step)

Any problem?


pick_otus.py with the parameters you chose gives you just a file with your clustered reads, not OTUs. If you want to use a reference, either run pick_otus.py -i seqs.fna -r refseqs.fasta -m uclust_ref, or use the pick_closed_reference_otus.py script (I would recommend the second, as it gives you the biom file and avoids all the pain of converting files or running multiple scripts).
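
A minimal sketch of the closed-reference route (the reference and taxonomy paths are only illustrative, e.g. the Greengenes files distributed with QIIME):

pick_closed_reference_otus.py -i seqs.fna -r gg_13_8_otus/rep_set/97_otus.fasta -t gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -o closed_ref_otus/

This should write an otu_table.biom into the output directory, ready for the downstream QIIME scripts.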

