Validate QIIME pipeline for already-demultiplexed 16S rDNA amplicon data (Illumina MiSeq)

Hello QIIMERS,

Recently, I have been using QIIME to analyze 16S rRNA gene amplicons from environmental seawater samples (9 in total), sequenced on the Illumina MiSeq platform.

I used paired-end sequencing to target the V4-V6 (partial) hypervariable regions of the 16S rDNA with the universal primers 515YF-Y906R-jed. The average amplicon size without primers is ≈370 bp, and I have approximately 2 million reads in total across the 9 samples.

Since I'm a newbie in bioinformatics, I would like to validate all the steps that I followed here on the forum. I also have a bunch of questions about each step.

So, I'll try to explain each step I followed, together with the questions related to each one.

I followed this pipeline: Illumina Overview Tutorial: Moving Pictures of the Human Microbiome (http://nbviewer.jupyter.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/illumina_overview_tutorial.ipynb).

To start, I received my 16S data in FASTQ format, both multiplexed and demultiplexed.

The main difficulty I had is related to the barcodes used by the sequencing company on the Illumina MiSeq platform: the barcodes have degenerate bases and variable lengths.

If I understood correctly, “split_libraries_fastq.py” performs the demultiplexing, primer removal, and quality filtering steps. However, to identify and remove the barcodes, it is necessary to provide their exact sequences in the mapping file, or to extract them with the “extract_barcodes.py” script using their exact lengths. Since I know neither the exact barcode sequences (they are degenerate) nor their exact lengths (they are variable), I cannot use this script directly, at least not as the first script. I asked the sequencing company's support team whether it was possible to use specific barcodes, i.e., barcodes with defined nucleotide sequences, or at least of the same length, and they answered: “the Illumina sequencing protocols rely on diversity of the clusters, especially for cluster identification in the early cycles. Simply put, if everything looks identical, the sequencers don't know where one cluster starts and where the other one ends, resulting in much lower output and worse quality. For amplicon projects (where all sequences are quite similar) this can be a big problem and that's why we use barcodes with variable lengths on purpose”.

So, my main obstacle arises from this: I have difficulty building the mapping file and using “split_libraries_fastq.py”.

1st) I decided to use the demultiplexed data and started by removing the primers from both reads (forward and reverse; paired-end sequencing) using the “extract_barcodes.py” script with length as the reference metric. I tried using “split_libraries_fastq.py” directly, but it did not remove the primers (I used the “--barcode_type not-barcoded” option). I also tried “multiple_extract_barcodes.py”, but it raised an error while running that I could not solve.

extract the primers from both the forward and reverse reads, one sample at a time

$ extract_barcodes.py -f dmx_non_comb/nice1_R1.fastq -r dmx_non_comb/nice1_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out1
$ extract_barcodes.py -f dmx_non_comb/nice2_R1.fastq -r dmx_non_comb/nice2_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out2
$ extract_barcodes.py -f dmx_non_comb/nice3_R1.fastq -r dmx_non_comb/nice3_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out3
$ extract_barcodes.py -f dmx_non_comb/nice4_R1.fastq -r dmx_non_comb/nice4_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out4
$ extract_barcodes.py -f dmx_non_comb/nice5_R1.fastq -r dmx_non_comb/nice5_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out5
$ extract_barcodes.py -f dmx_non_comb/nice6_R1.fastq -r dmx_non_comb/nice6_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out6
$ extract_barcodes.py -f dmx_non_comb/nice7_R1.fastq -r dmx_non_comb/nice7_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out7
$ extract_barcodes.py -f dmx_non_comb/nice8_R1.fastq -r dmx_non_comb/nice8_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out8
$ extract_barcodes.py -f dmx_non_comb/nice9_R1.fastq -r dmx_non_comb/nice9_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out9
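Since the nine commands differ only in the sample number, they can be collapsed into a single bash loop. This is just a convenience sketch, assuming the files really are named nice1 through nice9 with the _R1/_R2 suffixes shown above:

$ for i in $(seq 1 9); do extract_barcodes.py -f dmx_non_comb/nice${i}_R1.fastq -r dmx_non_comb/nice${i}_R2.fastq -c barcode_paired_end -m map.txt --attempt_read_reorientation --bc1_len 19 --bc2_len 20 -o pr_out${i}; done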

2nd) I joined the forward and reverse reads using the “multiple_join_paired_ends.py” script.

join the paired-end Illumina reads for all samples at once (input: the demultiplexed, primer-trimmed reads)

the output is the same for each sample: fastqjoin.join.fastq (joined reads); fastqjoin.un1.fastq (unjoined forward reads); fastqjoin.un2.fastq (unjoined reverse reads)

$ multiple_join_paired_ends.py -i pr_output -o jpe --read1_indicator "_R1." --read2_indicator "_R2."
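To get a rough idea of how many read pairs were successfully joined per sample, the FASTQ records (4 lines each) in the joined and unjoined files can be counted. The subdirectory name below is a placeholder; adjust it to however “multiple_join_paired_ends.py” named the per-sample output folders on your run:

$ echo $(( $(wc -l < jpe/nice1_R1/fastqjoin.join.fastq) / 4 )) joined, $(( $(wc -l < jpe/nice1_R1/fastqjoin.un1.fastq) / 4 )) unjoined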

3rd) I performed the quality filtering step with the “split_libraries_fastq.py” script, using a mapping file with an empty “BarcodeSequence” column (this is the same file, map.txt, described above, but with a different name) and the option “--barcode_type not-barcoded”.

rename each joined fastq file to “nice1.fastq”, and so on for all 9 samples

quality filter the sequences (the script can also demultiplex and remove barcodes, but the data here are already demultiplexed)

-r (--max_bad_run_length, default 3) and -p (--min_per_read_length_fraction, default 0.75) are left at their defaults; -q (--phred_quality_threshold) is set to 20

$ split_libraries_fastq.py -i jpe/nice1.fastq,jpe/nice2.fastq,jpe/nice3.fastq,jpe/nice4.fastq,jpe/nice5.fastq,jpe/nice6.fastq,jpe/nice7.fastq,jpe/nice8.fastq,jpe/nice9.fastq --sample_ids nice1,nice2,nice3,nice4,nice5,nice6,nice7,nice8,nice9 -o split_libraries_output/ -m mapping.txt -q 20 --barcode_type "not-barcoded"
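As a sanity check on this step, the number of reads retained per sample after quality filtering is written to the log, and the total can be confirmed with count_seqs.py:

$ cat split_libraries_output/split_library_log.txt
$ count_seqs.py -i split_libraries_output/seqs.fna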

4th) I identified the chimeric sequences using only the de novo strategy, via usearch.

Here I have some doubts. Which reference database should I use? In the script below I opted only for the de novo strategy, but if I want to use both, which reference database should I use? If I understand the description in the paper published by the QIIME developers (“Advancing our understanding of the human microbiome using QIIME”) correctly, I should use a database of chimeric sequences previously identified in other studies. But a colleague of a member of our lab used the Greengenes database clustered at 97% similarity as the chimera reference database, and I don't understand why. I mean, if a sequence is neither flagged as a de novo chimera nor matches a reference sequence in the database, is it excluded from our dataset? That doesn't make sense to me, because such a sequence could be a new, previously unidentified species.

Another issue related to this step is the time required to complete it. I have also been using MOTHUR, because I've been trying to compare the two and weigh the advantages and disadvantages of each. In MOTHUR, this step with the same algorithm (usearch) takes hours (maybe more than a day on my own laptop, a MacBook Pro 13) to identify chimeric sequences de novo, while in QIIME it takes a few minutes, maybe half an hour, to do the same thing on the same dataset. So I'm not sure whether I'm doing something wrong. I know from previous papers (e.g., http://www.omicsonline.org/open-access/a-comparison-of-three-bioinformatics-pipelines-for-the-analysis-ofpreterm-gut-microbiota-using-16s-rrna-gene-sequencing-data-jpb-1000381.pdf) that QIIME is faster than MOTHUR, but this difference is enormous. I don't know whether the difference is intrinsically related to the fact that QIIME wraps the external algorithms, whereas MOTHUR re-implements them. I also know that the critical step in MOTHUR is “cluster”, where sequences are assigned to OTUs, because it requires reading large distance matrices, which demands considerable computational effort and time.

identify chimeras using usearch (download version 6.1 beta, place the executable in the macqiime path, and make it executable with “chmod 775 /path/to/usearch61”)

$ identify_chimeric_seqs.py -m usearch61 -i split_libraries_output/seqs.fna -o usearch61_chimera_checking/ --suppress_usearch61_ref
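If I later want the combined de novo plus reference strategy, my understanding is that dropping --suppress_usearch61_ref and supplying a reference FASTA with -r is all that changes. The reference path below is a placeholder (e.g., the Greengenes 97% rep set, or a dedicated chimera reference):

$ identify_chimeric_seqs.py -m usearch61 -i split_libraries_output/seqs.fna -r /path/to/reference_seqs.fasta -o usearch61_chimera_checking_ref/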

5th) I removed the chimeric sequences previously identified.

remove chimeric sequences

$ filter_fasta.py -f split_libraries_output/seqs.fna -o chimeras_filtered.fna -s usearch61_chimera_checking/chimeras.txt -n
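As a quick check that the filtering behaved as expected, the difference between the input and output sequence counts should match the number of flagged chimeras (one flagged sequence per line in chimeras.txt, if I read the output format correctly):

$ count_seqs.py -i split_libraries_output/seqs.fna,chimeras_filtered.fna
$ wc -l usearch61_chimera_checking/chimeras.txt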

6th) I picked OTUs with the open-reference method, using usearch as the clustering algorithm (“pick_open_reference_otus.py”).

Here, at one point (not in this script) I attempted to use the SILVA database instead of the Greengenes db, but the results were quite similar. I tried passing the reference and taxonomy on the command line, swapping the Greengenes db directory for the SILVA db, but both times the results were quite similar. So I'm not sure whether I'm doing something wrong or whether the results are usually this similar. Since the Greengenes db I used was released in 2013 and SILVA in 2015, I would expect some differences. For example, one of the most abundant archaeal groups in my dataset, Thaumarchaeota (the phylum containing ammonia-oxidizing archaea representatives), appears under the phylum Crenarchaeota when I use the Greengenes db; I would expect Thaumarchaeota, the recent re-classification of this phylum, when I use the SILVA db, but it always assigns the phylum Crenarchaeota.

pick OTUs with the open-reference method, using usearch as the clustering tool

$ pick_open_reference_otus.py -o pick_or_otus/ -i nc_seqs/chimeras_filtered.fna -m usearch61
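For the SILVA attempt, here is a minimal sketch of how I set it up; the SILVA paths are placeholders for wherever the release is unpacked, and I would appreciate confirmation that pointing both the clustering reference (-r) and the taxonomy assignment (via a parameters file) at SILVA is the right way to do it:

$ printf 'assign_taxonomy:id_to_taxonomy_fp /path/to/silva/taxonomy.txt\nassign_taxonomy:reference_seqs_fp /path/to/silva/rep_set.fasta\n' > silva_params.txt
$ pick_open_reference_otus.py -o pick_or_otus_silva/ -i nc_seqs/chimeras_filtered.fna -m usearch61 -r /path/to/silva/rep_set.fasta -p silva_params.txt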

7th) Finally, I performed the main downstream analyses with the “core_diversity_analyses.py” script. As the sampling depth I chose “-e 62704”, the minimum number of sequences found in any one sample according to the command “biom summarize-table”. I know this value is critical for the statistical analyses, but I didn't want to lose any sample during this analysis. I don't know whether I should have chosen a lower value.

core diversity analysis

$ core_diversity_analyses.py -o cdout/ -i pick_or_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m mapping.txt -t pick_or_otus/rep_set.tre -e 62704

/macqiime/anaconda/lib/python2.7/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:107: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.000462237876835 and the largest is 0.411794395989. RuntimeWarning
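For completeness, the -e value came from the per-sample counts in the BIOM summary (the minimum count across the nine samples):

$ biom summarize-table -i pick_or_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o table_summary.txt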

I would appreciate it if someone with experience in bioinformatics, especially with the QIIME pipeline, could validate the steps I followed and answer the questions raised for each of them.

Thanks in advance. Best wishes, @renh@.

next-gen QIIME 16SrRNA Illumina MiSeq
Daniel ★ 4.0k

I highly recommend asking on the actual QIIME forum; it gets high traffic and the devs spend tonnes of time there, so you're more likely to get good answers.

https://groups.google.com/forum/#!forum/qiime-forum


Thank you Daniel.

I already did, but I haven't received any answer yet.

Best wishes, @renha@
