Having received a task to assemble phage genome, I and my colleague ran into several problems.
- First, the sequence duplication levels are abnormally high, reaching up to ~90% and ~40% for forward and reverse reads, respectively.
- Second, per tile sequence quality displays a mixture of alarming patterns.
This is per-tile sequence quality for raw R1 reads.
And per base sequence quality for raw R1 reads as well.
This is per-tile sequence quality for raw R2 reads.
And per base sequence quality scores for raw R2 reads
Initially, several possible explanations were born:
1) The cell might've been overloaded;
2) The duplication level might've been too high, hence the 4 distinct low quality bands;
3) Something is wrong with the sequencing platform, hence the long red bands in the per-tile sequence quality report.
After several unsuccessful rounds of fiddling with trimmomatic, I ended up specifying very strict quality control options: HEADCROP:10 SLIDINGWINDOW:3:32 MINLEN:230
. Using these options I ended up with ~22% (~250k out of ~1.1mln) and ~4% (~41k out of ~1.1mln) of initial sequences for forward and reverse reads, respectively. I also specified the Nextera Transposase adapter sequences, because the samples were badly contaminated. Nevertheless, the problem with per tile sequence quality persists (the adapters have been removed, though).
Per tile sequence quality for R1 reads after running trimmomatic
.
Together with the sequence quality scores.
Per tile sequence quality for R2 reads after running trimmomatic
. Red tiles didn't disappear and seemingly random bad quality patterns emerged to the left of the cell.
As well as the sequence quality scores.
Sequencing was performed on a Illumina HiSeq T1500 machine in rapid-run mode producing ~1.2 mln of 250bp paired-end reads.
What do you think might be the cause of such duplication levels, per tile sequence quality patterns and overall data quality?
Was something wrong with the sequencing procedure or the machine?
May this be due to the rapid-run? (We haven't seen anyone using it before).
Should we raise an alarm and contact the sequencing facility or are we being overly cautious?
Edit
Update:
We've just received additional information regarding the run from the sequencing facility:
Our samples make up only ~1.3% of the run.
0.5% of phiX was spiked in.
Cluster density is 1200.
200GB of data were produced during the run (although the upper limit should be 150GB as per machine's specification).
200GB of data were produced by the sequencer in this run.
Thank you for a response.
As far as we can tell there were 4 samples with largely different phage genomes. And they all exhibit these quality issues. We'll clarify if there might've been any other sample processed during the run. We also currently have no clue about phiX, which will hopefully be resolved after we receive an answer from the sequencing facility.
While it is possible that there were issues (software/hardware) during the run that would have to be confirmed by sequence provider. Generally they should not be releasing data from runs with known run problems. Can you post quality score plots for R1/R2? You should move forward with the analysis while you wait to hear from your provider.
Thank you for taking your time to respond. I've updated my post adding score plots for R1 and R2 both before and after running
trimmomatic
.While you did have a spread of Q scores for R2 it does look like the median score stayed high. You seem to have removed most low quality bases during trimming. I would not be greatly concerned about a few red tiles that show up afterwards. How much data did you lose in the trimming step?
I'm left with ~22% of R1 (~250k out of ~1.1mln) and ~4% of R2 reads (~41k out of ~1.1mln).
You should always trim R1/R2 reads together otherwise you will be left with orphan reads (i.e. R1 reads without their R2 mates as you see above) that many programs will have trouble with (assemblers for one). So you will lose a lot of data by the present method. How big is your phage expected to be? 41K reads (R1 and R2 mates) may still be enough.
Yes, we are running trimmomatic in PE mode (i.e. we trim both files simultaneously). We've reported all surviving reads (not just surviving pairs). The genome is presumably 170kbp long, and we expect ~50-70x coverage (given surviving pairs). It is enough for assembly, though it might not be sufficient to safely detect minor contamination (which is expected). Moreover, the real concern is that despite the unrealistically strict trimming parameters the per tile quality issue has not been fully resolved.
I've updated the question providing answers to these questions. Please, take a look.
I think this is a single lane machine (I familiar with other production level HiSeq sequencers) and it does look like the run was overloaded. So that could account for some of the bad quality on R2. I assume you only paid for fractional proportion of the data (which was relatively small here)?
We already addressed the duplication (small genome, large coverage so duplication not unusual). You could talk with the sequencing facility and express concern about the bad Q scores (which are likely due to overloading) which are leading to a loss of a large amount of data post-trimming/clean-up. I am not sure if they would be considerate enough to re-run the sample at no cost. It would depend on your relationship with them. You could always ask for a discount on a re-run.
Otherwise you have your 41K PE reads and you will have to move on with those.
Although it would be an ideal outcome, I'm not sure either, since communication is handled by another department. So yeah, seems like we are to move on with what we have.
Thank you very much for your answers, you've been very helpful!
You are welcome!
Check back and let me know if the BBMap tools, mentioned above, work for your data.