I have 87 whole-exome-sequenced samples (Agilent SureSelect Exome v7 library, NovaSeq sequencer). The Illumina adapters and short reads (<30 bp) were removed with Cutadapt.
For the FASTQ QC, my only problem is with %GC. Here is the MultiQC result after FastQC:
Even though 8 samples are flagged as bad (in red), the majority of the samples are approximately centered around 50% GC. (I assume that both bumps are due to errors during library preparation or sequencing?)
However, my main concern is what I see after alignment with BWA. I obtained this figure for the GC content:
I have one peak at 70% and another one around 90%, which is really problematic.
HsMetrics showed that approximately 85% of bases aligned on baits (so 15% of bases are off-bait).
When I try to locate these GC-rich reads, they usually fall in intronic or intergenic regions. However, they sometimes fall at the ends of exons, as in this example:
Do you have an idea about how to remove these reads?
Thank you for your help.
Are you aligning to the entire genome?
Yes, the alignment was against the hg19 genome version.
Did you find any solution to this?
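Not a definitive answer, but if the goal is simply to drop the GC-rich reads after alignment, one option is to filter on the per-read GC fraction. Below is a minimal sketch in plain Python that works on SAM text lines. The function names (`gc_fraction`, `filter_sam_by_gc`) and the 0.8 cutoff are my own assumptions for illustration, not something taken from the thread; the cutoff would need to be chosen from the GC plot. Note that this filters each read independently, so it can leave a mate without its partner, and it may also discard genuine reads from GC-rich exons.

```python
def gc_fraction(seq):
    """Return the fraction of G/C among the A/C/G/T bases of seq (N is ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def filter_sam_by_gc(lines, max_gc=0.8):
    """Yield SAM header lines unchanged, plus alignment lines whose
    read sequence has a GC fraction <= max_gc (hypothetical cutoff)."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        seq = fields[9]  # SEQ is the 10th mandatory SAM column
        # "*" means the sequence is not stored; keep such records.
        if seq == "*" or gc_fraction(seq) <= max_gc:
            yield line
```

To use it on a BAM file, the generator can be wired to standard input and output, for example with `sys.stdout.writelines(filter_sam_by_gc(sys.stdin))` in a small script (here called `filter_gc.py`, a made-up name) that is fed from `samtools view -h in.bam` and piped back into `samtools view -b -o out.bam -`.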