Dear community
I have a large set of FASTQ files from genomic DNA. I ran them through FastQC and found that the "Overrepresented sequences" and "Kmer content" modules failed. None of the other modules failed, although "Per tile sequence quality" gave a warning. This pattern was present in almost all of the FASTQ files (>1000 files).
The "overrepresented sequences" module pointed out the presence of TruSeq adapters and Illumina PCR Primer 1.
I ran them through Trimmomatic to remove the adapters. After trimming, the "Overrepresented sequences" module no longer failed, but "Kmer content" failed again, only this time with a different pattern. I also got a new warning for the "Per sequence GC content" module (please see the linked figure).
I have read that this pattern in "Kmer content" before trimming (kmers found at the beginning of the reads) could be due to fragmentation bias.
I worked with the adapter file provided by Trimmomatic (TruSeq3-PE-2.fa).
These are the flags I used for Trimmomatic:
java -jar trimmomatic-0.38.jar PE -phred33 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
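A paired-end Trimmomatic call normally also lists two input files and four output files between the options and the trimming steps; these are omitted in the line above. A minimal sketch of how the full command could be assembled per sample is shown here. The file names and the helper function are hypothetical placeholders, not the ones used for this data set; only the jar name, the adapter file and the trimming steps come from the command above.

```python
import shlex


def trimmomatic_pe_command(sample, jar="trimmomatic-0.38.jar",
                           adapters="TruSeq3-PE-2.fa"):
    """Build a paired-end Trimmomatic command for one sample.

    The input/output file names below are hypothetical placeholders.
    """
    # Two inputs: forward and reverse reads
    inputs = [f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"]
    # Four outputs: paired and unpaired reads for each direction
    outputs = [
        f"{sample}_R1.paired.fastq.gz", f"{sample}_R1.unpaired.fastq.gz",
        f"{sample}_R2.paired.fastq.gz", f"{sample}_R2.unpaired.fastq.gz",
    ]
    # Trimming steps exactly as given in the question
    steps = [
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]
    return ["java", "-jar", jar, "PE", "-phred33", *inputs, *outputs, *steps]


if __name__ == "__main__":
    # Print the command instead of running it (dry run)
    print(shlex.join(trimmomatic_pe_command("sample")))
```

The returned list could be passed to `subprocess.run` in a loop over samples, which may be convenient with more than 1000 files.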
I have two questions:
Are the "Kmer content" and "Per sequence GC content" profiles after trimming something to worry about?
What could be a possible reason for the change in "kmer content" after trimming?
Here you can find the FastQC reports before and after running Trimmomatic:
https://drive.google.com/open?id=1vLY0FsXxnzJYT7d4X1TWZy96cSXu3XGs
https://drive.google.com/open?id=1Tk0GCy_SEz8ZrP2Y_3f_XYs1cnN11ScU
And here is a comparison of "kmer content" and "Per sequence GC content" before and after trimming:
https://drive.google.com/open?id=1YT6zbmKU_3DYlrTX_BLkMBOpGnmqg1Z7
Thank you very much in advance
Failing the k-mer content and GC content modules in FastQC generally has no immediate adverse effect on your analysis. You should proceed with further analysis and see what you get. In recent versions of FastQC, the k-mer module has been turned off by default because it causes more heartache than it is worth.
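If you want to look at the GC shift outside FastQC, a rough sketch like the one below tallies the GC percentage of every read in a FASTQ file, so the files from before and after trimming can be compared directly. The function name is hypothetical, the code assumes plain four-line FASTQ records, and it only approximates what the "Per sequence GC content" module plots; it is not FastQC's own implementation.

```python
import gzip
from collections import Counter


def gc_percent_histogram(fastq_path):
    """Return a Counter mapping rounded GC% (0-100) to the number of reads."""
    # Accept both gzip-compressed and plain-text FASTQ files
    opener = gzip.open if fastq_path.endswith(".gz") else open
    hist = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            # Assumes 4-line records: the sequence is the 2nd line of each
            if i % 4 != 1:
                continue
            seq = line.strip().upper()
            if not seq:
                continue  # skip empty reads to avoid division by zero
            gc = seq.count("G") + seq.count("C")
            hist[round(100 * gc / len(seq))] += 1
    return hist
```

Calling it once on an untrimmed file and once on the matching trimmed file gives two histograms that can be plotted side by side.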