Justification to open this as a new question:
I have seen this mentioned in some questions, but it never got an actual answer beyond a fix specific to the person asking (such as "maybe in your example you did the previous step, such as merging two samples, wrongly"). I think there is a more general pattern here, because it happens in different files in my lab where no previous step was done wrongly, so I believe it deserves special attention.
Different ChIP-seq data sets in my lab, from human and mouse, produced at different time points and by different people, show the same strange GC distribution (please refer to the FastQC reports in the link below).
Here is a FastQC report of my sample, before and after trimming, so you can see the GC distribution plot.
(I share the whole report in case someone wants to check out the other information)
In the overrepresented sequences, I get the following sequence:
which FastQC identifies as a TruSeq Adapter, Index 8 (100% over 50bp), even though all of these samples had supposedly already had their adapters removed. I took one of the samples and removed that sequence using cutadapt:
cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC -m 5 -o aftertrimming.fastq.gz beforetrimming.fastq.gz
(Note: when I don't include the flag -m 5, "keep reads with a minimum length of 5", there are a bunch of Cs and spaces in the report. With the -m 5 flag that problem goes away, logically, because those very short reads are removed.)
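To double-check that the trimming actually did something, one could count how many reads still contain the start of the adapter before and after cutadapt. This is only a rough sketch in Python: the function name and the 12-base seed length are my own choices, and it looks for exact matches only, whereas cutadapt tolerates mismatches, so the numbers will not be identical to what cutadapt reports.

```python
# Adapter sequence taken from the cutadapt command above.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC"


def count_adapter_hits(seqs, adapter=ADAPTER, seed_length=12):
    """Count reads containing the first `seed_length` bases of the adapter.

    Hypothetical helper: exact substring matching only, so this is a
    lower bound compared with an error-tolerant trimmer such as cutadapt.
    Returns (reads_with_adapter_seed, total_reads).
    """
    seed = adapter[:seed_length]
    hits = 0
    total = 0
    for seq in seqs:
        total += 1
        if seed in seq.upper():
            hits += 1
    return hits, total


if __name__ == "__main__":
    # Toy reads: one carries the adapter start, one is clean, one is empty
    # (empty reads are what trimming leaves behind without a minimum length).
    toy_reads = ["ACGT" + ADAPTER[:20], "ACGTACGTACGT", ""]
    print(count_adapter_hits(toy_reads))
```

Running it on the read sequences of the file before trimming and of the file after trimming should show whether any adapter signal is left.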
The specific problem/question:
When I open the report after trimming, the hump is even a bit more pronounced. I have done this for other samples and the same happens. Does someone have a knowledgeable explanation of what is going on here and why it is so pronounced?
One of the things that I also don't get is the following: if it was indeed adapter contamination, why doesn't removing the adapter solve the problem and give me a good report? Then I thought "ok, maybe there is contamination". The problem is that the overrepresented sequences module lists no sequences for me to BLAST to find out whether they belong to some microorganism in particular. Therefore, all these results seem very confusing when put together.
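Since there is no overrepresented sequence to BLAST, one alternative would be to pull out the reads whose GC content falls inside the hump and BLAST a handful of those. Below is a minimal Python sketch of that idea. All function names are hypothetical, the parser assumes plain four-line FASTQ records, and the GC percentage is only an approximation of what FastQC plots (its exact binning may differ).

```python
import gzip
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of a read, ignoring N bases (None if no called bases)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def read_fastq(path):
    """Yield (header, sequence) pairs; assumes strict four-line FASTQ records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield header, seq


def gc_histogram(records):
    """Count reads per rounded GC percentage."""
    hist = Counter()
    for _, seq in records:
        gc = gc_percent(seq)
        if gc is not None:
            hist[round(gc)] += 1
    return hist


def reads_in_gc_window(records, low, high):
    """Yield reads whose GC percentage lies within [low, high]."""
    for header, seq in records:
        gc = gc_percent(seq)
        if gc is not None and low <= gc <= high:
            yield header, seq


if __name__ == "__main__":
    toy = [("@r1", "ATGC"), ("@r2", "GGCC"), ("@r3", "GGCA")]
    print(gc_histogram(toy))
    print(list(reads_in_gc_window(toy, 70, 80)))
```

With a real file, something like reads_in_gc_window(read_fastq("aftertrimming.fastq.gz"), low, high) would give the reads in the hump, where low and high are read off the FastQC plot.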
Thank you for your time!
The first one is the GC content distribution, from MultiQC, of a group of samples from human neurons (H3K4me3 ChIP-seq); look at the strange peak (which looks more like adapter contamination).
For the second plot, I took the "worst" sample from the first plot, trimmed the adapters and got this second report; this is to show that, even after removing adapter content, this hump (with a different look) remains.
The third figure is a MultiQC report showing the same plot for several ChIP-seq samples from mouse neurons (H4K5 and H4K12); the idea is to show that even in an unrelated set of samples, from another organism, the hump is also seen.