Justification to open this as a new question:
I have seen that this was mentioned in some questions, but it didn't get an actual answer beyond a specific fix for the person that was asking (such as "maybe in your example you did the previous step, such as merging two samples, wrongly"). Since I see this goes beyond and I think there is a pattern, because it happens in different files in my lab where no previous step was done wrongly, I believe this deserves special attention.
The context/clues:
Different ChIP-seq data sets in my lab, from human and mice, produced at different time points and by different people, have the same strange GC distribution (please, refer to the fastqc reports in the link below).
Here you have a fastQC report of my sample, before and after trimming, for you to see the GC distribution plot. (I share the whole report in case someone wants to check out the other information) https://drive.google.com/drive/folders/1i_cQmwlSpFLgFepoNm2NayC15Zigm9cY?usp=sharing
In the overrepresented sequences, I get the following sequence:
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC
which FastQC recognizes as a TruSeq Adapter (Index 8 (100% over 50bp)) (supposedly, all those samples already had their adapters removed). I took one of the samples and did the removal of that sequence using cutadapt:
cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC -m 5 -o aftertrimming.fastq.gz beforetrimming.fastq.gz
(Note: when I don´t include the flag -m 5, "keep reads with minimum length 5", there are a bunch of Cs and spaces in the report. So I also did it with the -m5 flag, and that problem went away (because I removed the sequences, logically)).
The specific problem/question:
When I open the report, the hump is even a bit more pronounced. I have done this for other samples and the same happens. Does someone have a knowledgeable explanation of what is going on here and why is it so pronounced? One of the things that I also don´t get is the following: if it was, indeed, adapter contamination, why removing the adapter doesn´t solve the problem and give me a good report output? Then, I thought "ok, maybe there is contamination". The problem is that, when you see the overrepresented sequences, there are no overrepresented sequences for me to blast and find out if it belongs to some microorganism in particular. Therefore, all these results seem very confusing when put together.
Thank you for your time!
IMAGE DESCRIPTIONS The first one is the GC content distribution, from MultiQC, of a group of samples from human neurons (H3K4me3 ChIP-seq); look at the strange peak (which looks more like adapter contamination). For the second plot, I took the "worst" sample from the first plot, trimmed adapters and got this second report; this is to show that, even removing adapter content, this hump (with a different look) remains. The third figure is a MultiQC report showing the same plot for several ChIP-seq samples from mice neurons (H4K5 and H4K12); the idea is to show that even in an unrelated set of samples, of even another organism, the hump is also seen.
msimmer92 : Files do not appear to be properly shared via google drive/may want to convert into PDF's and share. There are past examples of threads about bi-modal GC distribution (e.g. Bimodal GC content and others) that you may want to look through.
Oh, ok, I put them in PDF now. Now you should be able to see them. In regards to the second part of your comment, yes, I saw that post.. and there are some interesting examples, but none of them truly explains the hump of this type. The one that got closer to my case (because of the shape of the distribution) was ponizvezdochka but, as John pointed out, the simple difference between exonic and intronic counts shouldn´t be the reason for an abnormal distribution, since supposedly this is taken into account (in the plot); you should see a normal distribution when all are put together (in her case, she was only working with exon data, that´s why it was insightful for her to see the difference between intronic and exonic counts; in my case is H3K4me3 ChIP-seq in human neurons). Also, my hump is on the other side (around 60 %GC content).
You appear to have a bunch of primer dimers. What sort of alignment rates are you getting with this data? A second hump is often expected in ChIP-seq data, but I wouldn't expect it regardless of what was being ChIPed, as you indicate is the case.
This sample is part of a group of replicates. They have an average of 98% mapping and this sample has 87.55% (this is without further adapter removal, with the samples as I got them.. and they have the bad Fastqc report with the hump I talked about). That sample, the one I report, is the one that has the worst hump (because of this, I removed the sequence and posted it here in case there was a second problem I was ignoring). Here you have the Bowtie2 output for that sample´s mapping if you want further details: 31117061 reads; of these: 31117061 (100.00%) were unpaired; of these: 3874509 (12.45%) aligned 0 times 20856966 (67.03%) aligned exactly 1 time 6385586 (20.52%) aligned >1 times. 87.55% overall alignment rate. [bam_sort_core] merging from 0 files and 32 in-memory blocks... Done.
Sorry, I want to clarify this: you mentioned a second hump is expected in ChIP-seq data? I did not know that, that´s why I was hesitant when I saw that the data did not fit to the gaussian distribution. None of the sources I read mentioned this; do you have a good source from where to read this? Because a lot of articles or manuals mention general ideas or applied things to other data types, such as RNA-seq that seems to be more popular than ChIP-seq. (As an example: in the Biostars Handbook First Edition, GC content plot is the only one that is not explained in the QC chapter, nor the ChIP-seq chapter).
I guess we should edit the handbook to mention this :)
The reason you tend to see a second bump in ChIP-seq is that a lot of marks that people are IPing are enriched in the promoter region, which will have a different GC distribution than the rest of the genome (it's usually higher), so you have a bump for the background genomic GC distribution and then a second one for where your mark/protein is found. What you don't expect is to see the same bump if you ChIP a broader mark or one not near promoters. Do you see a similar second bump in your inputs?
BTW, are you at the DZNE? I did my post-doc there.
Oh, I never considered that! That makes a lot of sense. Here you have the figures for the MultiQC report (before I did adapter removal, this is how I got them) of the inputs that correspond to the files I posted before. The first one is for the human H3K4me3 neuronal data, and the second one the H4K5 and H4K12 mice neuronal data. Even thought they look a bit weird, as if there is another problem, that specific hump is not seen anymore. I didn´t notice until now that there was a hump difference between inputs and chips. Thank you very much! I will think of how to confirm that that hump indeed corresponds to the promoters in my samples. I think it would be cool and interesting.
Regarding the other comments: That´s great! yes, I´m in DZNE Göttingen finishing my master thesis and will start my PhD in April :). My lab is not actively having a problem with this samples, but I am curious and want to learn as much as I can about what happens behind the scenes in these analysis (also to make them as pristine as possible). I noticed these humps and the lack of info about it in the resources and wanted to report it. By the way, thank you very much for the Biostars Handbook. I started bioinformatics before it went out, and I can compare the before and after: it´s live-saving. Keep up the great work!
I expect the spikes in the first MultiQC plot are from adapter dimers, they tend to look like that. For looking at promoter enrichment have a look at plotEnrichment or plotProfile from deepTools.