Unexpected hump in GC content distribution from FasQC report in ChIP-seq
0
1
Entering edit mode
5.1 years ago
msimmer92 ▴ 300

Justification to open this as a new question:

I have seen that this was mentioned in some questions, but it didn't get an actual answer beyond a specific fix for the person that was asking (such as "maybe in your example you did the previous step, such as merging two samples, wrongly"). Since I see this goes beyond and I think there is a pattern, because it happens in different files in my lab where no previous step was done wrongly, I believe this deserves special attention.

The context/clues:

Different ChIP-seq data sets in my lab, from human and mice, produced at different time points and by different people, have the same strange GC distribution (please, refer to the fastqc reports in the link below).

Here you have a fastQC report of my sample, before and after trimming, for you to see the GC distribution plot. (I share the whole report in case someone wants to check out the other information) https://drive.google.com/drive/folders/1i_cQmwlSpFLgFepoNm2NayC15Zigm9cY?usp=sharing

In the overrepresented sequences, I get the following sequence:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC

which FastQC recognizes as a TruSeq Adapter (Index 8 (100% over 50bp)) (supposedly, all those samples already had their adapters removed). I took one of the samples and did the removal of that sequence using cutadapt:

cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC -m 5 -o aftertrimming.fastq.gz beforetrimming.fastq.gz

(Note: when I don´t include the flag -m 5, "keep reads with minimum length 5", there are a bunch of Cs and spaces in the report. So I also did it with the -m5 flag, and that problem went away (because I removed the sequences, logically)).

The specific problem/question:

When I open the report, the hump is even a bit more pronounced. I have done this for other samples and the same happens. Does someone have a knowledgeable explanation of what is going on here and why is it so pronounced? One of the things that I also don´t get is the following: if it was, indeed, adapter contamination, why removing the adapter doesn´t solve the problem and give me a good report output? Then, I thought "ok, maybe there is contamination". The problem is that, when you see the overrepresented sequences, there are no overrepresented sequences for me to blast and find out if it belongs to some microorganism in particular. Therefore, all these results seem very confusing when put together.

Thank you for your time!

IMAGE DESCRIPTIONS The first one is the GC content distribution, from MultiQC, of a group of samples from human neurons (H3K4me3 ChIP-seq); look at the strange peak (which looks more like adapter contamination). For the second plot, I took the "worst" sample from the first plot, trimmed adapters and got this second report; this is to show that, even removing adapter content, this hump (with a different look) remains. The third figure is a MultiQC report showing the same plot for several ChIP-seq samples from mice neurons (H4K5 and H4K12); the idea is to show that even in an unrelated set of samples, of even another organism, the hump is also seen.

Example of GCcontent MultiQC distribution for huamn neuron samples before adapter removal Example of the GCcontent distribution with hump in human neurons. H3K4me3

Example of the same distribution for mice ChIP-seq data. H4K5 and H4K12 data.

ChIP-Seq fastqc quality • 3.5k views
ADD COMMENT
0
Entering edit mode

msimmer92 : Files do not appear to be properly shared via google drive/may want to convert into PDF's and share. There are past examples of threads about bi-modal GC distribution (e.g. Bimodal GC content and others) that you may want to look through.

ADD REPLY
0
Entering edit mode

Oh, ok, I put them in PDF now. Now you should be able to see them. In regards to the second part of your comment, yes, I saw that post.. and there are some interesting examples, but none of them truly explains the hump of this type. The one that got closer to my case (because of the shape of the distribution) was ponizvezdochka but, as John pointed out, the simple difference between exonic and intronic counts shouldn´t be the reason for an abnormal distribution, since supposedly this is taken into account (in the plot); you should see a normal distribution when all are put together (in her case, she was only working with exon data, that´s why it was insightful for her to see the difference between intronic and exonic counts; in my case is H3K4me3 ChIP-seq in human neurons). Also, my hump is on the other side (around 60 %GC content).

ADD REPLY
0
Entering edit mode

You appear to have a bunch of primer dimers. What sort of alignment rates are you getting with this data? A second hump is often expected in ChIP-seq data, but I wouldn't expect it regardless of what was being ChIPed, as you indicate is the case.

ADD REPLY
0
Entering edit mode

This sample is part of a group of replicates. They have an average of 98% mapping and this sample has 87.55% (this is without further adapter removal, with the samples as I got them.. and they have the bad Fastqc report with the hump I talked about). That sample, the one I report, is the one that has the worst hump (because of this, I removed the sequence and posted it here in case there was a second problem I was ignoring). Here you have the Bowtie2 output for that sample´s mapping if you want further details: 31117061 reads; of these: 31117061 (100.00%) were unpaired; of these: 3874509 (12.45%) aligned 0 times 20856966 (67.03%) aligned exactly 1 time 6385586 (20.52%) aligned >1 times. 87.55% overall alignment rate. [bam_sort_core] merging from 0 files and 32 in-memory blocks... Done.

ADD REPLY
0
Entering edit mode

Sorry, I want to clarify this: you mentioned a second hump is expected in ChIP-seq data? I did not know that, that´s why I was hesitant when I saw that the data did not fit to the gaussian distribution. None of the sources I read mentioned this; do you have a good source from where to read this? Because a lot of articles or manuals mention general ideas or applied things to other data types, such as RNA-seq that seems to be more popular than ChIP-seq. (As an example: in the Biostars Handbook First Edition, GC content plot is the only one that is not explained in the QC chapter, nor the ChIP-seq chapter).

ADD REPLY
2
Entering edit mode

I guess we should edit the handbook to mention this :)

The reason you tend to see a second bump in ChIP-seq is that a lot of marks that people are IPing are enriched in the promoter region, which will have a different GC distribution than the rest of the genome (it's usually higher), so you have a bump for the background genomic GC distribution and then a second one for where your mark/protein is found. What you don't expect is to see the same bump if you ChIP a broader mark or one not near promoters. Do you see a similar second bump in your inputs?

BTW, are you at the DZNE? I did my post-doc there.

ADD REPLY
0
Entering edit mode

Oh, I never considered that! That makes a lot of sense. Here you have the figures for the MultiQC report (before I did adapter removal, this is how I got them) of the inputs that correspond to the files I posted before. The first one is for the human H3K4me3 neuronal data, and the second one the H4K5 and H4K12 mice neuronal data. Even thought they look a bit weird, as if there is another problem, that specific hump is not seen anymore. I didn´t notice until now that there was a hump difference between inputs and chips. Thank you very much! I will think of how to confirm that that hump indeed corresponds to the promoters in my samples. I think it would be cool and interesting.

Regarding the other comments: That´s great! yes, I´m in DZNE Göttingen finishing my master thesis and will start my PhD in April :). My lab is not actively having a problem with this samples, but I am curious and want to learn as much as I can about what happens behind the scenes in these analysis (also to make them as pristine as possible). I noticed these humps and the lack of info about it in the resources and wanted to report it. By the way, thank you very much for the Biostars Handbook. I started bioinformatics before it went out, and I can compare the before and after: it´s live-saving. Keep up the great work!

First Figure. Input MultiQC report for human H3K4me3 data. No hump is observed.

Second figure. Input MultiQC report for mice H4K5 and H4K12 data. No hump is observed, even though the distribution does not seem totally fine, either.

ADD REPLY
1
Entering edit mode

I expect the spikes in the first MultiQC plot are from adapter dimers, they tend to look like that. For looking at promoter enrichment have a look at plotEnrichment or plotProfile from deepTools.

ADD REPLY

Login before adding your answer.

Traffic: 2018 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6