Question: Bimodal GC content
6

Hi

I'm analysing some human CLL data (cancer, whole exome), and when I run FastQC to check the data quality, all samples show a bimodal GC content distribution. Generally the only warning FastQC raises is for the GC module; the other modules normally look good.
I have run fastq_screen against the human genome only: about 80% of reads have a single hit, 18% have multiple hits, and about 0.6% do not map to human, which makes me think that there is no contamination in the samples.

After some thought, I still do not know why the samples show this kind of distribution.

Anyone?

Thanks for your time.

4.0 years ago Folder40g • 120 • updated 23 months ago raulAlc • 10
3

I used to routinely see this with Agilent human exomes, but never had a particularly good explanation for it, other than that there might be some inherent bias in the baits.

4.0 years ago Daniel Swan • 13k
3

Just to back up Dan's answer, I've seen the same in Agilent exomes, and haven't come up with a reasonable explanation. Traditionally, with things like RNA-seq, this bimodal distribution would make me go straight to the possibility of sample contamination, but with exomes it seems systematic more than anything else.

4.0 years ago andrew.j.skelton73 • 5.7k
0

These data were also obtained with an Agilent SureSelect whole-exome library.

4.0 years ago Folder40g • 120
2

Did you ever find a solution for your issue, runnerbio?
I wrote a small tool to drill down into BAM statistics like GC%, to see if your secondary peak is over-represented in certain reads (certain chromosomes, mapping conformations, read flags, fragment lengths, read tags, etc.).
I haven't published it to GitHub yet, but if you would be interested in 'test driving' it to see if it can help you figure out your issue, I'd be more than willing to give some support as you go along :) Here's a video - skip to about min. 9:00 :) https://vimeo.com/123508180
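
For anyone who wants to try this kind of drill-down by hand, here is a minimal sketch of the idea, not the tool mentioned above. It assumes the alignments are available as SAM text lines (for example from `samtools view`) and groups per-read GC% by reference name; the function names and the 60% cut-off are illustrative assumptions.

```python
from collections import defaultdict

def gc_percent(seq):
    """Return the GC percentage of a read, ignoring uncalled (N) bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_by_reference(sam_lines, high_gc=60.0):
    """Map reference name -> (read count, fraction of reads with GC% >= high_gc).

    high_gc is an assumed threshold; set it to where your second peak sits.
    """
    totals = defaultdict(int)
    high = defaultdict(int)
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        rname, seq = fields[2], fields[9]  # SAM columns 3 (RNAME) and 10 (SEQ)
        if seq == "*":                    # sequence not stored
            continue
        totals[rname] += 1
        if gc_percent(seq) >= high_gc:
            high[rname] += 1
    return {rname: (n, high[rname] / n) for rname, n in totals.items()}

# Tiny made-up example, only to show the input and output shapes.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t8M\t*\t0\t0\tGGCCGGCC\tIIIIIIII",
    "r2\t0\tchr1\t50\t60\t8M\t*\t0\t0\tATATATGC\tIIIIIIII",
    "r3\t0\tchr2\t5\t60\t8M\t*\t0\t0\tAATTAATT\tIIIIIIII",
]
summary = gc_by_reference(example)
```

The same grouping could be done on the FLAG column (`fields[1]`) or the template length (`fields[8]`) instead of the reference name, to check whether the high-GC reads cluster in one category.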

4.0 years ago John • 12k
0

No, I haven't found a reason for this behavior. I don't think there is contamination from bacteria or fungi in these samples, nor do I think that sample heterogeneity can cause this (this is exome data; I could imagine heterogeneity producing a bimodal GC content in RNA data). And finally, as two others have said here, it seems to be a general "pattern" for Agilent exomes.

I'll watch the video. I think it may be worth taking a look at your tool to see if it gives an answer to the bimodal GC content in exomes.

4.0 years ago Folder40g • 120
3

Good news everyone! To be honest, we get these strange plots with a bimodal GC distribution in every run. I just finished inspecting one human sample: I intersected the reads in my BAM file with exonic and intronic regions downloaded from UCSC, and it fits perfectly.

This is how it looks in FastQC:

fastqc gc

And this is the same GC plot colored according to genomic location - you can see there are two main peaks, for introns and exons respectively:

gc_genoic_region

So, here is one more possible explanation of bimodal GC content, but it is library-specific. In our lab we use Agilent Focused Exome. Hope this helps!
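
A rough sketch of this kind of breakdown is below, under stated assumptions rather than as the exact procedure used for the plots above. It assumes reads are already reduced to (chrom, start, end, sequence) tuples, that exon and gene intervals are held in plain dictionaries, and that any read inside a gene span but outside every exon counts as intronic. All names are hypothetical.

```python
from collections import Counter, defaultdict

def overlaps(start, end, intervals):
    """True if half-open [start, end) overlaps any (i_start, i_end) interval."""
    return any(start < i_end and i_start < end for i_start, i_end in intervals)

def classify(chrom, start, end, exons, genes):
    """Label a read as 'exon', 'intron' or 'intergenic' (assumed definitions)."""
    if overlaps(start, end, exons.get(chrom, [])):
        return "exon"
    if overlaps(start, end, genes.get(chrom, [])):
        return "intron"
    return "intergenic"

def gc_histograms(reads, exons, genes):
    """Return {region label: {rounded GC percent: read count}}."""
    hist = defaultdict(Counter)
    for chrom, start, end, seq in reads:
        if not seq:
            continue
        s = seq.upper()
        gc = round(100 * (s.count("G") + s.count("C")) / len(s))
        hist[classify(chrom, start, end, exons, genes)][gc] += 1
    return {label: dict(counts) for label, counts in hist.items()}

# Made-up intervals and reads, only to show how the pieces fit together.
exons = {"chr1": [(100, 200)]}
genes = {"chr1": [(50, 500)]}
reads = [
    ("chr1", 120, 130, "GGCCGGCCAT"),   # overlaps the exon
    ("chr1", 300, 310, "ATATATGCAT"),   # inside the gene, outside exons
    ("chr1", 900, 910, "AATTAATTAA"),   # outside any gene
]
hist = gc_histograms(reads, exons, genes)
```

The linear scan over intervals is only practical for toy inputs. On real data, something like `bedtools intersect` or an interval tree would be the more sensible way to do the overlap step, with the per-class histograms then plotted in ggplot or similar.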

3.7 years ago ponizvezdochka • 40
1

While I think that's some great detective work, Liu, this may not be the answer for some people - for example, if I do the same analysis as you on some data which does not have a bimodal peak, I also get the same breakdown as you got for exonic/intronic GC%.

In other words, yes GC% for intronic and exonic DNA is different, but you should still expect to see a normally distributed GC% plot for unbiased/untargeted sequencing when looking at all the reads together.

But it's still very interesting :)

3.7 years ago John • 12k
1

Absolutely agree, John. In my case the library targets exons, but some reads still map to introns.

3.7 years ago ponizvezdochka • 40
1

Ahh, I see - OK, awesome :) Well, it's very interesting then that you only see a few more reads in exons than introns with that assay. Also, your ggplot GC% graph is so much more detailed (for the intron/exon series) than the FastQC one. I really wish FastQC would stop smoothing their graphs.

3.7 years ago John • 12k
2

I wonder what the plot looks like for the off-target reads that are intergenic

3.7 years ago Daniel Swan • 13k
0

Would you mind sharing more details of how you plotted this? I would like to try it out on my samples. Thanks!

12 months ago msimmer92 • 180
2

I don't have particular experience with either human or exome sequencing, but I have come across similar distributions in genome sequencing projects. Among others, I have observed it for a highly repetitive plant. In that case, the second peak corresponded to a specific repeat class that was highly abundant in the data set.

Given your mapping results, I concur that contamination is unlikely. So I would try to figure out which locations of the genome these high-GC reads derive from, and whether you can associate them with some useful annotations. Based on your mappings, you could extract regions of the genome with adequate read coverage, e.g. with bedtools, and then look for entire sequences or large windows of high GC.
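
The window scan suggested above could look roughly like this. It is a sketch that assumes the covered regions have already been extracted as plain sequence strings; the function name, window size, step and GC threshold are all illustrative choices, not recommendations.

```python
def high_gc_windows(seq, window=100, step=50, min_gc=0.60):
    """Return (start, end, gc_fraction) for each window with GC >= min_gc.

    Coordinates are 0-based, half-open, relative to the start of seq.
    A sequence shorter than `window` is treated as a single window.
    """
    seq = seq.upper()
    hits = []
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        if not chunk:
            break
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        if gc >= min_gc:
            hits.append((start, start + len(chunk), gc))
    return hits

# Made-up 60 bp sequence with a GC-only block in the middle.
region = "AT" * 10 + "GC" * 10 + "AT" * 10
hits = high_gc_windows(region, window=20, step=10)
```

Windows flagged this way can then be compared against repeat or gene annotations to see whether the second peak has a common origin.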

4.0 years ago thackl ♦ 2.6k
0

Hello, dear thackl. I was running a de novo RNA-seq experiment on a plant. Similarly, my FastQC GC content result is bimodal. Could you explain more about "the second peak corresponded to specific repeat class"? I think it may be due to the presence of the chloroplast genome; what do you think? Best regards

2.4 years ago eyonesi • 0
1

Hi! I recently stumbled upon this nice little example of a bimodal GC content distribution in a WG-Seq of orange. We suspected possible contamination. Upon blasting some of the reads with high %GC, I got hits that looked like "C.limon DNA for clsat_9 satellite" (satellite DNA). Looking at the citation ( https://link.springer.com/article/10.1007/s001220100719 ), I confirmed that Citrus species are rich in satellite DNA, which has a GC content between 60% and 68%. So that explained our secondary peak. Cool!
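
To pull out a handful of high-GC reads for BLAST, something like the sketch below may be enough. It assumes a plain four-line-per-record FASTQ (no wrapped sequence lines); the helper names, the 60% threshold and the read limit are assumptions for illustration.

```python
import io

def high_gc_reads(fastq_handle, min_gc=0.60, limit=100):
    """Yield (name, sequence) for up to `limit` reads with GC fraction >= min_gc.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    found = 0
    while found < limit:
        header = fastq_handle.readline()
        if not header:                    # end of file
            break
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()           # '+' separator line
        fastq_handle.readline()           # quality line
        if not seq:
            continue
        s = seq.upper()
        gc = (s.count("G") + s.count("C")) / len(s)
        if gc >= min_gc:
            found += 1
            yield header[1:].strip(), seq

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Two made-up reads; only the first passes the GC threshold.
fastq_text = "@r1\nGGCCGGCCAT\n+\nIIIIIIIIII\n@r2\nATATATATGC\n+\nIIIIIIIIII\n"
fasta = to_fasta(high_gc_reads(io.StringIO(fastq_text)))
```

The resulting FASTA text can be pasted into a BLAST search to see what the reads in the second peak actually are.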

GC content in orange

23 months ago raulAlc • 10
0

I don't think that you can necessarily extend the observations made above directly to RNA-seq experiments. Also, I don't really know whether a bimodal GC distribution is something to be concerned about in the first place when looking at RNA-seq. You might need to talk to people more involved with RNA-seq. Sorry.

2.4 years ago thackl ♦ 2.6k
