Question: Why does this plot of human GC content have a peak at around 60%?

I'm new to RNA-Seq and have just run FastQC on my dataset. On the plots of GC content, all of the samples have a peak at around 60%, as shown here: http://i.imgur.com/YReFOV7.png

I've blasted a few of the most overrepresented sequences and each one hits multiple genes of multiple mammalian species with 100% identity. Each one hits the human signal recognition particle RNA (SRP 7SL), but also hits predicted targets in other mammals. Here's an example sequence:

GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG
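
As a quick sanity check on the example above, the GC content of a single read can be computed by hand. The sketch below is only an illustration; the helper name `gc_fraction` is hypothetical and is not part of FastQC. It shows that this particular overrepresented sequence sits at the same GC value as the peak.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


# The overrepresented sequence quoted in the question.
example = "GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG"

print(len(example))                     # 50
print(f"{gc_fraction(example):.0%}")    # 60%
```

A single read proves nothing about the whole distribution, but a highly abundant sequence at 60% GC would be one plausible contributor to a peak at that position.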

Can anyone suggest what could be causing this? As I say, I'm new to RNA-Seq, so it could be a beginner's misunderstanding. I haven't touched the data in any way (no trimming or any other quality cut-offs); the reads were run directly through FastQC. As far as I can tell, the main quality measures (Per base sequence quality, Per sequence quality scores) are good, though several of the others (Per base sequence content, Adapter content, and Kmer content) show red flags.

In case it's useful, these are paired-end reads from an Illumina TruSeq Total RNA library.

Thank you for any help.

willj, 4.2 years ago (updated 4.2 years ago by dariober)

Update: so I've tried trimming adapters but the GC peak is still there...

willj, 4.2 years ago

The same happened to me with this overrepresented sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC. In my case it was ChIP-seq data from lab mouse models. I trimmed, and the QC report just got worse (the GC content plot barely changed). I have now BLASTed the sequence and it shows a 93% match with Staphylococcus phage Andhra, but it also appears in the adapter catalog. Based on the BLAST hit, I could think the sample is contaminated with DNA from that virus (it's a double-stranded DNA virus), but because the sequence is also an adapter, adapter contamination seems the more likely explanation. But if it is an adapter, why doesn't it appear in the "Adapter Content" plot? I would like to see a well-founded explanation of this, because so far I have only read suggestions such as "proceed with the mapping anyway, it probably won't affect much", with no real explanation.
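
One way to examine the adapter explanation is to compare this sequence base by base with the sequence FastQC labelled "TruSeq Adapter, Index 10" elsewhere in this thread. The sketch below is illustrative only; `mismatch_positions` is a hypothetical helper, and reading the differing window as an index (barcode) is an assumption that the thread does not confirm.

```python
def mismatch_positions(a, b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Sequence reported for the ChIP-seq data in this post.
reported = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC"
# Sequence FastQC annotated as "TruSeq Adapter, Index 10" in this thread.
truseq_index_10 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC"

diffs = mismatch_positions(reported, truseq_index_10)
print(diffs)                         # [33, 34, 35, 36, 38]
print(reported[33:39])               # ACAGTG
print(truseq_index_10[33:39])        # TAGCTT
```

The two sequences are identical except inside one 6-base window, which is what one would expect if both are the same adapter carrying different indexes.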

msimmer92, 12 months ago

Adapter contamination might be causing the spike. Try trimming the adapters and running FastQC again.
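
To get a rough idea of how many reads carry adapter sequence before and after trimming, the reads can be scanned directly. The following is a minimal sketch, not a replacement for a real trimmer: it assumes uncompressed FASTQ with exactly four lines per record, and the names `adapter_fraction`, `ADAPTER_PREFIX` and `min_len` are hypothetical. The probe is the shared start of the TruSeq hits quoted in this thread.

```python
import io

# Shared start of the TruSeq adapter hits quoted in this thread.
ADAPTER_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_fraction(fastq_handle, adapter=ADAPTER_PREFIX, min_len=12):
    """Fraction of reads containing the first min_len bases of adapter.

    Assumes four lines per FASTQ record; the sequence is the second line.
    """
    probe = adapter[:min_len].upper()
    total = hits = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # sequence line of each record
            total += 1
            if probe in line.strip().upper():
                hits += 1
    return hits / total if total else 0.0


# Toy input: two made-up reads, one of which runs into the adapter.
toy = io.StringIO(
    "@read1\nACGTACGTGATCGGAAGAGCACACG\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@read2\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(adapter_fraction(toy))  # 0.5
```

For real data, the handle would come from `open(path)` on the FASTQ file. An exact substring match misses adapters with sequencing errors, so the result is a lower bound.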

dariober, 4.2 years ago

In the overrepresented sequences, I do sometimes get a hit on the TruSeq Adapter (below). However, when I BLAST this sequence, it does not give hits similar to those of the other sequences I mentioned above. Anyway, I'll try trimming as you say.

Sequence:        GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC
Count:           232993
Percentage:      0.48295712295995535
Possible source: TruSeq Adapter, Index 10 (100% over 50bp)

willj, 4.2 years ago

Hi, I've now trimmed the adapters and removed low quality reads but the peak is still there.

willj, 4.2 years ago
