FastQC To Check The Quality Of High Throughput Sequence
11.9 years ago
Varun Gupta ★ 1.3k

Hi, I saw the FastQC video under the videos section on Biostar. I have a question.

Why is it that in the first 12-13 bases the per base sequence content and per base GC content plots are often quite wavy, even though the per base sequence quality is very good? What can be done to fix this?

Have a look at the images

http://www.freeimagehosting.net/ffniw

http://www.freeimagehosting.net/96lzh

Regards

fastqc illumina rna-seq • 8.5k views

Can you post a plot or the numerical values?

(+1) It definitely helps to see the FastQC plot.

I added the plots. Have a look

Can you also tell if this is RNA-Seq data?

The data is RNA-Seq

11.9 years ago
Ryan Dale 5.0k

With RNA-seq, this can happen due to biases in random hexamer priming during the reverse transcription (RT) step (explaining the first 6 bases), possibly combined with sequence specificity of the polymerase itself and/or artifacts from end repair (possibly explaining the effect out to 13 bases).

Check out Hansen et al. (2010), "Biases in Illumina transcriptome sequencing caused by random hexamer priming", NAR 38(12):e131, for more info as well as some ideas on how to correct for it.
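
To see the effect as numbers rather than a plot, here is a minimal Python sketch that tabulates the fraction of A/C/G/T at each of the first few read positions, which is roughly what the per base sequence content plot summarises. It is not FastQC's own code. The function names `per_base_content` and `read_fastq_sequences` are made up for illustration, and the reader assumes an uncompressed FASTQ with exactly four lines per record.

```python
from collections import Counter


def per_base_content(reads, n_positions=15):
    """Return one dict per position with the fraction of A, C, G and T.

    `reads` is any iterable of sequence strings. Bases other than
    A/C/G/T (e.g. N) are ignored when computing the fractions.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


def read_fastq_sequences(path):
    """Yield sequences from an uncompressed, 4-lines-per-record FASTQ file.

    Hypothetical helper: the file layout is an assumption of this sketch.
    """
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


if __name__ == "__main__":
    # Toy reads, only to show the output format.
    toy = ["ACGTAC", "ACGGAC", "TCGTAA", "ACGTAC"]
    for pos, freqs in enumerate(per_base_content(toy, n_positions=6), start=1):
        print(pos, " ".join(f"{b}={freqs[b]:.2f}" for b in "ACGT"))
```

On real RNA-seq data you would pass `read_fastq_sequences("reads.fastq")` (the file name is a placeholder) and compare the first ~13 positions with later ones. If the priming bias is what you are seeing, the fractions should settle down after that point.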

I read the publication, but it is still not clear to me whether it would be better to remove those 13 bp at the 5' end.

What's the best practice?

I think the assumption is that for standard differential expression, any sequence bias in a gene is the same between samples, so it's not a problem. However, it is a problem for estimating expression within a single sample (e.g. FPKM), since the transcripts being compared to each other may have different biases.

Luckily, Cufflinks includes bias correction for this (e.g., http://genomebiology.com/2011/12/3/r22/).

11.9 years ago

Hey Varun,

Have you checked whether those first few bases belong to an adaptor/barcode sequence? If such sequences are left untrimmed, they can produce the pattern you describe. I may be completely wrong, but go through the FastQC report, and if those sequences show up in the Overrepresented sequences section, then you need to trim them off.
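
As a quick check alongside the FastQC report, a small Python sketch like the one below counts the most common read-start prefixes. The function name `top_prefixes` and the default length of 13 are only illustrative choices, not anything FastQC provides. My assumption is that a leftover adaptor or barcode would show up as one or a few prefixes shared by a large fraction of reads, whereas priming bias skews the base composition without producing a single dominant prefix.

```python
from collections import Counter


def top_prefixes(reads, prefix_len=13, top_n=5):
    """Return the most common read-start prefixes and their share of reads.

    `reads` is any iterable of sequence strings; reads shorter than
    `prefix_len` are skipped. Returns a list of (prefix, fraction) tuples.
    """
    prefixes = Counter(
        seq[:prefix_len].upper() for seq in reads if len(seq) >= prefix_len
    )
    total = sum(prefixes.values())
    return [(p, n / total) for p, n in prefixes.most_common(top_n)]


if __name__ == "__main__":
    # Toy reads: three of four share the same 4-base start.
    toy = ["AAAACCCC", "AAAAGGGG", "AAAATTTT", "CCCCAAAA"]
    for prefix, fraction in top_prefixes(toy, prefix_len=4, top_n=2):
        print(prefix, f"{fraction:.2f}")
```

What counts as "a large fraction" depends on the library, so treat the output as a hint for where to look, not as a threshold test.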

11.9 years ago

The origin of the sample also matters. If the sample preparation isolates certain parts of a genome, as in a ChIP-seq experiment, we could expect that to be reflected in the sequence content of the reads.

11.9 years ago
T ▴ 40

If you have Illumina sequencing data, this is a bias from the random primers used in the library preparation and is therefore expected.
