Question

FastQC To Check The Quality Of High-Throughput Sequence Data

3

11.9 years ago

Varun Gupta ★ 1.3k

Hi, I saw the FastQC video under the videos section on Biostar, and I have a question.

Why is it that, in the first 12-13 bases of my reads, the per base sequence content and per base GC content plots are often quite wavy, even though the per base sequence quality is very good? What can be done to fix this?

Have a look at the images:

http://www.freeimagehosting.net/ffniw

http://www.freeimagehosting.net/96lzh

Regards

fastqc illumina rna-seq • 8.5k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 11.9 years ago by Varun Gupta ★ 1.3k

1

Can you post a plot or the numerical values?

ADD REPLY • link 11.9 years ago by JC 13k

0

(+1) It definitely helps to see the FastQC plot.

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

I added the plots. Have a look.

ADD REPLY • link 11.9 years ago by Varun Gupta ★ 1.3k

0

I added the plot. Have a look.

ADD REPLY • link 11.9 years ago by Varun Gupta ★ 1.3k

0

Can you also tell if this is RNA-Seq data?

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

The data is RNA-Seq

ADD REPLY • link 11.9 years ago by Varun Gupta ★ 1.3k

score 5 · Answer 1 · 2012-06-05

5

11.9 years ago

Ryan Dale 5.0k

With RNA-seq, this can happen due to biases in random hexamer priming during the reverse transcription (RT) step (explaining the first 6 bases), possibly combined with sequence specificity of the polymerase itself and/or artifacts from end repair (possibly explaining the bias out to 13 bases).

Check out Hansen et al. (2010), "Biases in Illumina transcriptome sequencing caused by random hexamer priming", NAR 38(12):e131, for more info as well as some ideas on how to correct for it.
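If you want to look at the numbers behind the "per base sequence content" plot yourself, here is a minimal Python sketch. It is not how FastQC computes its report, just a simple tally of A/C/G/T fractions at each of the first few read positions. The function names, the toy reads, and the assumption of an uncompressed FASTQ file with four lines per record are all mine.

```python
from collections import Counter


def per_base_content(reads, n_positions=15):
    """Return, for each of the first n_positions, the fraction of A, C, G and T.

    Bases other than A/C/G/T (e.g. N) count toward the total at that position
    but are not reported.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        total = sum(position_counts.values())
        fractions.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return fractions


def read_fastq_sequences(path):
    """Yield sequences from an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


# Toy reads, made up for illustration only.
toy_reads = ["ACGTAC", "ACGTTT", "TCGAAC", "ACGGAC"]
for position, fractions in enumerate(per_base_content(toy_reads, n_positions=3), start=1):
    print(position, fractions)
```

On real data you would pass `read_fastq_sequences("your_reads.fastq")` (a placeholder path) instead of the toy list. With priming bias, the fractions in the first dozen or so positions fluctuate and then settle to roughly constant values further into the read.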

ADD COMMENT • link 11.9 years ago by Ryan Dale 5.0k

0

I read the publication, and it is still not clear to me whether it would be better to remove those 13 bp at the 5' end.

What's the best practice?

ADD REPLY • link 9.3 years ago by Illinu ▴ 110

0

I think the assumption is that, for standard differential expression, any sequence bias in a gene is the same between samples, so it's not a problem. However, it is a problem for estimating expression within a single sample (e.g., FPKM), since the transcripts being compared to each other may have different biases.

Luckily, Cufflinks includes bias correction for this (e.g., http://genomebiology.com/2011/12/3/r22/)
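To make the between-sample versus within-sample point concrete, here is a small Python sketch with made-up numbers. It uses the usual FPKM definition (reads × 10^9 / (transcript length in bp × total mapped reads)) and a hypothetical multiplicative bias factor of 0.5 on one transcript. It also assumes the total mapped read count is the same in both samples. None of these values come from real data.

```python
def fpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical values for one transcript measured in two samples.
bias = 0.5                 # assumed sequence-specific bias: only half the reads are observed
length_bp = 2000
total_reads = 10_000_000   # assumed identical in both samples for simplicity

true_counts = {"sample_A": 1000, "sample_B": 2000}
observed_counts = {name: count * bias for name, count in true_counts.items()}

true_fpkm = {n: fpkm(c, length_bp, total_reads) for n, c in true_counts.items()}
observed_fpkm = {n: fpkm(c, length_bp, total_reads) for n, c in observed_counts.items()}

# Between samples, the bias cancels: the fold change is unchanged.
print("true fold change:    ", true_fpkm["sample_B"] / true_fpkm["sample_A"])
print("observed fold change:", observed_fpkm["sample_B"] / observed_fpkm["sample_A"])

# Within one sample, the FPKM itself is off by the bias factor.
print("true FPKM (A):    ", true_fpkm["sample_A"])
print("observed FPKM (A):", observed_fpkm["sample_A"])
```

The fold change between samples is the same with or without the bias, while the single-sample FPKM is underestimated. That is why a second transcript with a different bias factor would not be comparable to this one within the same sample.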

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Ryan Dale 5.0k

score 2 · Answer 2 · 2012-06-04

Hey Varun,

Have you checked whether those first few bases belong to an adaptor/barcode sequence? If left untrimmed, such sequences can produce the pattern you describe. I may be completely wrong, but go through the FastQC report, and if those sequences show up in the Overrepresented sequences section, you need to trim them off.
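A quick way to check this outside of FastQC is to count the most common read prefixes. Below is a minimal Python sketch. The helper names, the toy reads, and the adapter string are placeholders I made up for illustration, so substitute the adapter or barcode from your own library prep.

```python
from collections import Counter


def most_common_prefixes(reads, prefix_length=13, top=5):
    """Return the `top` most frequent read prefixes of the given length."""
    prefixes = Counter(
        seq[:prefix_length].upper() for seq in reads if len(seq) >= prefix_length
    )
    return prefixes.most_common(top)


def fraction_starting_with(reads, adapter):
    """Return the fraction of reads that begin with `adapter` (case-insensitive)."""
    reads = list(reads)
    if not reads:
        return 0.0
    adapter = adapter.upper()
    hits = sum(1 for seq in reads if seq.upper().startswith(adapter))
    return hits / len(reads)


# Placeholder adapter and toy reads, for illustration only.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"
toy_reads = ["AGATCGGAAGAGCTTT", "AGATCGGAAGAGCAAA", "TTTTCCCCGGGGAAAA"]

print(most_common_prefixes(toy_reads, prefix_length=13, top=1))
print(fraction_starting_with(toy_reads, EXAMPLE_ADAPTER))
```

If one prefix accounts for a large share of the reads, leftover adaptor/barcode sequence is a plausible explanation. If the prefixes are diverse and only the base composition is skewed, that looks more like a priming bias than something to trim.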

score 1 · Answer 3 · 2012-06-04

1

11.9 years ago

Istvan Albert 100k

The origin of the sample also matters. If the sample preparation isolates certain parts of a genome, as in a ChIP-Seq experiment, we could expect that to be reflected in the sequence content of the reads.

ADD COMMENT • link 11.9 years ago by Istvan Albert 100k

score 1 · Answer 4 · 2012-06-05

1

11.9 years ago

T ▴ 40

If you have Illumina sequencing data, this is a bias introduced by the random primers used in library preparation and is therefore expected.

ADD COMMENT • link 11.9 years ago by T ▴ 40