Question

mRNA-seq quality report (fastQC): Does it mean samples have adapters and should remove duplicates?

0

5.8 years ago

salamandra ▴ 550

The following FastQC report is typical of most replicates in an mRNA-seq experiment:

There is bias at the beginning of the reads in the 'Per base sequence content' module, and the 'Kmer Content' module also shows bias; however, the 'Adapter Content' module reports no error.

a) Does this mean that, although the reads are not contaminated with the most common adapters (such as TruSeq2 or Nextera), they could be contaminated with other, less common adapters? Note: I'm not sure which adapters were used in library preparation.

a-1) Should I make a file with all types of adapters and use it to trim the reads, or could this cause problems if there is no adapter contamination?

b) The 'Sequence Duplication Levels' module also shows a warning, and some reads are duplicated 10-50 times. However, if I choose to remove duplicates, I will lose ~45% of the library. Should I remove duplicates, or is this duplication level normal for highly expressed genes? Note: the total number of reads is ~15 million.

[Image: Sequence duplication]

RNA-Seq fastqc duplicates adapters • 3.1k views

ADD COMMENT • link updated 4.9 years ago by Biostar 20 • written 5.8 years ago by salamandra ▴ 550

2

5.8 years ago

Charles Warden 8.2k

I think it is useful to check out the blog posts described by genomax.

It's not unusual to have a relatively high duplicate rate for RNA-Seq data (although it can vary for different RNA-Seq protocols, and I would expect the duplicate estimation in aligned data to vary, at least to some extent, with single-end versus paired-end data). I don't believe I've seen a situation where removing duplicates has solved an issue when troubleshooting gene expression analysis, but testing different ways to process your data can likely help give you more confidence in your results.

So, if there is some difference in your results that you are concerned may be due to PCR duplicates (in addition to random overlap in high-coverage regions of highly expressed genes), you could try visualizing the alignment with a program like IGV, and potentially seeing how duplicate removal affects quantification / clustering / etc. However, my guess is that it won't be crucial in most cases, and I would usually perform gene expression analysis without duplicate removal.
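To make the "see how duplicate removal affects quantification" idea concrete, here is a minimal Python sketch. It is not any particular tool's method: the `(gene, start, strand)` tuple format, the function names, and the toy alignments are all hypothetical, and duplicates are defined naively as reads sharing the same gene, start position, and strand.

```python
from collections import Counter


def count_per_gene(alignments, dedup=False):
    """Count reads per gene.

    `alignments` is a list of (gene, start, strand) tuples, a hypothetical
    simplified stand-in for real alignment records.
    With dedup=True, reads sharing gene, start and strand are collapsed to one.
    """
    if dedup:
        alignments = set(alignments)
    return Counter(gene for gene, _start, _strand in alignments)


def retained_fraction(alignments):
    """Fraction of each gene's reads that survive naive duplicate removal."""
    alignments = list(alignments)  # allow generators; we iterate twice
    raw = count_per_gene(alignments)
    deduplicated = count_per_gene(alignments, dedup=True)
    return {gene: deduplicated[gene] / raw[gene] for gene in raw}


# Toy data (made up for illustration): geneA is highly expressed, so many
# reads start at the same position even without any PCR duplication.
toy = (
    [("geneA", 100, "+")] * 6
    + [("geneA", 105, "+")] * 2
    + [("geneB", 50, "+"), ("geneB", 80, "-")]
)

print(retained_fraction(toy))  # {'geneA': 0.25, 'geneB': 1.0}
```

In this toy case the highly expressed gene loses most of its reads while the lowly expressed one is untouched, which is the kind of expression-dependent distortion worth checking for before deciding to deduplicate.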

ADD COMMENT • link 5.8 years ago by Charles Warden 8.2k

score 7 · Accepted Answer · 2018-06-21

7

5.8 years ago

GenoMax 141k

Please see these blog posts from Dr. Simon Andrews' group (Author of FastQC). They should answer most of your questions.

Specifically,

Post #1 - Positional sequence bias in RNAseq
Post #2 - Sequence Duplication

How detrimental are duplicate reads in RNAseq experiments?
Don't remove duplicates.
It would be fine to scan/trim your data with a program like bbduk.sh or Trimmomatic (both provide adapter sequence files in their software distributions), since there may be some amount of extraneous sequence (adapters etc.) still present in your data. If you don't want to do scanning/trimming, then most aligners should be able to soft-clip the extraneous sequences during alignment.
Don't worry about k-mer content warnings unless you hit problems during downstream analysis. This module is now turned off by default in the latest FastQC.
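
If you want a quick sanity check before running a real trimmer, a rough Python sketch like the one below can estimate how many reads carry adapter sequence. It is only an exact-match scan, not a replacement for bbduk.sh or Trimmomatic. The function names, the `min_overlap` cutoff, and the toy reads are assumptions made for illustration; the default adapter is the 13-base start of the Illumina TruSeq adapter, which may not be what your library used.

```python
import gzip
from itertools import islice


def read_fastq_sequences(path, limit=100_000):
    """Yield up to `limit` sequences from a FASTQ file (plain or .gz).

    Assumes strict 4-line records (no wrapped sequence lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, 4 * limit, 4):
            yield line.strip()


def adapter_fraction(reads, adapter="AGATCGGAAGAGC", min_overlap=8):
    """Fraction of reads that contain the adapter, or end with a prefix of
    it at least `min_overlap` bases long (exact matches only)."""
    hits = 0
    total = 0
    for read in reads:
        total += 1
        if adapter in read:
            hits += 1
            continue
        # Look for a truncated adapter running off the 3' end of the read.
        for k in range(len(adapter) - 1, min_overlap - 1, -1):
            if read.endswith(adapter[:k]):
                hits += 1
                break
    return hits / total if total else 0.0


# Toy reads (made up): one full adapter, one partial at the 3' end, two clean.
toy_reads = [
    "ACGTACGTAGATCGGAAGAGCACAC",
    "ACGTACGTACGTAGATCGGAAG",
    "ACGTACGTACGTACGTACGT",
    "TTTTGGGGCCCCAAAATTTT",
]
print(adapter_fraction(toy_reads))  # 0.5
```

On real data you would call something like `adapter_fraction(read_fastq_sequences("sample.fastq.gz"))`, where the file name is a placeholder. A near-zero fraction is consistent with FastQC's clean 'Adapter Content' module for that adapter, but it says nothing about adapters you did not test.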

ADD COMMENT • link 5.8 years ago by GenoMax 141k

0

Regarding Post #1, there are some sentences I do not understand:

-> "The question then arises as to whether this bias has any implications for downstream analyses. There are a couple of potential concerns: 1-It’s possible that there is increased mis-priming as part of the bias – introducing an increased number of mis-called bases at the start of the sequence."

-> "The bias at the start of the sequences appears to be the result of biased selection of fragments from the library, so high levels of predicted SNPs are not an issue. "

-> "People often suggest fixing this issue by 5′ trimming of the reads to remove the biased portion – this however is not a fix. Since the biased composition is created by the selection of sequencing fragments and not by base call errors the only effect of trimming would be to change from having a library which starts over biased positions, to having a library which starts slightly downstream of biased positions." From this last passage I understand that trimming won't solve the problem of some fragments (those to which primers bind more readily) being overrepresented relative to others, but doesn't it solve the alignment problem? I mean, although reads are shorter after trimming, without the biased portion they should align better, shouldn't they?

ADD REPLY • link 5.8 years ago by salamandra ▴ 550

0

In practice, having that bias at the beginning of reads has been shown not to cause any problems with alignment. You can verify this yourself with your own data. Since the bias affects all samples equally, it should not cause any batch effect when you do the analysis.

If you feel comfortable losing 15 bp of good data at the beginning of each read, then you are welcome to chop those off. Remember that shorter reads could mean less precise mapping (so your alignment results may actually suffer). This will depend on the length of the read left after you scan/trim for adapters and additionally remove the 15 bp at the front.
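
As a rough illustration of the read-length trade-off, here is a small Python sketch. The function names and the example numbers (a 50 bp read with 10 adapter bases removed) are hypothetical; only the 15 bp head trim comes from the discussion above.

```python
def trim_head(seq, qual, n=15):
    """Remove the first `n` bases of a read together with their qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n:], qual[n:]


def remaining_length(read_len, adapter_bases=0, head_trim=15):
    """Read length left after adapter trimming plus a fixed 5' trim."""
    return max(read_len - adapter_bases - head_trim, 0)


# Hypothetical 50 bp read: the first 15 bases are the "biased" portion.
seq, qual = trim_head("A" * 15 + "C" * 35, "I" * 50)
print(len(seq))  # 35

# If adapter trimming had already removed 10 bases, only 25 bp would remain.
print(remaining_length(50, adapter_bases=10))  # 25
```

Whether the remaining length is still enough for unambiguous mapping depends on your genome and aligner, so it is worth comparing mapping rates with and without the head trim on a subset of reads.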

ADD REPLY • link 5.8 years ago by GenoMax 141k