mRNA-seq quality report (FastQC): Does it mean samples have adapters, and should I remove duplicates?
19 months ago
salamandra • 200

The following FastQC report is common to most replicates of an mRNA-seq experiment:

There's bias at the beginning of the reads in the 'Per base sequence content' module and there's 'Kmer Content' bias; however, there's no error in the 'Adapter Content' module.

a) Does it mean that although the reads are not contaminated with the best-known adapters (like TruSeq2 or Nextera), they could be contaminated with other, less common adapters? Note: I'm not sure which adapters were used in library preparation.

a-1) Should I make a file with all types of adapters and use that file to trim the reads, or could this cause problems if there is no adapter contamination?

b) The 'Sequence Duplication Levels' module also shows a warning, and we can see some sequences at duplication levels of 10-50. However, if I choose to remove duplicates, I will lose ~45% of the library. Should I remove duplicates, or is this duplication level normal for highly expressed genes? Note: the total number of reads is ~15 million.

4 weeks ago
genomax 68k
United States

Please see these blog posts from Dr. Simon Andrews' group (Author of FastQC). They should answer most of your questions.

Specifically,

Post #1 - Positional sequence bias in RNAseq
Post #2 - Sequence Duplication

• How detrimental are duplicate reads in RNAseq experiments?
• Don't remove duplicates.
• It would be fine to scan/trim your data with a program like bbduk.sh or trimmomatic (both provide adapter sequence files in their software distributions), since there may still be some extraneous sequence (adapters etc.) in your data. If you don't want to scan/trim, most aligners should be able to soft-clip the extraneous sequence during alignment.
• Don't worry about k-mer content warnings unless you hit problems during downstream analysis. This module is now turned off by default in the latest FastQC.
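
If you want a quick look before running a real trimmer, the scanning idea above can be sketched in a few lines of Python. This is only a rough illustration, not a replacement for bbduk.sh or trimmomatic: it counts exact matches only, the two adapter prefixes are commonly screened Illumina/Nextera sequences that I am assuming for the example (they may not be the ones used in this library), and the reads are made-up toy data.

```python
from collections import Counter

# Adapter prefixes assumed for illustration; swap in the sequences from the
# adapter file shipped with your trimmer if you know the library kit.
ADAPTERS = {
    "illumina_universal": "AGATCGGAAGAGC",
    "nextera_transposase": "CTGTCTCTTATA",
}

def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()

def adapter_hit_fractions(sequences, adapters=ADAPTERS):
    """Return the fraction of reads containing each adapter prefix (exact match)."""
    hits = Counter()
    total = 0
    for seq in sequences:
        total += 1
        for name, adapter in adapters.items():
            if adapter in seq:
                hits[name] += 1
    return {name: (hits[name] / total if total else 0.0) for name in adapters}

# Toy FASTQ records (hypothetical data, 25 bp each)
qual = "I" * 25
fastq_lines = [
    "@r1", "ACGTACGTAGATCGGAAGAGCACAC", "+", qual,
    "@r2", "TTGACCGATTGACCAGTTGACCAGT", "+", qual,
    "@r3", "GGCATCTGTCTCTTATACACATCTC", "+", qual,
    "@r4", "CCATGGCATGCATGGTACCATGGCA", "+", qual,
]

fractions = adapter_hit_fractions(read_fastq_sequences(fastq_lines))
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

On real data you would pass the lines of your FASTQ file instead of the toy list. A very low hit fraction is consistent with the clean 'Adapter Content' module, but it says nothing about adapters that are not in the dictionary.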

Regarding Post #1, there are some sentences I do not understand:

-> "The question then arises as to whether this bias has any implications for downstream analyses. There are a couple of potential concerns: 1-It’s possible that there is increased mis-priming as part of the bias – introducing an increased number of mis-called bases at the start of the sequence."

-> "The bias at the start of the sequences appears to be the result of biased selection of fragments from the library, so high levels of predicted SNPs are not an issue. "

-> "People often suggest fixing this issue by 5′ trimming of the reads to remove the biased portion – this however is not a fix. Since the biased composition is created by the selection of sequencing fragments and not by base call errors the only effect of trimming would be to change from having a library which starts over biased positions, to having a library which starts slightly downstream of biased positions."

From this last passage I understand that trimming won't solve the problem of some fragments (those to which primers bind more readily) being overrepresented, but doesn't it solve the alignment problem? That is, although the reads are shorter after trimming, shouldn't they align better without the biased portion?


In practice, that bias at the beginning of reads has been shown not to cause any problems with alignment. You can verify this yourself with your own data. Since the bias affects all samples equally, it should not cause any batch effect when you do the analysis.

If you feel comfortable losing 15 bp of good data at the beginning of each read, then you are welcome to chop those bases off. Remember that shorter reads could mean less precise mapping (so your alignment results may actually suffer). This will depend on the length of the read left after you scan/trim for adapters and additionally remove the 15 bp at the front.
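
To get a feel for how much read is left, here is a minimal sketch of the length bookkeeping. The function names, the 30 bp minimum length and the toy reads are all hypothetical choices for illustration; a real trimmer would do the trimming itself.

```python
def head_trim(seq, qual, n=15):
    """Remove the first n bases (and their quality values) from a read."""
    return seq[n:], qual[n:]

def summarize_lengths(reads, n=15, min_len=30):
    """Return (mean length after head trimming, number of reads shorter than min_len)."""
    lengths = [len(head_trim(seq, qual, n)[0]) for seq, qual in reads]
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    too_short = sum(1 for length in lengths if length < min_len)
    return mean_len, too_short

# Toy reads (hypothetical): one full-length read and two that were already
# shortened, e.g. by adapter trimming.
reads = [
    ("ACGT" * 19, "I" * 76),         # 76 bp
    ("ACGT" * 12 + "AC", "I" * 50),  # 50 bp
    ("ACGT" * 10, "I" * 40),         # 40 bp
]

mean_len, too_short = summarize_lengths(reads, n=15, min_len=30)
print(f"mean length after trimming: {mean_len:.1f} bp")
print(f"reads shorter than 30 bp: {too_short}")
```

Reads that were already shortened by adapter trimming are the ones most likely to fall under whatever minimum length your aligner handles well, which is why the combined effect matters more than the 15 bp alone.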

15 months ago
Duarte, CA

I think it is useful to check out the blog posts described by genomax.

It's not unusual to have a relatively high duplicate rate for RNA-Seq data (although it can vary for different RNA-Seq protocols, and I would expect the duplicate estimation in aligned data to vary, at least to some extent, with single-end versus paired-end data). I don't believe I've seen a situation where removing duplicates has solved an issue when troubleshooting gene expression analysis, but testing different ways to process your data can likely help give you more confidence in your results.
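
For a rough sense of scale: with the ~15 million reads and ~45% loss mentioned in the question, duplicate removal would leave about 8.25 million reads. If you want to count exact-sequence duplicates in the raw reads yourself, a minimal sketch looks like the following. It treats any two identical read sequences as duplicates, which is a simplification: alignment-based duplicate marking works on mapping positions instead, and FastQC's own estimate works from a subset of the file, so the numbers may not match. The toy sequences are made up.

```python
from collections import Counter

def duplication_summary(sequences):
    """Return (total reads, distinct sequences, fraction lost if one copy is kept)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    removed_fraction = (total - distinct) / total if total else 0.0
    return total, distinct, removed_fraction

# Toy data (hypothetical): one highly duplicated sequence, one moderately
# duplicated sequence, and two singletons.
sequences = ["AAAA"] * 5 + ["CCCC"] * 3 + ["GGGG", "TTTT"]

total, distinct, removed_fraction = duplication_summary(sequences)
print(f"{total} reads, {distinct} distinct, {removed_fraction:.0%} removed by deduplication")
```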

So, if there is some difference in your results that you are concerned may be due to PCR duplicates (as opposed to random overlap in high-coverage regions of highly expressed genes), you could try visualizing the alignment with a program like IGV, and potentially check how duplicate removal affects quantification / clustering / etc. However, my guess is that it won't be crucial in most cases, and I would usually perform gene expression analysis without duplicate removal.
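
One simple way to see how duplicate removal affects quantification is to compare per-gene counts from the two processing routes. The sketch below assumes you already have two count tables for the same sample (with and without duplicate removal); the gene names and counts are invented, and the correlation of log2(count + 1) is just one possible summary, not an established test.

```python
import math

def log2_counts(counts, genes):
    """Return log2(count + 1) for each gene, using 0 for missing genes."""
    return [math.log2(counts.get(gene, 0) + 1) for gene in genes]

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts for one sample, before and after duplicate removal.
with_duplicates = {"geneA": 1000, "geneB": 200, "geneC": 15, "geneD": 3}
deduplicated = {"geneA": 400, "geneB": 150, "geneC": 14, "geneD": 3}

genes = sorted(with_duplicates)
r = pearson(log2_counts(with_duplicates, genes), log2_counts(deduplicated, genes))
print(f"Pearson r of log2 counts: {r:.3f}")
```

In the toy numbers, the highly expressed gene loses the most reads, which is the pattern you would expect if most duplicates come from random overlap in highly expressed genes rather than from PCR.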