Question

Highly uneven RINs, mapping rates, and methods to deal with this (normalisation, salmon)

5.2 years ago

chris86 ▴ 400

Hi

I have Illumina mRNA-seq samples in which a subset has low RINs (2-4) compared with the rest. Probably as a result, mapping rates vary very widely (15%-70%), and so do the counts per sample (e.g. 8,000,000 vs 40,000,000 mapped reads). I can't really use RIN or mapping rate as a covariate because both are heavily confounded with a group of interest. I am currently looking at excluding another 20 samples with low mapping rates.

Is there a preferred way of analysing this type of data? If I do the usual VST through DESeq2, I get a cluster of samples with irregularly high expression of many genes. These are also the samples with low overall counts, so presumably this is caused by the problem described above. I was wondering whether quantile normalisation would help, since it uses ranks to make the samples more comparable, and this could be the kind of extreme situation where it is useful. Are there any other ideas?

I also used Salmon to quantify the data with the GC bias correction and validate mappings flags (--gcBias and --validateMappings). Reads are 150 bp paired-end. If I run without GC bias correction and validate mappings, the mapping rates go up by about 10%, but I suspect the quality of those mappings is lower, so I am currently using the data generated with these flags.

Thanks,

Chris

RNA-Seq salmon normalisation • 1.1k views

And if you do run gc bias correction?

5.2 years ago by i.sudbery 19k

I'm not sure whether it is validate mappings or GC bias correction that lowers the mapping rates, but both flags seem to be generally recommended, so I used both. I also tried trimming the reads to 75 bp and mapping single-end reads only, but neither increased the mapping rates. For some reason Illumina seem to think shorter (50-75 bp) single-end reads are slightly preferable for mRNA quantification because they don't span splice junctions (http://emea.support.illumina.com/bulletins/2017/04/considerations-for-rna-seq-read-length-and-coverage-.html).

5.2 years ago by chris86 ▴ 400

score 1 · Answer 1 · 2019-02-07

If you are getting some odd genes with very high expression, I suspect quantile normalisation isn't going to help.
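
To see why, here is a minimal sketch of quantile normalisation in Python with NumPy. The function name and the toy matrix are made up for illustration and do not come from any package. Each value is replaced by the mean of the values holding the same rank across samples, so every sample ends up with an identical distribution, but a gene that ranks abnormally high within a degraded sample still ranks abnormally high afterwards.

```python
import numpy as np

def quantile_normalise(mat):
    """Quantile-normalise a genes x samples matrix (ties broken by order)."""
    mat = np.asarray(mat, dtype=float)
    # Rank of each gene within its own sample (0 = lowest value)
    order = np.argsort(mat, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean across samples of the sorted columns
    reference = np.sort(mat, axis=0).mean(axis=1)
    # Give every gene the reference value that matches its rank
    return reference[ranks]

# Toy data: column 0 is a "good" sample, column 1 a "degraded" one.
# Gene 0 is the lowest gene in the good sample but the highest in the degraded one.
expr = np.array([
    [100, 90],
    [200, 20],
    [300, 30],
    [400, 40],
])
qn = quantile_normalise(expr)
print(qn)
```

Both columns now contain the same set of values, yet gene 0 is still the lowest gene in one sample and the highest in the other, so the odd cluster would be expected to survive this kind of normalisation.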

Prediction: These highly expressed genes are short.

Because mRNA-seq uses polyA selection, in the degraded samples you are only going to get reads from the very 3' end of transcripts.
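
One rough way to check this prediction is to rank-correlate gene length against how much higher each gene appears in the low-RIN samples. The sketch below uses NumPy only. The gene lengths and log2 ratios are invented numbers for illustration; in practice they would come from your annotation and from your normalised expression values, and the helper name is hypothetical.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (ties broken by order, fine for a quick look)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Pearson correlation of the ranks
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Hypothetical toy inputs, one entry per gene.
gene_length = np.array([500, 800, 1500, 3000, 6000, 12000])  # bp
# log2(mean expression in low-RIN samples / mean expression in high-RIN samples)
log2_ratio = np.array([2.1, 1.4, 0.3, -0.2, -0.9, -1.5])

rho = spearman(gene_length, log2_ratio)
print(f"Spearman rho between length and low/high-RIN ratio: {rho:.2f}")
```

A clearly negative correlation would be consistent with short genes looking artificially high in the degraded samples. It would not prove the cause, since length can be confounded with other gene properties.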

A couple of things you could try:

Is there a RIN cutoff where you get samples from both groups?
You could try visualising the 3' bias using a metagene plot, locating where the coverage drops off in your poor samples, and then truncating your transcript models to that length so that you only quantify the 3' end in all your samples.
If the poor samples have very high duplication rates, you might want to consider deduplication.
You could also downsample the good samples to make them look more like the bad ones.
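
As an illustration of the last suggestion, here is a minimal sketch of downsampling a count vector by binomial thinning with NumPy. The function name and toy counts are hypothetical, and it assumes integer counts; Salmon's estimated counts are fractional and would need rounding first. Subsampling the reads before quantification is the other way to do this.

```python
import numpy as np

def downsample_counts(counts, target_total, seed=0):
    """Thin an integer count vector to roughly target_total reads."""
    counts = np.asarray(counts, dtype=np.int64)
    total = counts.sum()
    if target_total >= total:
        return counts.copy()  # already at or below the target depth
    rng = np.random.default_rng(seed)
    # Keep each read independently with probability target_total / total,
    # so the new total is close to, but not exactly, target_total.
    return rng.binomial(counts, target_total / total)

# Toy example: a well-sequenced sample thinned towards the depth of a poor one
good = np.array([1000, 3000, 6000])
thinned = downsample_counts(good, target_total=2000)
print(thinned, thinned.sum())
```

This only equalises depth. It does nothing about the 3' bias itself, so it would need to be combined with one of the other approaches.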

Note that all of these are mostly about making your good samples look like the poor ones, which is ultimately not ideal. Although you might get something out of these analyses, consider what a reviewer would think of them. If I were a reviewer, and the samples weren't limiting, I would probably say you had to do the experiment again.