What batch correction was applied to pan-Cancer mRNA expression data?
14 months ago
user31888 • 60
United States

I need to find out which normalisation (and possibly which batch correction method) was used to produce the pan-Cancer Atlas mRNA expression matrix (file called 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv' found here).

Starting from the raw read counts obtained from the GDC and the same gene panel, I tried FPKM and FPKM-UQ normalisation as described here, but the resulting expression values are nowhere near the range of those in the pan-Cancer mRNA matrix. This may suggest that a cross-sample batch correction was applied.

My goal is, starting from raw read counts, to normalise expression data from new samples together with the pan-Cancer mRNA data, in order to get a unified expression matrix and to be able to compare apples to apples basically.

Any information or alternative method would be greatly appreciated.

10 months ago
i.sudbery 4.7k
Sheffield, UK

My guess (and it is only a guess), given the name of the file, is that this is built from the RSEM quantification results that are present in the Broad Institute's Firehose portal, rather than from read counts.

RSEM uses an EM algorithm to estimate isoform expression values. A length-weighted sum of these values is then used to create gene expression values.

The firehose documentation states that these are normalised like so:

RSEM expression estimates are normalized to set the upper quartile count at 1000 for gene level and 300 for isoform level estimates.
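As a rough illustration of what that sentence describes, here is a minimal sketch of upper-quartile scaling for a single sample. It is not the Firehose code: the function name `upper_quartile_normalize` and the `exclude_zeros` switch are my own, and whether the pipeline computes the quartile over all genes or only over genes with non-zero estimates is an assumption you would need to check against the pipeline itself.

```python
import numpy as np

def upper_quartile_normalize(estimates, target=1000.0, exclude_zeros=True):
    """Scale one sample so that its upper quartile (75th percentile) equals `target`.

    estimates     : 1-D array of gene-level expression estimates for one sample
    target        : value the upper quartile is set to (1000 for genes per the docs)
    exclude_zeros : ASSUMPTION, compute the quartile over non-zero genes only
    """
    estimates = np.asarray(estimates, dtype=float)
    reference = estimates[estimates > 0] if exclude_zeros else estimates
    if reference.size == 0:
        raise ValueError("no values available to compute the upper quartile")
    upper_quartile = np.percentile(reference, 75)
    if upper_quartile == 0:
        raise ValueError("upper quartile is zero; cannot rescale")
    # Every gene in the sample is multiplied by the same scaling factor.
    return estimates * (target / upper_quartile)

# Toy sample with five genes (made-up numbers).
sample = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
normalized = upper_quartile_normalize(sample)
```

Because this is a single per-sample scaling factor, it changes the range of the values but does nothing to remove differences between batches.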

Please note that this is definitely NOT a batch correction, and that batch effects have been shown to be a serious problem in PanCancer analyses (although this was shown at the level of somatic variants).

Thanks @i.sudbery !

You are right, the expression data from the different TCGA cancer types have been obtained from Firehose pipelines and merged together to form the pan-Cancer Atlas expression matrix 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv'.

Looking at the pipelines used on Firehose ('MapspliceRSEM' here), it seems that RSEM was used for read quantification, and the estimates were then normalised by setting the upper quartile count to 1,000, as you mentioned.

However, when starting from read counts, I still cannot retrieve similar expression values using GetNormalizedMat together with the MedianNorm or QuantileNorm functions from the EBSeq package (manual here).

You will not be able to retrieve similar quantifications starting from read counts, because RSEM uses a fundamentally different model to estimate expression compared to a read counting model.
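A toy example may make the difference concrete. The sketch below is not RSEM's implementation. It is a stripped-down EM for two transcripts that share some reads, with made-up read numbers and a hypothetical function name `em_expected_counts`. The counting model it is compared against is assumed to simply discard ambiguous reads, which is only one of several possible counting behaviours.

```python
def em_expected_counts(unique_a, unique_b, ambiguous, len_a, len_b, n_iter=200):
    """Toy EM: split ambiguous reads between transcripts A and B.

    unique_a, unique_b : reads compatible with only A or only B
    ambiguous          : reads compatible with both transcripts
    len_a, len_b       : effective transcript lengths
    Returns the expected (fractional) read counts for A and B.
    """
    total = unique_a + unique_b + ambiguous
    theta_a, theta_b = 0.5, 0.5  # initial guess: fraction of reads from each transcript
    for _ in range(n_iter):
        # E-step: probability that an ambiguous read came from A,
        # proportional to abundance divided by effective length.
        weight_a = theta_a / len_a
        weight_b = theta_b / len_b
        p_a = weight_a / (weight_a + weight_b)
        count_a = unique_a + ambiguous * p_a
        count_b = unique_b + ambiguous * (1.0 - p_a)
        # M-step: update the read fractions from the expected counts.
        theta_a, theta_b = count_a / total, count_b / total
    return count_a, count_b

# Made-up numbers: 90 reads unique to A, 10 unique to B, 100 ambiguous.
em_a, em_b = em_expected_counts(90, 10, 100, len_a=1000, len_b=1000)

# A counting model that drops ambiguous reads sees only the unique reads.
counted_a, counted_b = 90, 10
```

Under these assumptions the EM hands most of the ambiguous reads to the more abundant transcript and returns fractional expected counts, whereas the counting model reports integer counts with the ambiguous reads missing. No per-sample rescaling applied afterwards turns one set of numbers into the other.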

@i.sudbery: That's right.
