Question

How to deal with batch effects in a TCGA RNA-seq dataset

3

5.1 years ago

tujuchuanli ▴ 100

Hi,

I ran a differential expression (DE) analysis on TCGA datasets with edgeR and identified DE genes between cancer and normal samples. During the review of my paper, one of the reviewers asked how I dealt with batch effects.

I checked the edgeR manual. It can handle batch effects by adding the batch to the design matrix, for example “design <- model.matrix(~Batch+Treatment)” in section 3.4.3. However, the batch has to be specified by the user, for example “Batch <- factor(c(1,3,4,1,3,4))”. To do that, I need to know which sample belongs to which batch, and I don't know this for the TCGA datasets. edgeR also provides a function called “plotMDS” to check for batch effects in a dataset, but I don't know how to interpret this plot properly.
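For context, the additive design described in the manual can be sketched in base R as follows. Only the `Batch` vector comes from the manual's example; the `Treatment` labels and their assignment to samples are made up for illustration.

```r
# Hypothetical sample sheet: six samples spread over three batches.
# Only the Batch vector is taken from the manual; the labels below are assumed.
Batch     <- factor(c(1, 3, 4, 1, 3, 4))
Treatment <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"),
                    levels = c("normal", "tumor"))

# Additive model: batch differences are absorbed by the Batch columns,
# and the last column is the tumor vs normal effect adjusted for batch.
design <- model.matrix(~ Batch + Treatment)

colnames(design)
# "(Intercept)" "Batch3" "Batch4" "Treatmenttumor"
```

This only works when each batch contains samples from both conditions; if batch and condition are fully confounded, the design matrix is not of full rank and the effect cannot be estimated.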

Do you know how to deal with batch effects in TCGA RNA-seq datasets? Can you show me how to identify a batch effect in an MDS plot?

Thanks in advance.

TCGA RNA-Seq batch-effect • 4.5k views

updated 1 day ago by Ram 43k • written 5.1 years ago by tujuchuanli ▴ 100

0

Can you show the plot?

How to add images to a Biostars post

5.1 years ago by ATpoint 81k

score 3 · Answer 1 · 2019-03-25

TCGA doesn't provide much useful information for quality control, so you won't be able to supply known batches. A good alternative is to use housekeeping genes as negative controls, i.e. genes whose expression should be similar across samples. RUVSeq has functions for estimating batch effects from such genes or from spiked-in molecules, and it integrates seamlessly with edgeR. See its vignette for examples of how to use it.
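A minimal sketch of that workflow, assuming the RUVSeq and edgeR packages are installed. The counts are simulated, and the "housekeeping" set is simply the first 50 simulated genes, so every name and number below is a placeholder rather than a recommendation.

```r
library(RUVSeq)  # provides RUVg()
library(edgeR)

set.seed(1)
n_genes   <- 300
n_samples <- 8
group     <- factor(rep(c("normal", "tumor"), each = 4),
                    levels = c("normal", "tumor"))

# Simulated counts with a hidden nuisance factor (assumption for illustration).
unwanted <- rep(c(0, 1), times = 4)
base     <- runif(n_genes, 50, 500)
shift    <- rnorm(n_genes, 0, 0.5)
mu       <- base * exp(outer(shift, unwanted))
counts   <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10),
                   nrow = n_genes,
                   dimnames = list(paste0("gene", seq_len(n_genes)),
                                   paste0("s", seq_len(n_samples))))

# Hypothetical control set: pretend the first 50 genes are housekeeping genes.
controls <- rownames(counts)[1:50]

# Estimate k = 1 factor of unwanted variation from the control genes.
ruv <- RUVg(counts, controls, k = 1)

# Add the estimated factor to the design and test the tumor vs normal effect.
design <- cbind(model.matrix(~ group), ruv$W)
y   <- DGEList(counts = counts, group = group)
y   <- calcNormFactors(y)
y   <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)  # coefficient 2 is the tumor vs normal column
topTags(lrt)
```

The original counts go into edgeR, and the estimated factor enters only through the design matrix. The choice of control genes and of `k` matters a lot in practice and would need to be justified for real TCGA data.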

score 1 · Answer 2 · 2019-03-25

Unfortunately there are no spike-ins in TCGA data, so I would be careful about using RUVSeq as Dario suggests. There are, however, two other options:

1. Subset the TCGA data to paired samples only (healthy and tumor sample from the same patient) and do a paired analysis. Since the paired samples are processed simultaneously, it is quite difficult to imagine a batch effect between them. If you want to be really strict, you can intersect the results of the paired and the unpaired analysis (the one you already have) and only call the genes identified in both analyses significant. Due to the number of samples you will probably need to use voom + limma and its "duplicateCorrelation()"; see section 9.7 of the limma vignette.
2. Use SVA, which is designed for "identifying and estimating surrogate variables for unknown sources of variation", i.e. the batch effects in the TCGA data, which you can then take into account in your model.
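Both options come down to what goes into the design matrix. The sketch below assumes the sva package is installed and uses simulated counts for six hypothetical patients, each with one normal and one tumor sample; the number of surrogate variables is fixed at one purely for illustration.

```r
library(sva)  # provides svaseq()

set.seed(1)
n_genes <- 200
patient <- factor(rep(1:6, times = 2))
tissue  <- factor(rep(c("normal", "tumor"), each = 6),
                  levels = c("normal", "tumor"))

# Simulated counts with a hidden batch that is shared within each patient
# (all values here are made up for illustration).
hidden <- rep(c(0, 1), times = 6)
base   <- runif(n_genes, 50, 500)
shift  <- rnorm(n_genes, 0, 0.5)
mu     <- base * exp(outer(shift, hidden))
counts <- matrix(rnbinom(n_genes * 12, mu = mu, size = 10), nrow = n_genes)

# Option 1: paired design, the patient term absorbs anything shared by a pair.
design_paired <- model.matrix(~ patient + tissue)

# Option 2: estimate surrogate variables for unknown sources of variation.
mod  <- model.matrix(~ tissue)                           # full model
mod0 <- model.matrix(~ 1, data = data.frame(tissue))     # null model
sv   <- svaseq(counts, mod, mod0, n.sv = 1)              # n.sv fixed by assumption

sv_mat <- sv$sv
colnames(sv_mat) <- paste0("SV", seq_len(ncol(sv_mat)))
design_sva <- cbind(mod, sv_mat)  # use this design in edgeR or voom + limma
```

Either design matrix can then be passed to the same edgeR or voom + limma calls as before. Leaving `n.sv` unset lets sva estimate the number of surrogate variables, which is probably the better default on real data.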