RNA-seq batch effect due to sequencing platform
8.9 years ago

Background: I have to perform a differential gene expression analysis using RNA-seq data. We have two genotypes, with RNA-seq data for control and after treatment for both genotypes, and two biological replicates for each case. In short, we have eight samples: four before treatment (2 genotypes x 2 replicates) and four after. Our goal is to identify genes that are differentially expressed between the genotypes after the treatment. Note: we actually have several treatments, but I have tried to keep the question simple.

Problem: Each biological replicate was run on a different sequencing platform (Ion Proton and SOLiD Wildfire). Trust me, it wasn't my idea. Don't kill the messenger (bioinformatician) :-)

We now see a huge difference in expression counts between biological replicates that is purely due to the batch (platform) effect. PCA clusters samples by platform, not by treatment or strain. The Ion Proton samples always show higher read counts. The same holds for RPKM values, so the problem is not caused by differences in sequencing depth. The batch effect is not consistent across pairs of biological replicates, and the correlation between counts from the two platforms (that is, between biological replicates) ranges from 0.3 to 0.6 depending on the case.

I can use the batch as a covariate in my DESeq2 analysis, but:

A) Is there a better approach to remove the variation due to the different sequencing platforms? The reason is that we have samples from multiple treatments and may later need to merge reads from nearly identical treatments. Scaling or correcting the values would be preferable, so that the corrected counts from those treatments can be merged.

B) Should I perform the correction at the level of each biological replicate pair, or should I create two groups (Wildfire and Ion Proton) and perform batch correction using all the samples? (That would be 4 Wildfire and 4 Ion Proton here; I actually have many more samples on each platform because of the multiple treatments, but I mentioned only two conditions to keep the question simple.)

C) I have never used ComBat, but I read that it doesn't work for small sample sizes, so I may need to carry out batch correction using all the samples even though the batch effect is inconsistent. Also, since ComBat takes log-transformed normalized data as input, I won't be able to use its output as input for DESeq2. I would have to use limma instead, right?

Please excuse me if I haven't used the correct terminology. I am new to this.
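Whether platform can be used as a covariate at all depends on it not being confounded with genotype or treatment. Below is a minimal Python sketch of that check. It assumes the layout described above, with one replicate of every genotype/treatment group on each platform. The sample sheet, the labels `G1`/`G2`, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sample sheet: (genotype, treatment, platform).
# Assumes replicate 1 of each group ran on Ion Proton and replicate 2 on SOLiD Wildfire.
samples = [
    (g, t, p)
    for g, t, p in product(["G1", "G2"], ["control", "treated"], ["IonProton", "Wildfire"])
]


def cell_counts(samples):
    """Count samples in every (genotype, treatment) x platform cell."""
    return Counter(((g, t), p) for g, t, p in samples)


def is_balanced(samples):
    """True when every biological group has the same non-zero number of samples on every platform."""
    counts = cell_counts(samples)
    groups = {(g, t) for g, t, _ in samples}
    platforms = {p for _, _, p in samples}
    per_cell = {counts.get((grp, plat), 0) for grp in groups for plat in platforms}
    return len(per_cell) == 1 and 0 not in per_cell


if __name__ == "__main__":
    for (group, platform), n in sorted(cell_counts(samples).items()):
        print(group, platform, n)
    print("balanced:", is_balanced(samples))
```

If any group were missing from one platform, the platform and biological effects could not be separated for that group, whichever correction method is used.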

Thanks.

batch-effect RNA-Seq • 4.6k views

Hey Ashutosh,

How about using Surrogate Variable Analysis to remove the batch effect? The sva package from Bioconductor includes ComBat for removing batch effects.

Alternatively, I think quantile normalization of your log-transformed counts per million would also help in your case.
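To make the quantile normalization suggestion concrete, here is a small pure-Python sketch. The pseudocount of 0.5, the function names, and the toy counts are assumptions for illustration. In practice an established implementation would be the safer choice.

```python
import math


def log_cpm(counts, prior=0.5):
    """Convert per-sample raw counts to log2 counts per million.

    counts: list of samples, each a list of gene counts in the same gene order.
    prior: pseudocount to avoid log(0); 0.5 is an assumed choice for this sketch.
    """
    out = []
    for sample in counts:
        lib_size = sum(sample)
        out.append([math.log2((c + prior) / (lib_size + 1.0) * 1e6) for c in sample])
    return out


def quantile_normalize(columns):
    """Force every sample to share the same distribution of values.

    Each value is replaced by the mean, across samples, of the values holding
    the same rank. Ties are broken by gene order rather than averaged.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_genes)]
    normalized = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda i: col[i])
        new_col = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            new_col[gene_idx] = reference[rank]
        normalized.append(new_col)
    return normalized


if __name__ == "__main__":
    # Toy counts for four genes in two samples (made-up numbers).
    toy = [[10, 200, 3000, 50], [100, 1500, 40000, 700]]
    for col in quantile_normalize(log_cpm(toy)):
        print([round(v, 2) for v in col])
```

Note that this only equalizes the overall distributions. If the platform effect differs from gene to gene, as the low replicate correlations suggest, the gene ranks within each sample stay as they were.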


Thanks Manvendra.


RUVSeq worked very well on my dataset. Maybe you can give it a try.

8.9 years ago

As far as I'm aware, svaseq will only identify potential technical variation, not correct for it (though this may have changed since I last looked). What analyses are you carrying out? I think you'll have to tackle this not through "batch correction" but by accounting for the batch in the model design. If you're using DESeq2, for example, include the batch as a term in the design formula.
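In DESeq2 this corresponds to a design formula such as `~ batch + condition`. The sketch below is not DESeq2's negative binomial model. It only illustrates, in Python and on the log scale, what a batch term does in a balanced layout: the genotype effect is estimated from within-platform differences, so a constant platform shift cancels out. The expression values and names are made up for illustration.

```python
from statistics import mean

# Made-up log2 expression values for one gene after treatment.
# Keys are (platform, genotype); one sample per cell, as in the question's layout.
log_expr = {
    ("IonProton", "G1"): 9.0,
    ("IonProton", "G2"): 10.1,
    ("Wildfire", "G1"): 5.0,
    ("Wildfire", "G2"): 5.9,
}


def blocked_effect(log_expr, ref="G1", alt="G2"):
    """Genotype effect under an additive model log_expr ~ platform + genotype.

    For a balanced design with one sample per cell, the least-squares estimate
    equals the mean of the within-platform differences (alt - ref).
    """
    platforms = sorted({p for p, _ in log_expr})
    diffs = [log_expr[(p, alt)] - log_expr[(p, ref)] for p in platforms]
    return mean(diffs), diffs


def platform_gap(log_expr, a="IonProton", b="Wildfire"):
    """Mean difference between platforms, i.e. the shift the batch term absorbs."""
    genotypes = sorted({g for _, g in log_expr})
    return mean(log_expr[(a, g)] - log_expr[(b, g)] for g in genotypes)


if __name__ == "__main__":
    effect, diffs = blocked_effect(log_expr)
    print("within-platform differences:", diffs)
    print("genotype effect (log2):", effect)
    print("platform gap (log2):", platform_gap(log_expr))
```

In this toy gene the platform gap is several log2 units while the genotype effect is about one, yet the effect is still recoverable because each platform contains both genotypes.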


Thanks Andrew.
