Question

Do I have batch effect?

0

Entering edit mode

3.9 years ago

avino ▴ 20

So I have two different RNA-seq experiments which I am going to call experiment 1 and 2.

Experiment 1 and 2 overviews in r:

glimpse(Experiment_1)
Rows: 47,069
Columns: 10
$ Gene           <chr> "ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028", "ENSMUSG00000000031", "ENSMUSG00000000037", "ENSMUSG000000…
$ Symbol         <chr> "Gnai3", "Pbsn", "Cdc45", "H19", "Scml2", "Apoh", "Narf", "Cav2", "Klf6", "Scmh1", "Cox5a", "Tbx2", "Tbx4", "Zfy2", "Ngfr", …
$ DMSO_Jour2_N2  <int> 23857, 0, 2696, 63, 3, 59, 4735, 327, 8869, 4549, 19894, 4, 0, 1, 3, 825, 27, 967, 8475, 3726, 8928, 2012, 2413, 6674, 0, 39…
$ DMSO_Jour2_N1  <int> 19501, 0, 2234, 36, 7, 52, 4116, 302, 7344, 4118, 18673, 3, 0, 1, 3, 738, 21, 722, 6512, 3175, 6986, 1721, 2068, 6143, 1, 34…
$ X4OHT_Jour2_N4 <int> 26304, 0, 2218, 176, 4, 40, 4719, 1800, 12798, 4648, 17333, 2, 0, 0, 0, 425, 41, 1155, 9205, 3685, 5419, 1854, 2429, 5037, 0…
$ X4OHT_Jour2_N1 <int> 20747, 0, 1745, 126, 10, 32, 3904, 1164, 8172, 3849, 14201, 4, 0, 0, 1, 292, 41, 905, 6955, 2909, 3903, 1502, 1766, 4834, 0,…
$ DMSO_Jour2_N4  <int> 23692, 0, 2411, 84, 7, 45, 3988, 386, 10012, 4000, 17047, 3, 0, 1, 3, 903, 18, 937, 8061, 3335, 9281, 1814, 2242, 5423, 1, 3…
$ DMSO_Jour2_N3  <int> 24641, 0, 2268, 58, 4, 52, 4919, 284, 8327, 3784, 18543, 2, 0, 0, 0, 663, 21, 914, 7923, 3511, 7063, 1886, 2022, 6116, 0, 37…
$ X4OHT_Jour2_N3 <int> 18933, 0, 1620, 75, 3, 30, 2894, 1200, 7935, 3237, 12891, 1, 0, 1, 0, 315, 41, 861, 6441, 2754, 3683, 1372, 1499, 4007, 0, 3…
$ X4OHT_Jour2_N2 <int> 23235, 0, 1877, 84, 5, 50, 3804, 1138, 9085, 3867, 16628, 4, 0, 0, 0, 355, 39, 978, 7224, 3318, 4277, 1620, 1712, 5443, 1, 4…

glimpse(Experiment_2)
Rows: 47,069
Columns: 8
$ Gene           <chr> "ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028", "ENSMUSG00000000031", "ENSMUSG00000000037", "ENSMUSG000000…
$ Symbol         <chr> "Gnai3", "Pbsn", "Cdc45", "H19", "Scml2", "Apoh", "Narf", "Cav2", "Klf6", "Scmh1", "Cox5a", "Tbx2", "Tbx4", "Zfy2", "Ngfr", …
$ X4OHT_Wnt3A_N3 <int> 11155, 0, 1258, 23, 4, 14, 1661, 1158, 6316, 2237, 7656, 0, 0, 0, 2, 283, 23, 543, 4489, 1649, 3288, 769, 1141, 2043, 1, 175…
$ X4OHT_Wnt3A_N2 <int> 8759, 0, 1228, 20, 1, 4, 1520, 859, 4903, 2076, 6249, 2, 0, 0, 1, 327, 8, 469, 4249, 1474, 5149, 769, 1149, 1427, 0, 1202, 1…
$ X4OHT_Wnt3A_N1 <int> 9200, 0, 1158, 35, 0, 10, 1217, 1138, 5631, 1913, 6097, 0, 0, 0, 1, 357, 16, 509, 4175, 1516, 5052, 817, 1201, 1662, 2, 1233…
$ DMSO_Wnt3A_N3  <int> 8061, 0, 899, 27, 0, 14, 1788, 142, 3114, 1416, 6476, 0, 0, 0, 0, 346, 4, 342, 3070, 1167, 3567, 683, 880, 1865, 0, 1130, 73…
$ DMSO_Wnt3A_N1  <int> 8808, 0, 1277, 14, 1, 9, 1049, 647, 5472, 2097, 6888, 0, 0, 0, 0, 558, 3, 440, 4058, 1552, 6196, 751, 1118, 1974, 0, 1170, 1…
$ DMSO_Wnt3A_N2  <int> 9096, 0, 1373, 32, 0, 13, 1360, 451, 4771, 2058, 6831, 1, 0, 0, 0, 591, 13, 437, 4166, 1590, 6693, 822, 1224, 1772, 0, 1320,…

In experiment 1, I compared condition "DMSO_Jour2" vs condition "X4OHT_Jour2", 4 replicates each;

In experiment 2, I compared condition "DMSO_Wnt3A" vs condition "X4OHT_Wnt3A", 3 replicates each.

Now I would like to get the gene_counts of those two different experiments and combine them (they have the same number of rows, aka genes) into two rearranged tables in a way to make two differential analyses, one comparing DMSO_Jour2 to DMSO_Wnt3A (called experiment "DMSO", see below) and another one comparing X4OHT_Jour2 to X4OHT_Wnt3A (called experiment "OHT", see below). Should I have to add a batch effect when I design my matrix?

So far this is how I make batch effect in R, following was an attempted first option to include batch effect in design matrix, which I am not able to...

replicate_DMSO <- factor(c(1,1,1,1,2,2,2))
replicate_OHT <- factor(c(1,1,1,1,2,2,2))

Batch <- factor(c(1,1,1,1,2,2,2))

design_DMSO <- model.matrix(~replicate_DMSO + Batch)
design_OHT <- model.matrix(~replicate_OHT + Batch)

design_DMSO
  (Intercept) replicate_DMSO2 Batch2
1           1               0      0
2           1               0      0
3           1               0      0
4           1               0      0
5           1               1      1
6           1               1      1
7           1               1      1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$replicate_DMSO
[1] "contr.treatment"

attr(,"contrasts")$Batch
[1] "contr.treatment"

design_OHT
  (Intercept) replicate_OHT2 Batch2
1           1              0      0
2           1              0      0
3           1              0      0
4           1              0      0
5           1              1      1
6           1              1      1
7           1              1      1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$replicate_OHT
[1] "contr.treatment"

attr(,"contrasts")$Batch
[1] "contr.treatment"

The problem is then when I try to run later in edgeR analysis common dispersion I got those errors:

y_DMSO <- estimateGLMCommonDisp(y_DMSO, design_DMSO, verbose=TRUE)
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
  Design matrix not of full rank.  The following coefficients not estimable:
 Batch2

y_OHT <- estimateGLMCommonDisp(y_OHT, design_OHT, verbose=TRUE)
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
  Design matrix not of full rank.  The following coefficients not estimable:
 Batch2

I know there is something off with my design matrix so I am wondering, do I really have to consider a batch effect in this case and how to address it to the design of the matrix.

When I do differential expression without batch effect I get tons of over-expressed genes.

As a second option to treat batch effect I used ComBat_seq (without normalizing first) from package sva on the total count matrix including experiment 1 and 2, then splitting into two table matrices for running the differential analyses DMSO and OHT.

Here, no error, but it kills all the variation and nothing is differentially expressed. I generated a dendrogram before sva batch effect and it looks pretty much I have two clusters that corresponded to the two batches...they disappear when I treat for batch with sva.

Any insights? Do I really have to treat for batch effect in this case?

Thank you very much.

RNA-Seq edgeR batch-effect • 1.5k views

ADD COMMENT • link updated 10 days ago by Ram 43k • written 3.9 years ago by avino ▴ 20

0

Entering edit mode

It sounds like your 2 batches have different conditions. In other words, none of the samples in either batch should be similar. Is that correct?

ADD REPLY • link 3.9 years ago by igor 13k

0

Entering edit mode

that is correct! Thanks

ADD REPLY • link 3.9 years ago by avino ▴ 20

score 2 · Answer 1 · 2020-06-12

2

Entering edit mode

3.9 years ago

i.sudbery 19k

To correct for batch effects, it is neccesary to have some experimental conditions that are present in both batches, which you don't have here. So you cannot correct for batch effects in your experiment. There is no algorithm, not edgeR, or combat, that can remove a batch effect when it can't compare two things that should look the same in two batches. Whether or not this is a problem depends on the question you are trying to ask of the data. There are questions you can ask of this dataset, without batch effect being an issue.

What is the contrast you are hoping to calculate?

ADD COMMENT • link 3.9 years ago by i.sudbery 19k

0

Entering edit mode

we have notice that if we look at supposedly stable genes (housekeeping or others), there is approximately a 2-fold difference between the expression levels (counts) of the 2 experiments, so, if I cannot accomplish batch effect because of the right issues you are saying, would it be safe to normalize the counts of both the original experiments based on these stable gene expression?

ADD REPLY • link 3.9 years ago by avino ▴ 20

1

Entering edit mode

No.

These two experiments cannot be easily normalised to each other. Just normalising to one or two housekeeping genes is generally not sufficient.

What I am saying is that it depends what quesiton you are asking.

With this data you can do: X4OHT_Wnt3A vs DMSO_Wnt3A ( contrast: X4OHT_Wnt3A - DMSO_Wnt3A) X4OHT_Jour2 vs DMSO_Jour2 ( contrast: X4OHT_Jour2 vs DMSO_Jour2)

Difference in the effect of X4OHT on Wnt3A and Jour2 (contrast: (X4OHT_Wnt3A - DMSO_Wnt3A) - ( X4OHT_Jour2 vs DMSO_Jour2))

Average effect of X4OHT across both genotypes (contrast: (X4OHT_Wnt3A + X4OHT_Jour2)/2 - (DMSO_Wnt3A + DMSO_Jour2)/2)

These experiments may be more or less difficult to code up in designs and contrasts, but they are conceputally possible.

You will never be able to do: X4OHT_Wnt3A vs X4OHT_Jour2 DMSO_Wnt3A vs DMSO_Jour2 or Average effect of Wnt3A across treatments

The experiment is just not designed to allow those comparisons and no amount playing with normalization or batch removal will change that.

You say that "supposedly stable" genes have a two fold change, but you have no real idea if that is because there is a batch effect or because Wnt3A has an effect on these "supposedly stable" genes.

ADD REPLY • link 3.9 years ago by i.sudbery 19k

0

Entering edit mode

Great...I appreciate the time you took to fully explained me that...it was nice of you! Thank you so much.

When you say I could see difference and average of X4OHT..., is there any resource you could address me to conceptually understand that?

ADD REPLY • link 3.9 years ago by avino ▴ 20

0

Entering edit mode

So the difference in the effect of X4OHT means -

If in the Jour2 samples, the difference in expression for a gene is 2 fold greater in X4OHT than DMSO and in the Wnt3A samples the same gene 6 fold greater expressed in X4OHT than in DMSO, then the difference in the effect between the Jour2 and Wnt3A samples is 4 fold: X4OHT has a 4-fold greater effect in Wnt3A than it does in the Jour2 samples.

To read more about this, have a search for "Difference in differences" designs.

In the average effect case, we basically just ignore the Wnt3A/Jour2 distinction, (or, at least, we just treat it as a blocking factor), So you'd conceptually average the all X4OHT samples and then divide by the average of all the DMSO samples. Its the "Treatment main effect", with Wnt3A/Jour as a blocking factor.

ADD REPLY • link 3.9 years ago by i.sudbery 19k

0

Entering edit mode

Thank you so much! just for clarification when I say 'jour2' aka DMSO and XOHT jour 2, in reality those samples are called DMSO and X4OHT without adding Wnt3A (DMSO_no Wnt3A jour2 and X4OHT_no Wnt3A jour 2)...I guess this does not change anything to the whole thing...we still have two original experiments that do no not have any condition in common...

ADD REPLY • link 3.9 years ago by avino ▴ 20

0

Entering edit mode

Do you think the following is a correct design formula for a "Difference in differences" case: genotype+Treatment+genotype:Treatment whereas genotype refers if the samples are X4OHT or DMSO and treatment would be with (+Wnt3A) or without Wnt3A (no Wnt3A jour 2), provided that the contrast would be exactly to characterize a situation described in your example above. Thank you so much!!

ADD REPLY • link 3.9 years ago by avino ▴ 20

1

Entering edit mode

I think the advice with this sort of model is generally to have a seperate "group" for each combination (so there would be 4 possible values in Group: X4OHT_jour2, DMSO_jour2, X4OHT_Wnt3A, DMSO_Wnt3A), and then build the model matrix with model.matrix(0+group).

You then specify the contrast with makeContrasts(( X4OHT_Wnt3A - DMSO_Wnt3A) - (X4OHT_jour2 - DMSO_jour2)).

I think using the design formula ~genotype + Treatment + genotype:Treatment such work, but figuring out the contrast will be harder.

ADD REPLY • link 3.9 years ago by i.sudbery 19k