Question

General question about subsetting data for RNA-Seq DESeq2 analysis

6.5 years ago

swbarnes2 14k

I have an experiment that is split into two halves. The experimental compound used is the same in each half, same type of sample, but the other conditions are different. When I do PCA, or look at the Euclidean distance between the samples, they definitely split into two halves. Would it be preferred for me to split the data into two halves when looking for DE genes, or should I try to keep them together as much as possible? I feel like I ought to use all the data I have together, but there are subsets of samples I want to compare to each other, and the designs I want to use are "not full rank" in the complete dataset. (Maybe I need a cute trick with the column data to make the design I want legal?)

Would it be okay for me to do it both ways, and keep the set of results that has lower p-values? Or does it not much matter, since the p-values are very very low, and the top scoring genes are pretty much the same either way?
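As an illustration of the "not full rank" problem mentioned above, here is a minimal sketch in plain Python (not DESeq2 itself). The sample layout, the factor names `day`, `other`, `compound`, and `group`, and both helper functions are hypothetical. It assumes the "other conditions" are completely confounded with day, which makes two columns of the design matrix identical. Collapsing the factors of interest into one combined label is one common way to get a full-rank design; whether it suits this experiment depends on details not given in the thread.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a non-zero pivot in this column, at or below row `rank`.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples, factors):
    """Intercept plus 0/1 dummy columns; the first level of each factor is the reference."""
    levels = {f: sorted({s[f] for s in samples}) for f in factors}
    rows = []
    for s in samples:
        row = [1]
        for f in factors:
            row += [1 if s[f] == lvl else 0 for lvl in levels[f][1:]]
        rows.append(row)
    return rows


# Hypothetical layout: every day-1 sample has condition A, every day-2 sample has B.
base = [
    {"day": "d1", "other": "A", "compound": "ctrl"},
    {"day": "d1", "other": "A", "compound": "trt"},
    {"day": "d2", "other": "B", "compound": "ctrl"},
    {"day": "d2", "other": "B", "compound": "trt"},
]
# Two replicates per cell, plus a combined day/compound label.
samples = [dict(s, group=s["day"] + "_" + s["compound"]) for s in base for _ in range(2)]

confounded = design_matrix(samples, ["day", "other", "compound"])
grouped = design_matrix(samples, ["group"])

print("day + other + compound: columns =", len(confounded[0]), "rank =", matrix_rank(confounded))
print("group:                  columns =", len(grouped[0]), "rank =", matrix_rank(grouped))
```

A design is full rank when its rank equals its number of columns. In the first design the `day` and `other` dummy columns are identical, so their effects cannot be estimated separately; the single `group` factor avoids this and still allows comparisons between specific subsets of samples.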

RNA-Seq DESeq2 • 1.8k views


Hi, I have never before considered dividing a dataset in the way that you describe. I'd be more interested in figuring out why there is a divide in the first place. On that note, what percentage of the variation is explained by PC1 (or the PC along which the samples divide)?

Check all experimental/clinical factors to see whether any of them are responsible. Also check gender.

There has to be a logical reason to explain the divide.

Please post an image too, if you can.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

The information I have on the experimental set-up and conditions is very limited. The first component, which splits the samples by the day of the Illumina run, accounts for 79% of the variance. But as far as I can tell, there were other conditions that differ between the samples run on the two days, so maybe the difference is biological, and not batch effect.

ADD REPLY • link 6.5 years ago by swbarnes2 14k

That's quite telling, i.e., 79% variation and the sample groups are divided by the day on which they were run. It's either a huge batch effect or a completely different biological condition... impossible to tell from where I am, of course!

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

But my question is: is there any way to know from the math alone whether I'm better off splitting the samples or keeping them together? Does it depend on whether the difference is expected to be biological or a batch effect? Or should I not worry about it if the top 100 genes are mostly the same either way?

ADD REPLY • link 6.5 years ago by swbarnes2 14k

You pretty much can't proceed (or at least should not) until you are sure of what happened. It's not normal to see that much variation between two sample groups. The natural assumption would be that it's a huge batch effect.

If you want, re-do the normalisation but include 'Day' (on which samples were sequenced) in the design model. This may completely remove the effect.

Also, if you're using DESeq2's PCA function, then change to use my code ( A: PCA plot from read count matrix from RNA-Seq ). DESeq2's function filters out a large chunk of genes based on low variation prior to doing the PCA, thus, it maximises/exaggerates differences in your cohort.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

Even doing PCA with all the genes, instead of the top 500 most variable ones, the batch effect is still 61%. But you think I gain more by keeping all the samples and putting "day" into the design, instead of subsetting and only comparing same-day samples?
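To illustrate why the PC1 percentage drops when all genes are used, here is a rough standard-library sketch on simulated data. The matrix, the gene counts, the effect size, and the helpers `pc1_share` and `make_gene` are all made up for illustration; none of it comes from the dataset discussed here. It computes the share of variance on PC1 twice: once on the most variable genes only and once on every gene.

```python
import random
from statistics import variance


def pc1_share(matrix):
    """Fraction of total variance carried by PC1 (matrix is genes x samples)."""
    n = len(matrix[0])
    # Centre each gene across samples.
    centered = [[x - sum(row) / n for x in row] for row in matrix]
    # Sample-by-sample Gram matrix; its eigenvalues are proportional to PC variances.
    gram = [[sum(g[i] * g[j] for g in centered) for j in range(n)] for i in range(n)]
    trace = sum(gram[i][i] for i in range(n))
    # Power iteration to approximate the largest eigenvalue.
    v = [float(i + 1) for i in range(n)]
    for _ in range(200):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    top = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n))
    return top / trace


random.seed(0)
N_PER_DAY = 4  # hypothetical: 4 samples per sequencing day


def make_gene(day_shift):
    """Simulated log-scale values: noise plus an optional shift on day 2."""
    return [
        random.gauss(0, 0.3) + (day_shift if i >= N_PER_DAY else 0.0)
        for i in range(2 * N_PER_DAY)
    ]


# 20 genes respond to day, 180 genes are noise only.
matrix = [make_gene(2.0) for _ in range(20)] + [make_gene(0.0) for _ in range(180)]

top_genes = sorted(matrix, key=variance, reverse=True)[:20]
share_top = pc1_share(top_genes)
share_all = pc1_share(matrix)

print(f"PC1 share, top 20 most variable genes: {share_top:.2f}")
print(f"PC1 share, all 200 genes:              {share_all:.2f}")
```

Restricting to the most variable genes drops the genes that contribute mostly noise, so the dominant axis takes a larger share of what remains. A large share on all genes, as reported above, means the split is not an artefact of gene selection.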

ADD REPLY • link 6.5 years ago by swbarnes2 14k

Yes, I meant including 'day' in the DESeq2 design model when normalizing. This may completely eliminate the effect. If it eliminates the effect, then just continue with the analysis as normal and forget about 'day', i.e., don't even do any subsetting. If it does not eliminate the effect, then breaking up the dataset could be an option.

I'm limited here to whatever information you're passing to me, of course.
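A small sketch of what putting 'day' in the design does when testing for differences, using ordinary least squares on made-up log2 values for a single gene. This is a stand-in for illustration only: DESeq2 fits a negative binomial GLM, and the numbers, factor levels, and helper names below are hypothetical. It also assumes both compound levels were run on both days, since a day term cannot be separated from a condition that is fully confounded with it.

```python
from statistics import mean

# Hypothetical log2 expression for one gene: (day, compound, value).
obs = [
    ("d1", "ctrl", 5.0), ("d1", "ctrl", 5.2),
    ("d1", "trt", 6.0), ("d1", "trt", 6.2),
    ("d2", "ctrl", 9.0), ("d2", "ctrl", 9.2),
    ("d2", "trt", 10.0), ("d2", "trt", 10.2),
]

days = sorted({d for d, _, _ in obs})
compounds = sorted({c for _, c, _ in obs})

grand_mean = mean(v for _, _, v in obs)
day_mean = {d: mean(v for dd, _, v in obs if dd == d) for d in days}
compound_mean = {c: mean(v for _, cc, v in obs if cc == c) for c in compounds}


def cell_mean(day, compound):
    return mean(v for d, c, v in obs if d == day and c == compound)


# Compound effect with day as a blocking factor: average of within-day differences.
effect = mean(cell_mean(d, "trt") - cell_mean(d, "ctrl") for d in days)

# Residual sum of squares when day is ignored (model: ~ compound).
rss_without_day = sum((v - compound_mean[c]) ** 2 for _, c, v in obs)

# Residual sum of squares with day included (model: ~ day + compound).
# For a balanced layout the additive fit is day mean + compound mean - grand mean.
rss_with_day = sum(
    (v - (day_mean[d] + compound_mean[c] - grand_mean)) ** 2 for d, c, v in obs
)

print(f"estimated compound effect (log2): {effect:.2f}")
print(f"residual SS ignoring day:  {rss_without_day:.2f}")
print(f"residual SS with day term: {rss_with_day:.2f}")
```

In this balanced toy layout the estimated compound effect is the same either way, but the unexplained variation is far smaller once the day term absorbs the shift between runs. That is the usual argument for keeping all samples in one model rather than subsetting by day.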

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k