Question

Data selection for RNA-seq data analysis

0

6.4 years ago

Arindam Ghosh ▴ 510

My work involves downloading RNA-seq data from NCBI SRA and analysing it to find differentially expressed (DE) genes. In such a case, is it advisable to combine data from different sequencers, for example data sequenced on Illumina HiSeq 1500, 2000 and 2500? What about data from the same sequencer but with different library preparation methods? I was wondering whether we could pre-process, align and count each dataset separately and then go on to DE analysis.

RNA-Seq ngs data • 1.5k views

updated 6.4 years ago by ivivek_ngs ★ 5.2k • written 6.4 years ago by Arindam Ghosh ▴ 510

score 2 · Answer 1 · 2017-11-22

It depends on the question your study is asking. If it is a data-driven study that tries to account for sequencing batches, then Approach 1 is better suited. If it is more in line with a biological hypothesis, Approach 2 is ideal when Approach 1, even after correction, does not yield a meaningful answer to the biological question you are trying to address. Here are a few suggestions.

Approach 1:

If you want to interrogate a specific study whose data come from different machines, were sequenced by different operators and used different library preparations, you risk batch effects. If you have a priori information about the batches in the data, you can model them using ComBat; if not, you will need something like SVA or RUVSeq.
To do this, download the raw FASTQ files for the study you are interested in from SRA. Quantify all the samples together with the aligner or mapper of your choice, providing the correct library type (libType), since tools like Salmon/Kallisto need this information.
Prepare your metadata files with information about tissue type, operator, batch and libType. Once you have the count table for all your data, you can normalize the counts to logCPM and make a PCA biplot or MDS plot to see whether the samples separate by your biological groups or by batch. If batch dominates, you will have to either correct for it or include the batches as covariates when you perform your DE analysis. This is possible, but keep in mind that if batch and libType are very strong confounders, the correction will not work well and there is a risk of overfitting.
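
As a rough illustration of the exploratory check described above, here is a minimal Python sketch (not part of the original answer). It assumes a simplified log2(CPM + 1) transform rather than the exact logCPM that edgeR computes, and it replaces the visual PCA/MDS inspection with a crude numeric comparison of sample-to-sample distances. The sample names, counts and labels are made-up toy values.

```python
import math
from itertools import combinations

def log_cpm(counts, pseudo=1.0):
    """counts: dict of sample -> list of raw gene counts (same gene order).
    Returns log2(CPM + pseudo) per sample; a simplification, not edgeR's formula."""
    out = {}
    for sample, values in counts.items():
        lib_size = sum(values)
        out[sample] = [math.log2(v / lib_size * 1e6 + pseudo) for v in values]
    return out

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_within_group_distance(expr, labels):
    """Mean distance over all sample pairs that share the same label."""
    dists = [euclidean(expr[s], expr[t])
             for s, t in combinations(sorted(expr), 2)
             if labels[s] == labels[t]]
    return sum(dists) / len(dists)

# Toy data (hypothetical): 4 samples, 4 genes
counts = {
    "s1": [100, 200, 300, 400],
    "s2": [110, 190, 310, 390],
    "s3": [400, 300, 200, 100],
    "s4": [390, 310, 190, 110],
}
batch = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
condition = {"s1": "ctrl", "s2": "treat", "s3": "ctrl", "s4": "treat"}

expr = log_cpm(counts)
d_batch = mean_within_group_distance(expr, batch)
d_cond = mean_within_group_distance(expr, condition)

# If samples sit closer to their batch mates than to their condition mates,
# batch is probably driving the structure you would see in a PCA/MDS plot.
print(d_batch < d_cond)  # True for this toy data
```

On real data you would still want to look at the actual PCA or MDS plot; this comparison only hints at which factor dominates.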

Approach 2:

Alternatively, one can perform the DE analysis separately for each lab or study (provided each study has enough samples for DE analysis), then compare which DEGs are in common and use those to reason about the biological question you want to address. Keep in mind that the overlap may be low.
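
The comparison step in Approach 2 could be sketched as follows, assuming you already have one list of DEG identifiers per study from whatever DE tool you used. The study names and gene lists here are hypothetical, and the Jaccard index is just one convenient way to summarize how small the overlap is.

```python
def deg_overlap(deg_lists):
    """deg_lists: dict of study -> list of DEG identifiers.
    Returns (genes shared by all studies, Jaccard index of the overlap)."""
    sets = [set(genes) for genes in deg_lists.values()]
    shared = set.intersection(*sets)
    union = set.union(*sets)
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Hypothetical DEG lists from two separately analysed studies
degs = {
    "study_A": ["TP53", "MYC", "EGFR", "GAPDH"],
    "study_B": ["MYC", "EGFR", "BRCA1"],
}

shared, jaccard = deg_overlap(degs)
print(sorted(shared), jaccard)
```

A low Jaccard value does not by itself mean the studies disagree; differing power and thresholds between studies can also shrink the overlap.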

It is a very broad question. For now I can suggest these two approaches, but unless you interrogate the data and perform a preliminary exploratory analysis, it is difficult to say which one fits. If the data are very homogeneous and batch effects do not mask the real biological differences, Approach 1 should work well for a meaningful hypothesis, and for that matter so should Approach 2.

score 1 · Answer 2 · 2017-11-22

1

6.4 years ago

WouterDeCoster 47k

If you mean you can compare group A with library prep 1 on HiSeq 1500 versus group B with library prep 2 on HiSeq 2000: no, the technical variability between sequencers (and definitely between kits) is too big. Better to keep everything the same and only compare within-run/within-experiment.
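
The situation described above, where each group comes from its own library prep and sequencer, is a fully confounded design. A minimal sketch of how one might check a sample sheet for this before attempting any comparison is given below; the function name and the sample tuples are hypothetical, and the check only covers complete confounding, not partial imbalance.

```python
from collections import defaultdict

def is_fully_confounded(samples):
    """samples: list of (condition, batch) tuples.
    True if there are several conditions but no batch contains more than one,
    so condition and batch effects cannot be separated."""
    conditions_per_batch = defaultdict(set)
    all_conditions = set()
    for condition, batch in samples:
        conditions_per_batch[batch].add(condition)
        all_conditions.add(condition)
    return len(all_conditions) > 1 and all(
        len(conds) == 1 for conds in conditions_per_batch.values()
    )

# Hypothetical sample sheets
confounded = [("A", "prep1_HiSeq1500")] * 3 + [("B", "prep2_HiSeq2000")] * 3
crossed = [("A", "run1"), ("B", "run1"), ("A", "run2"), ("B", "run2")]

print(is_fully_confounded(confounded))  # True
print(is_fully_confounded(crossed))     # False
```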

0

It largely boils down to what the OP wants to study, be it technical variability that has to be modeled or biological variability. But yes, different library preps, operators, sequencers and kits will certainly have an impact on the data and will mask your real biological differences. This comes on top of the heterogeneity of the samples themselves. So a proper understanding of these factors is required to reduce those effects. But first, state your query a bit more specifically: is it just DE for your study, or DE that also has to account for the effects of the confounders?

6.4 years ago by ivivek_ngs ★ 5.2k