A Theoretical Question About RNA-Seq Normalization
6.4 years ago
maegsul ▴ 170

Let's assume you want to conduct an eQTL study: you have RNA-Seq data for 100 samples, but genotyping information for only 70 of them, and you want to perform the eQTL mapping with the data that is available right now.

One step here is to normalize expression values between samples (for example, using TMM from the edgeR package). Another step is to QC the expression data per gene based on expression and read-count thresholds, keeping only genes for which at least 20% of your samples surpass these thresholds (similar to the GTEx pipeline); a sketch of both steps is shown below.
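For concreteness, a minimal sketch of those two steps in R with edgeR; the object names (`counts`, `tpm`) and the exact cutoff values are my own placeholders, not a fixed recipe:

    library(edgeR)

    # Assumptions: `counts` = raw read counts (genes x samples),
    # `tpm` = matching TPM matrix; the thresholds below are illustrative placeholders
    min_frac  <- 0.20   # fraction of samples that must pass both thresholds
    tpm_cut   <- 0.1    # expression threshold (TPM)
    count_cut <- 6      # read-count threshold

    n_pass <- ceiling(min_frac * ncol(counts))
    keep   <- rowSums(tpm > tpm_cut & counts >= count_cut) >= n_pass

    dge <- DGEList(counts = counts[keep, ])
    dge <- calcNormFactors(dge, method = "TMM")  # TMM scaling factors

    # Normalized expression as log2-CPM using the TMM-adjusted library sizes
    expr_norm <- cpm(dge, log = TRUE)

Whether `counts` holds 100 or 70 columns at this point is exactly the choice I am asking about below.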

Would you do all of this using the 100 samples with expression data, or only the subset (n=70) with both expression and genotype data? In the future all 100 samples may be genotyped, but my question concerns the situation today. Which of these two options would be more biologically accurate?

1) Use only the expression data of the 70 samples with genotype information and conduct the downstream analyses.

2) Use all 100 samples for expression normalization and the threshold-based QC described above, but then extract the QC'd and normalized data for only the 70 genotyped samples for the genotype-expression analysis, for now. Whenever new samples are genotyped, rerun the analysis on the same gene expression quantification table (normalized and QC'd), changing only the genotype data as it becomes available.

RNA-Seq eQTL • 1.7k views

You should specify the purpose of normalization. Why do you want to normalize these values in the first place?

What comparisons do you intend to perform?

In general, normalization should be applied when there is reason to expect that the differences between samples are few and isolated, and that most of the data do not change radically. As you keep adding more and more samples, it becomes less likely that these assumptions hold robustly, which may undermine the validity of the method.

Thank you for your answer. The purpose of normalization here (i.e. the TMM method) is to account for library size variation between samples prior to the analysis. It's a standard approach for most differential expression studies, but it is also used in eQTL studies (see: https://www.gtexportal.org/home/documentationPage#staticTextAnalysisMethods , under "eQTL analysis" - Expression). I then plan to identify putative regulatory variants (i.e. eQTLs) that are correlated with gene expression, using the corresponding genotype data; a minimal sketch of that step is below.
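To make that concrete, a bare-bones version of the genotype-expression test could look like the sketch below. `expr_norm` is the TMM-normalized log2-CPM matrix restricted to the 70 genotyped samples, `geno` is a hypothetical SNP x sample matrix of 0/1/2 allele dosages with matching sample order, and in practice I would use a dedicated tool (e.g. MatrixEQTL) plus covariates rather than plain lm():

    # Hypothetical sketch: test one SNP against one gene with a linear model
    # expr_norm : genes x samples (TMM-normalized log2-CPM, 70 genotyped samples)
    # geno      : SNPs x samples, allele dosages coded 0/1/2 (same sample order)
    test_pair <- function(gene_id, snp_id) {
      fit <- lm(expr_norm[gene_id, ] ~ geno[snp_id, ])
      summary(fit)$coefficients[2, c("Estimate", "Pr(>|t|)")]  # slope and p-value
    }

    # Example call with made-up identifiers
    test_pair("ENSG00000139618", "rs123456")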

6.4 years ago

I would go with option 1, since otherwise you are letting the 30 libraries without genotype data affect the normalisation - although I doubt it would have a huge impact with the number of samples you are analysing.
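If you want to check how much those 30 extra libraries actually move things, one rough sanity check (a sketch, assuming a `counts` matrix for all 100 samples and a logical vector `has_genotype` marking the 70 genotyped ones) is to compute TMM factors with and without them and compare the factors for the shared samples:

    library(edgeR)

    # Assumed objects: `counts` (genes x 100 samples), `has_genotype` (logical, length 100)
    nf_all <- calcNormFactors(DGEList(counts = counts),
                              method = "TMM")$samples$norm.factors
    nf_sub <- calcNormFactors(DGEList(counts = counts[, has_genotype]),
                              method = "TMM")$samples$norm.factors

    # Compare normalisation factors for the 70 shared samples;
    # the reference library may differ between runs, so this is only approximate
    plot(nf_all[has_genotype], nf_sub,
         xlab = "TMM factor (all 100 samples)",
         ylab = "TMM factor (70 genotyped samples)")
    abline(0, 1)

If the points sit close to the diagonal, the extra libraries are not changing the normalisation in any meaningful way.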
