Question

Finding differential abundance in protein spectral count data



9.8 years ago

benaneely ▴ 70

I have recently been exposed to the more sophisticated normalization and significance-testing methods used to analyze NGS data, and I can see the similarities with peptide spectral count data. That said, I can't seem to make a reasonable choice based on whether the key assumptions are being met. Spectral count data are problematic because they contain many zeros and are not normally distributed. For comparing two groups, the literature suggests a rank sum test (unpaired samples) or a signed rank test (paired samples), or a t-test on log-transformed data.
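For readers who want to see the paired nonparametric route spelled out, here is a minimal sketch of a per-feature Wilcoxon signed-rank test followed by Benjamini-Hochberg FDR adjustment. It is an illustration only, not the analysis used in this post: the function names and the toy values are hypothetical, and the p-value uses the large-sample normal approximation without continuity correction, which is an assumption that seems reasonable at n = 80 but not for small n.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_pvalue(tumor, normal):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    # Paired differences; zero differences are dropped (Wilcoxon's convention).
    diffs = [t - n for t, n in zip(tumor, normal) if t != n]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group_size = j - i + 1
        tie_term += group_size**3 - group_size
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    if variance <= 0:
        return 1.0
    z = (w_plus - mean) / sqrt(variance)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy data: two features, ten tumor/normal pairs each.
toy = {
    "protein_up": ([12, 15, 9, 20, 14, 18, 11, 16, 13, 19],
                   [5, 7, 3, 8, 4, 6, 2, 1, 0, 0]),
    "protein_flat": ([4, 6, 5, 7, 3, 8, 2, 9, 1, 10],
                     [5, 5, 6, 6, 4, 7, 3, 8, 2, 9]),
}
raw_p = [signed_rank_pvalue(t, n) for t, n in toy.values()]
fdr = dict(zip(toy, benjamini_hochberg(raw_p)))
```

With ~800 features the same two calls would simply run once per feature, with the adjustment applied across all resulting p-values at the end.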

My specific data set consists of paired tumor/normal samples (~800 features, n = 80). Using total ion count normalized data, I have evaluated the data with a signed rank test and, using the log2 ratio of each pair, performed a moderated t-test via limma. (I saw this used by Castello et al., 2012, "Insights into RNA Biology from an Atlas of Mammalian mRNA-Binding Proteins", but the log2 data are still not completely normal.) After FDR correction, the results make sense given the experiment and follow-up experiments. However, the beauty of edgeR and DESeq2 is that the normalization is part of the procedure and is carried into the model, so I would need to use non-normalized count data (which I haven't done yet). Alternatively, I saw an excellent discussion of the similarities between NSAF data (normalized spectral abundance factor; this essentially normalizes spectral counts to protein size, similar in concept to RPKM) and GeneChip data (Pavelka et al., 2008, "Statistical similarities between transcriptomics and quantitative shotgun proteomics data"). That paper used the similarity as the rationale for applying PLGEM (Power Law Global Error Model) to NSAF data. Lastly, I have thought about using the voom-limma approach on the count data directly (non-normalized, I assume).
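As a concrete illustration of the NSAF idea mentioned above, a minimal sketch follows. It assumes the usual definition, in which each protein's spectral count is divided by its length and the result is divided by the sum of those values over all proteins in the sample. The function names, the `pseudocount` parameter, and the toy numbers are hypothetical; a pseudocount is one common way to keep zero counts from producing undefined log ratios, not something prescribed in this post.

```python
from math import log2


def nsaf(spectral_counts, lengths, pseudocount=0.0):
    """NSAF for one sample: (SpC / L) divided by the sum of SpC / L."""
    # Spectral abundance factor: count scaled by protein length.
    saf = {p: (c + pseudocount) / lengths[p] for p, c in spectral_counts.items()}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("all spectral counts are zero; NSAF is undefined")
    return {p: value / total for p, value in saf.items()}


def log2_ratios(tumor_nsaf, normal_nsaf):
    """Per-protein log2(tumor / normal) for one matched pair of samples."""
    # Requires strictly positive values, hence the pseudocount upstream.
    return {p: log2(tumor_nsaf[p] / normal_nsaf[p]) for p in tumor_nsaf}


# Hypothetical toy pair: three proteins, lengths in residues.
lengths = {"P1": 100, "P2": 400, "P3": 250}
tumor_counts = {"P1": 10, "P2": 20, "P3": 0}
normal_counts = {"P1": 5, "P2": 20, "P3": 5}

tumor = nsaf(tumor_counts, lengths, pseudocount=1.0)
normal = nsaf(normal_counts, lengths, pseudocount=1.0)
pair_ratios = log2_ratios(tumor, normal)
```

Because NSAF values sum to 1 within a sample, they are compositional: a large change in one abundant protein shifts the values of all the others, which is worth keeping in mind when interpreting the ratios.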

There is obvious signal in this data set, but I would hate to miss an opportunity to use a newer method of analysis. Any thoughts or recommendations are very welcome.

EdgeR proteomics DESeq2 PLGEM voom-limma • 4.6k views

updated 2.3 years ago by Ram 43k • written 9.8 years ago by benaneely ▴ 70



I hope someone with experience with this type of data replies. For what it's worth, though, moderate violations of normality are generally not a big deal for linear-model-based statistics (e.g., those used by limma), particularly with a sample size such as yours.

9.8 years ago by Devon Ryan 104k