Question

Pre-processing of label-free MS proteomics data

6.6 years ago

JessieLenwell • 0

Dear all,

I have a label-free MS dataset (MaxQuant output) of intensities for approx. 1000 proteins (rows) across 95 samples (columns): 40 healthy and 55 diseased individuals, each a biological replicate.

I want to see if there are differentially expressed proteins between the 2 groups.

However, I’m very concerned that my data pre-processing steps have been insufficient. I don’t have anyone to ask for guidance in the group I’m working in: there are no statisticians and everyone just uses Perseus. I’ve figured out all the following steps on my own, and now I don’t know how to proceed without guidance/feedback from someone who is used to working with this kind of data in R.

I filtered out the contaminants, reverse hits, etc.
I also filtered out proteins that had more than 50% missing values.
I used the ‘Normalyzer’ package to try to help figure out which method of normalization would be best for my dataset and based on the results decided to do a median normalization for each sample.
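The two steps above (dropping proteins with more than 50% missing values, then median-normalizing each sample) can be sketched roughly as follows. This is a minimal illustration in plain Python rather than R, not the Normalyzer implementation; the function names, the toy protein IDs, and the choice to mark missing values as `None` are all assumptions made for the example.

```python
from statistics import median

def filter_by_missingness(rows, max_missing_fraction=0.5):
    """Keep proteins whose fraction of missing values (None) is at most the cutoff."""
    kept = {}
    for protein, values in rows.items():
        missing = sum(v is None for v in values)
        if missing / len(values) <= max_missing_fraction:
            kept[protein] = values
    return kept

def median_normalize(rows):
    """Divide each sample (column) by the median of its observed intensities."""
    n_samples = len(next(iter(rows.values())))
    col_medians = []
    for j in range(n_samples):
        observed = [vals[j] for vals in rows.values() if vals[j] is not None]
        col_medians.append(median(observed))
    return {
        protein: [None if v is None else v / col_medians[j]
                  for j, v in enumerate(vals)]
        for protein, vals in rows.items()
    }

# Toy data (hypothetical): 4 proteins x 4 samples, None = missing value.
intensities = {
    "P1": [100.0, 200.0, 400.0, 800.0],
    "P2": [50.0, None, 200.0, 400.0],
    "P3": [None, None, None, 80.0],   # 75% missing, so it is dropped
    "P4": [10.0, 20.0, 40.0, 80.0],
}

filtered = filter_by_missingness(intensities)
normalized = median_normalize(filtered)
```

After this step every sample has a median of 1 on the raw scale, which makes samples comparable but does not by itself deal with missing values.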

However, I don’t think this is enough to ‘fix’ my dataset. Some sources suggest that I need to remove biases using a housekeeping protein (e.g. actin, which I have found is present in all my samples), but I can’t find a description of exactly how to calculate this. I also read somewhere that I need to correct for protein sequence length. Does anyone have experience with these? Are they standard procedures in the pre-processing of proteomics MS data?
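One common reading of housekeeping normalization is a per-sample ratio: each protein's intensity is divided by the reference protein's intensity in the same sample. The sketch below illustrates only that arithmetic, in plain Python; it does not claim that this is the calculation the sources mentioned had in mind, or that it is appropriate for this dataset. The function name and the `"ACTB"` label for actin are hypothetical.

```python
def normalize_to_reference(rows, reference_protein):
    """Express every intensity as a ratio to the reference protein in the same sample."""
    reference = rows[reference_protein]
    normalized = {}
    for protein, values in rows.items():
        normalized[protein] = [
            # Leave the value missing if either intensity is missing or the reference is 0.
            None if v is None or not r else v / r
            for v, r in zip(values, reference)
        ]
    return normalized

# Toy data (hypothetical): 3 proteins x 3 samples, None = missing value.
intensities = {
    "ACTB": [200.0, 400.0, 100.0],  # assumed housekeeping protein
    "P1": [50.0, 100.0, 25.0],
    "P2": [20.0, None, 40.0],
}

ratios = normalize_to_reference(intensities, "ACTB")
```

On a log2 scale the same operation becomes a subtraction of the reference protein's log2 intensity from every value in that sample.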

The median normalization leaves me with zero values in my dataset, which I think might interfere with the downstream packages (limma & ROTS) that I want to use, as they expect log2-transformed values. limma’s lmFit function takes whatever matrix it is given, but I think it expects the values to be log2 transformed. ROTS doesn’t specify that I should log2 transform my dataset, but when I use the ROTS command, I get a warning saying that it suspects the data is not in log2 scale…which is worrying. Does anyone have any experience with this? Do you use these packages on your proteomics MS data? How do you prepare your data for use with these packages?
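The zeros survive median normalization because 0 divided by any median is still 0, and the log2 of 0 is minus infinity. A minimal sketch of one way around this, assuming the zeros stand for proteins that were not quantified rather than for true zero abundance: mark them as missing, log2 transform, and then median-center each sample on the log scale. It is written in plain Python for illustration, and the function names are hypothetical.

```python
import math
from statistics import median

def log2_with_missing(values):
    """log2-transform intensities, treating 0 and None as missing (None)."""
    return [None if v is None or v <= 0 else math.log2(v) for v in values]

def center_on_median(values):
    """Subtract the median of the observed values, leaving missing entries as None."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [None if v is None else v - m for v in values]

# One sample's intensities for four proteins (toy values).
raw = [1024.0, 0.0, 256.0, None]

logged = log2_with_missing(raw)      # zeros become missing instead of -inf
centered = center_on_median(logged)  # the sample median is now 0 on the log2 scale
```

Subtracting the median on the log2 scale is the counterpart of dividing by the median on the raw scale, so the normalization choice is unchanged; only the order of operations differs.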

Nevertheless, I’ve been able to run 2 independent analyses (using limma and ROTS) to identify differentially abundant proteins using my median-normalized, non-log transformed dataset but I don’t think I can trust the top table results for all the reasons outlined above.

I’m aware that imputation of missing values might help solve the -Inf issue that I have when/if I need to log2 transform. I’ve been reading so much about the different kinds of missing data (MAR, MCAR, MNAR) and all the different procedures to impute, and it’s quite overwhelming. From what I gather, imputation should be carried out after normalization, and if you’re not sure of the reason for the missing values (as is the case for my dataset) it’s best to use a MAR/MCAR method. Is it standard procedure to impute? How do you handle missing values in your dataset?

Any guidance or feedback about any of the above issues would help me very much. Thanks in advance.

proteomics Preprocessing label-free MS • 2.1k views

score 1 · Answer 1 · 2017-10-02

6.6 years ago

theobroma22 ★ 1.2k

Starting with the raw data values, you could mean-center and scale the data (optionally after a log10 transformation) and then run a principal components analysis.
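As a rough sketch of that suggestion: log10-transform, standardize each protein to mean 0 and standard deviation 1, then extract the first principal component. The version below uses plain Python and a simple power iteration instead of a statistics library, so it is an illustration of the idea rather than a recommended implementation; the toy intensities and all function names are assumptions.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """Mean-center a protein's values and scale them to unit standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def first_principal_component(samples, iterations=200):
    """Unit-length loadings of the first PC, via power iteration on X^T X.

    Assumes the starting vector is not orthogonal to the first PC.
    """
    n_features = len(samples[0])
    v = [1.0 / (j + 1) for j in range(n_features)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in samples]
        new_v = [sum(row[j] * s for row, s in zip(samples, scores))
                 for j in range(n_features)]
        norm = math.sqrt(sum(w * w for w in new_v))
        v = [w / norm for w in new_v]
    return v

# Toy raw intensities: rows are samples (2 healthy, then 2 diseased), columns are proteins.
raw = [
    [1.0e5, 1.0e6, 1.0e4],
    [1.6e5, 1.3e6, 1.3e4],
    [1.0e6, 1.0e7, 1.0e4],
    [1.6e6, 1.3e7, 1.3e4],
]

log_matrix = [[math.log10(x) for x in row] for row in raw]
standardized_cols = [standardize(list(col)) for col in zip(*log_matrix)]
standardized = [list(row) for row in zip(*standardized_cols)]

loadings = first_principal_component(standardized)
scores = [sum(x * w for x, w in zip(row, loadings)) for row in standardized]
```

In this toy example the two healthy samples and the two diseased samples end up on opposite sides of zero along the first component, which is the kind of separation (or batch effect) a PCA plot is meant to reveal.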
