Question

Standardise values in R data frame

1

Entering edit mode

5.4 years ago

paolo002 ▴ 160

Hi all

I have a data frame with various SNPs and other columns with some information and values such as the following:

SNP ID.      location    chromosome    values column A.        values column B
rs8662689     78654         1             0.6432                    0.2458
rs753279     1009753        7            -1.6434                    1.9876
rs4331780     2086433       22            4.521                    -3.743
and so on......                 .....                 ......

I would like to standardise the values in column B. I understand I have to divide the column B in various frequency bins ( I have 20x10^6 rows so I guess I need to set a bin value by which to divide, for instance 1000). Then I would like to calculate the mean of each bin and the standard deviation and divide each value in column B by that particular mean and standard deviation calculated previously in each bin and create a new column. Anybody knows how to do this by a R code?

I have written something like this but it does not seem to work, might be wrong:

library(dplyr)
n_bins = 1000
outscore = df %>% mutate(bin=ntile(mean(df$valuesB),n_bins)) %>% 
  group_by(bin) %>% mutate(zscore=scale(mean()),outlier=abs(zscore)>1.7)

Any help highly appreciated. Thanks

R • 1.8k views

ADD COMMENT • link 5.4 years ago by paolo002 ▴ 160

0

Entering edit mode

for z score calculation, R has scale function.

ADD REPLY • link 5.4 years ago by cpad0112 21k

0

Entering edit mode

Yes, I know, but if I apply only scale to the column it calculates the scores for each value of column B but it is not dividing by frequency bins

ADD REPLY • link 5.4 years ago by paolo002 ▴ 160

0

Entering edit mode

Not clear what you want to do, maybe provide example input data, and expected output? Also, this ntile(mean(df$valuesB),n_bins) creates only 1 bin, and this scale(mean()) meant to be scale(mean(valuesB)) ?

ADD REPLY • link 5.4 years ago by zx8754 11k

0

Entering edit mode

Apologies for the not so clear explanation Here is a sample of my data, first 20 rows:

      Location  iHH_A1  iHH_B1  iHH_P1    XPEHH
1  rs117712813 16204.6 12744.9 13735.6 0.240170
2  rs186953131 16276.3 13073.5 13948.8 0.219126
3    10:127358 16276.3 13073.5 13948.8 0.219126
4    10:127758 16276.3 13073.5 13948.8 0.219126
5   rs78617290 16276.3 12984.2 13895.4 0.225974
6  rs111769723 16276.3 13006.1 13907.8 0.224292
7    rs9419541 15830.9 12618.6 13501.5 0.226793
8    rs2928098 14876.2 11812.6 12623.9 0.230596
9  rs113714579 15722.9 12533.7 13405.7 0.226698
10   rs2928100 15398.9 12197.0 13079.0 0.233101
11   10:128988 16814.9 13470.8 14409.6 0.221744
12   10:129015 16814.9 13470.8 14409.6 0.221744
13   rs9419543 16771.4 13330.5 14294.4 0.229620
14 rs117825552 16898.4 13300.9 14311.2 0.239391
15 rs111736399 16970.0 13577.1 14505.8 0.223062
16   10:129400 16970.0 13494.7 14448.5 0.229148
17 rs144184182 16970.0 13529.7 14467.8 0.226555
18 rs189429487 16970.0 13529.7 14467.8 0.226555
19 rs111738234 16970.0 13529.7 14467.8 0.226555
20 rs113712674 16970.0 13529.7 14467.8 0.226555

I would like to standardise the values of the XPEHH column. I guess from what I understand I need to divide the values in bins and calculate the z-scores on the separate bins. So in this example for instance if I group by 10 bins I calculate the mean of the first 10 values and their standard deviation and calculate the z-score of those 10 SNPs based on the mean and SD previously calculated. The following is again done on the following 10 values/SNPs. Hope that is a bit clearer. Yes, I meant scale(mean(valuesB)) If I do only df$zscores<-scale(df$XPEHH) it calculates the zscore globally, I want to calculates the zscores on the separate bins.

ADD REPLY • link 5.4 years ago by paolo002 ▴ 160

0

Entering edit mode

Thank you very much, I think I can see this is most likely what I need because in the bin column it is grouping SNPs with similar values and then it calculates the zscore. Thanks a lot.

ADD REPLY • link 5.4 years ago by paolo002 ▴ 160

0

Entering edit mode

Please use "add comment" link to add comments, at the moment you are posting into "Add your answer" box which is only for Answers. If the answer worked for you, please consider to accept/upvote, so we can have your question as "resolved".

ADD REPLY • link 5.4 years ago by zx8754 11k

score 1 · Answer 1 · 2018-11-21

Try this exampe, see if that is what you need:

library(dplyr)

n_bins = 2

#example data
df <- read.table(text = "      Location  iHH_A1  iHH_B1  iHH_P1    XPEHH
1  rs117712813 16204.6 12744.9 13735.6 0.240170
2  rs186953131 16276.3 13073.5 13948.8 0.219126
3    10:127358 16276.3 13073.5 13948.8 0.219126
4    10:127758 16276.3 13073.5 13948.8 0.219126
5   rs78617290 16276.3 12984.2 13895.4 0.225974
6  rs111769723 16276.3 13006.1 13907.8 0.224292
7    rs9419541 15830.9 12618.6 13501.5 0.226793
8    rs2928098 14876.2 11812.6 12623.9 0.230596
9  rs113714579 15722.9 12533.7 13405.7 0.226698
10   rs2928100 15398.9 12197.0 13079.0 0.233101
11   10:128988 16814.9 13470.8 14409.6 0.221744
12   10:129015 16814.9 13470.8 14409.6 0.221744
13   rs9419543 16771.4 13330.5 14294.4 0.229620
14 rs117825552 16898.4 13300.9 14311.2 0.239391
15 rs111736399 16970.0 13577.1 14505.8 0.223062
16   10:129400 16970.0 13494.7 14448.5 0.229148
17 rs144184182 16970.0 13529.7 14467.8 0.226555
18 rs189429487 16970.0 13529.7 14467.8 0.226555
19 rs111738234 16970.0 13529.7 14467.8 0.226555
20 rs113712674 16970.0 13529.7 14467.8 0.226555", header = TRUE)

res <- df %>%
  mutate(bin = ntile(XPEHH, n_bins)) %>% 
  group_by(bin) %>%
  mutate(zscore = scale(XPEHH),
         outlier = abs(zscore) > 1.7) %>% 
  ungroup()

res
# # A tibble: 20 x 8
#    Location    iHH_A1 iHH_B1 iHH_P1 XPEHH   bin  zscore outlier
#    <fct>        <dbl>  <dbl>  <dbl> <dbl> <int>   <dbl> <lgl>  
#  1 rs117712813 16205. 12745. 13736. 0.240     2  1.80   TRUE   
#  2 rs186953131 16276. 13074. 13949. 0.219     1 -1.19   FALSE  
#  3 10:127358   16276. 13074. 13949. 0.219     1 -1.19   FALSE  
#  4 10:127758   16276. 13074. 13949. 0.219     1 -1.19   FALSE  
#  5 rs78617290  16276. 12984. 13895. 0.226     1  1.07   FALSE  
#  6 rs111769723 16276. 13006. 13908. 0.224     1  0.513  FALSE  
#  7 rs9419541   15831. 12619. 13502. 0.227     2 -0.788  FALSE  
#  8 rs2928098   14876. 11813. 12624. 0.231     2 -0.0517 FALSE  
#  9 rs113714579 15723. 12534. 13406. 0.227     2 -0.807  FALSE  
# 10 rs2928100   15399. 12197  13079  0.233     2  0.433  FALSE  
# 11 10:128988   16815. 13471. 14410. 0.222     1 -0.324  FALSE  
# 12 10:129015   16815. 13471. 14410. 0.222     1 -0.324  FALSE  
# 13 rs9419543   16771. 13330. 14294. 0.230     2 -0.241  FALSE  
# 14 rs117825552 16898. 13301. 14311. 0.239     2  1.65   FALSE  
# 15 rs111736399 16970  13577. 14506. 0.223     1  0.109  FALSE  
# 16 10:129400   16970  13495. 14448. 0.229     2 -0.332  FALSE  
# 17 rs144184182 16970  13530. 14468. 0.227     1  1.26   FALSE  
# 18 rs189429487 16970  13530. 14468. 0.227     1  1.26   FALSE  
# 19 rs111738234 16970  13530. 14468. 0.227     2 -0.834  FALSE  
# 20 rs113712674 16970  13530. 14468. 0.227     2 -0.834  FALSE