R Syntax Question, Bioconductor data
3
1
Entering edit mode
5.7 years ago
oars ▴ 200

I’m not sure how familiar folks are with the Bioconductor ">Biobase" datasets but one of the free datasets is called ALL (Acute lymphoblastic leukemia). My goal is to calculate and visualize some basic properties of gene expression levels from the ALL dataset. I’m getting stuck on how to write a line of code that's specifically for patients, and another just for genes. I understand that rows=genes and columns=patients, but I do not have the syntax down to specify patients and genes. For example, I have tried:

> hist(exprs(ALL))

And

> median(exprs(ALL))

[1] 5.469092

I’m not happy with the results and I know I’m doing something wrong, for example…

> hist(exprs(ALL))

…will not calculate the average gene expression levels for each gene. and…

> median(exprs(ALL))

…will not find the average gene expression per gene. Any guidance is super appreciated!

R Bioconductor • 1.5k views
ADD COMMENT
6
Entering edit mode
5.7 years ago

I second ATpoint -- sounds like you need to familiarize yourself with how a matrix is interpreted in R, which is what exprs(ALL) returns.

A matrix is a vector with assigned dimensions, here these dimensions happen to be 2 (rows, columns). So, when you call median(matrix), it will simply take the median of all values within the matrix, ignoring the fact that it has rows and columns. If you want patient- and gene-specific values, you will have to specify that you want the values to be calculated on the rows (for genes) or columns (for patients), as ATpoint has shown.

Here's a toy example to illustrate the point:

> test_mat <- matrix(c(rep(1,5), 1:5,1:5), nrow = 5)
> test_mat
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    2    2
[3,]    1    3    3
[4,]    1    4    4
[5,]    1    5    5

# median of all values
> median(test_mat)
[1] 2

# median for the values of the second column
> median(test_mat[,2])
[1] 3

# median for the values of the fifth row
> median(test_mat[5,])
[1] 5

# median calculated for every row 
> apply(test_mat, 1, median)
[1] 1 2 3 4 5

# median calculated for every column
> apply(test_mat, 2, median)
[1] 1 3 3
ADD COMMENT
0
Entering edit mode

I know from...

>dim(exprs(ALL))

...that my data has 12625 rows (genes) and 128 columns (patients). When I run the following:

apply(exprs(ALL), 1, median)

I'm expecting a single middle number for the entire dataset but instead I'm getting a super long list (below is a sample):

                    1000_at                       1001_at 
                   7.569005                    4.965856 
                  1002_f_at                   1003_s_at 
                   3.909525                    6.021260 
                    1004_at                     1005_at 
                   5.852087                    9.022537 
                    1006_at                   1007_s_at 
                   3.645946                    7.285423 
                  1008_f_at                     1009_at 
                   8.682517                    9.276952 
                   100_g_at                     1010_at

For example,

> median(exprs(ALL))
[1] 5.469092

Also, for a histogram I know that:

hist(exprs(ALL))

...works, however,

hist(exprs(ALL, 1))

for rows (genes), and

hist(exprs(ALL, 2))

for columns (patients) does not work, why?

Also, many thanks for your reply post! Very helpful!!

ADD REPLY
2
Entering edit mode

First, you are misunderstanding apply()

apply(exprs(ALL), 1, median)

Will calculate the median for each gene (row), so you will have a median value for each gene, that is, 12625 median values. And:

apply(exprs(ALL), 2, median)

Will calculate the median for each patient (column), so you will have a median value for each gene, that is, 128 median values

For the histograms, you need to use apply() with the hist call hist() as well. To create an histogram of median patient values:

hist( apply(exprs(ALL), 2, median) )
ADD REPLY
0
Entering edit mode

Many thanks - this was very helpful!

ADD REPLY
1
Entering edit mode

EDIT: The problem was with my eyesight, not your command. Apologies!

Are you even reading the answers? I see no apply in your commands.

ADD REPLY
1
Entering edit mode

You wanted per-gene medians, right? How would that be one number? You'll need to use the apply() within histogram. apply() does not change the dataset it works on. Most commands in most languages operate on read-only copies of their input arguments, so changing an input argument would be unexpected behavior.

ADD REPLY
0
Entering edit mode

I think you are correct, thanks Ram.

So what am I returning with the following command?

> median(exprs(ALL))
[1] 5.469092
ADD REPLY
2
Entering edit mode

the median of all m-by-n numbers. Read Friederike's examples carefully.

ADD REPLY
1
Entering edit mode

What you're getting is called a named list. You can use unname or any relevant tidyverse converter method (tibble::rownames_to_column) comes to mind to get the names as a separate column/row.

ADD REPLY
1
Entering edit mode

I strongly doubt that hist(exprs(ALL, 1)) works. You gotta be a bit more careful here. Commas matter, so do brackets.

Also notice how your two example commands (if the comma is placed elsewhere) return two very different things, which, unsurprisingly, leads hist to do very different things with them.

> str(exprs(ALL),1)
 num [1:12625, 1:128] 7.6 5.05 3.9 5.9 5.93 ...
 - attr(*, "dimnames")=List of 2
> str(exprs(ALL),2)
 num [1:12625, 1:128] 7.6 5.05 3.9 5.9 5.93 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
  ..$ : chr [1:128] "01005" "01010" "03002" "04006" ...
ADD REPLY
0
Entering edit mode

Many thanks Friederike - your examples were very helpful.

ADD REPLY
0
Entering edit mode

you're welcome. good luck with your analyses!

ADD REPLY
3
Entering edit mode
5.7 years ago
ATpoint 81k

How about apply(exprs(ALL), 1, median). apply calls a function (here median), on each row (1) or column (2) of x, where x is exprs(ALL).

ADD COMMENT
1
Entering edit mode
5.7 years ago
Ram 43k

You can use apply, as ATpoint mentions, or doBy::summaryBy(). What this does is summarizes a column partitioning it by values in another column. For example:

R> data
gene    Exp
A    1
A    2
A    3
B    4
B    5
B    6

R> summaryBy(data, Exp~gene, FUN=median)

will give you median Exp per gene

gene    median(Exp)
A    2
B    5

The above is hand-written and untested, so be a little careful.

ADD COMMENT

Login before adding your answer.

Traffic: 2989 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6