Question

Creating binary matrix of absence/presence of genes in multiple column lists

5.1 years ago

lmu ▴ 20

Hello,

I'm very new to R, and I have a large data set of the presence/absence of enriched genes in the form of lists of gene names for 10 different cell lines. I am trying to display the overlap in significantly enriched genes across the different cell lines using the UpSetR package.

Currently my data looks like this, except that each list is between 200 and 900 genes long:

[Image: current data format]

I would like to convert this into a binary matrix that indicates the presence/absence of each unique gene from the overall set in each individual cell line list:

[Image: desired data format]

I have been able to compile a reference column of all the unique genes across the three lists. However, I am now stuck on how to use that reference list to convert the individual gene lists into a binary format, and was wondering if someone could point me in the right direction.

Many thanks in advance for any help!

upsetr R • 4.7k views

Hi there, have you found a solution for this task? I am in the same situation.

Reply · 12 months ago by Tanya ▴ 10

score 1 · Answer 1 · 2019-03-09

> df = data.frame(
+     line1 = sample(LETTERS, 5),
+     line2 = sample(LETTERS, 5),
+     line3 = sample(LETTERS, 5)
+ )
> df
  line1 line2 line3
1     P     T     P
2     B     H     A
3     A     F     C
4     K     P     N
5     G     L     Y
> library(dplyr)
> library(tidyr)
> library(tibble)
> as.data.frame(t(df)) %>%
+     rownames_to_column(var = "Gene") %>%
+     gather(value,variable,-Gene) %>%
+     spread(variable,value)  %>%
+     mutate_at(vars(-Gene), ~ ifelse(is.na(.), 0, 1))
   Gene A B C F G H K L N P T Y
1 line1 1 1 0 0 1 0 1 0 0 1 0 0
2 line2 0 0 0 1 0 1 0 1 0 1 1 0
3 line3 1 0 1 0 0 0 0 0 1 1 0 1
Warning message:
attributes are not identical across measure variables;
they will be dropped

score 1 · Answer 2 · 2023-04-03

cpad already answered it, but here's an answer using more contemporary tidyverse syntax.

Example data:

df <- structure(list(LINE1 = c("A", "B", "C", NA, NA), LINE2 = c("B", 
"D", "E", "F", NA), LINE3 = c("C", "D", "G", "H", "I")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))

# A tibble: 5 × 3
  LINE1 LINE2 LINE3
  <chr> <chr> <chr>
1 A     B     C    
2 B     D     D    
3 C     E     G    
4 NA    F     H    
5 NA    NA    I

Tidyverse solution.

library("tidyr")
library("dplyr")

df |>
  pivot_longer(everything()) |>
  drop_na() |>
  mutate(binary=1L) |>
  pivot_wider(names_from=value, values_from=binary, values_fill=0L)

# A tibble: 3 × 10
  name      A     B     C     D     E     G     F     H     I
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 LINE1     1     1     1     0     0     0     0     0     0
2 LINE2     0     1     0     1     1     0     1     0     0
3 LINE3     0     0     1     1     0     1     0     1     1
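
Note that this result has one row per cell line. If a layout with one row per gene and one 0/1 column per cell line is wanted (the orientation I believe UpSetR's `upset()` expects), the table can be transposed. Below is a minimal base R sketch; `wide` and `genes_by_line` are names chosen here for illustration, and `wide` is simply the output above retyped by hand so the example is self-contained.

```r
# The pivot_wider() result shown above, retyped by hand (illustrative only)
wide <- data.frame(
  name = c("LINE1", "LINE2", "LINE3"),
  A = c(1L, 0L, 0L), B = c(1L, 1L, 0L), C = c(1L, 0L, 1L),
  D = c(0L, 1L, 1L), E = c(0L, 1L, 0L), G = c(0L, 0L, 1L),
  F = c(0L, 1L, 0L), H = c(0L, 0L, 1L), I = c(0L, 0L, 1L)
)

# Drop the label column and transpose so that genes become rows
genes_by_line <- as.data.frame(t(as.matrix(wide[, -1])))

# Restore the cell line names as column names
colnames(genes_by_line) <- wide$name

genes_by_line
```

The row names of `genes_by_line` are the gene names, and each column is one cell line.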

score 1 · Answer 3 · 2023-04-20

Using base R's stack() and table() (example data is from rpolicastro's answer):

as.data.frame.matrix(table(stack(df)))
#   LINE1 LINE2 LINE3
# A     1     0     0
# B     1     1     0
# C     1     0     1
# D     0     1     1
# E     0     1     0
# F     0     1     0
# G     0     0     1
# H     0     0     1
# I     0     0     1
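
For completeness, the approach described in the question (build a reference vector of all unique genes, then test each cell line's list against it) can also be sketched in base R with `%in%`. This is only a sketch under assumptions: the list `gene_lists` and its contents are made up for illustration, and it assumes the gene lists are available as character vectors of possibly unequal length.

```r
# Hypothetical gene lists of unequal length, one character vector per cell line
gene_lists <- list(
  LINE1 = c("A", "B", "C"),
  LINE2 = c("B", "D", "E", "F"),
  LINE3 = c("C", "D", "G", "H", "I")
)

# Reference vector: every unique gene across all lists, sorted
all_genes <- sort(unique(unlist(gene_lists)))

# For each cell line, mark which reference genes it contains (TRUE/FALSE -> 1/0)
binary <- sapply(gene_lists, function(genes) as.integer(all_genes %in% genes))
rownames(binary) <- all_genes

# Data frame with one row per gene and one 0/1 column per cell line
binary_df <- as.data.frame(binary)
binary_df
```

As far as I know, UpSetR also provides a `fromList()` helper that builds a similar 0/1 table from a named list, and `upset()` accepts a data frame of 0/1 columns like `binary_df`; it is worth checking the package documentation for the exact arguments.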