Tutorial: Pipes in R: An introduction and some advanced tips and tricks.

3.1 years ago

rpolicastro 13k

Why Pipe?

The magrittr library in R adds the pipe operator %>%, which passes the output of one function to another function as its input. This is conceptually similar to the pipe operator | used in Linux and other Unix-like shells.

To understand how this can make code simpler and easier to follow, let’s take an example using the iris dataset in R. Let’s say that we want to keep only the Species and Petal.Width columns, and then keep only the rows for the setosa species. With pipes, the operation would look like this (using tidyverse).

library("tidyverse")

df <- iris %>%
  select(Petal.Width, Species) %>%
  filter(Species == "setosa")

Without pipes, you could simply perform each operation individually, but this adds some redundancy: you save to an intermediate variable at every step, and you keep passing that same variable back in as the input.

df <- select(iris, Petal.Width, Species)
df <- filter(df, Species == "setosa")

Your other alternative is to nest the functions, which can get unwieldy when nesting multiple functions. The order of operations for the functions can also become difficult to follow.

df <- filter(select(iris, Petal.Width, Species), Species == "setosa")
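As a quick sanity check, the three styles above can be compared directly. This is a minimal sketch that assumes dplyr is installed (it loads dplyr alone instead of the whole tidyverse); the variable names piped, stepwise, and nested are only for illustration.

```r
library(dplyr)  # assumption: dplyr alone supplies %>%, select(), and filter()

# Piped version.
piped <- iris %>%
  select(Petal.Width, Species) %>%
  filter(Species == "setosa")

# Step-by-step version with an intermediate variable.
stepwise <- select(iris, Petal.Width, Species)
stepwise <- filter(stepwise, Species == "setosa")

# Nested version.
nested <- filter(select(iris, Petal.Width, Species), Species == "setosa")

# All three should describe the same data frame.
identical(piped, stepwise)
identical(piped, nested)
```

The pipe only changes how the code reads, not what it computes.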

What is a Pipe?

The syntax %>% looks rather strange, but makes sense in the context of R operators. R has numerous operators that sit between two values, such as + to add the numbers before and after it (1 + 2), or <- to assign the value on the right to the name/symbol on the left (number <- 1). These operators are functions written in infix form, as opposed to the more traditional prefix form, in which the function name is followed by the arguments in parentheses, as in sum(1, 2).

What is fun about R is that you can change these infix functions to prefix form by surrounding them in backticks.

> 1 + 2
[1] 3
> `+`(1, 2)
[1] 3

Since they are really functions that take two inputs, you can also do silly things like reassign the infix functions to do something else.

`+` <- function(x, y) x - y
1 + 2
[1] -1

# Reset your plus sign if you want to try the above code.
# Restarting R will also revert it.
`+` <- base::`+`

R lets you make your own infix functions by giving them a name that starts and ends with %. You may have seen some of these defined in base R as well, such as the commonly used %in% operator.

1 %in% 1:10
[1] TRUE

I’ll make an example infix function that pastes the value on the left to the value on the right using a comma.

`%comma%` <- function(x, y) paste(x, y, sep=",")

"cat" %comma% "dog"
[1] "cat,dog"

`%comma%`("cat", "dog")
[1] "cat,dog"

The magrittr pipe can be thought of as an infix function, and for the most part will behave as an infix function.

1:5 %>% sum
[1] 15

`%>%`(1:5, sum)
[1] 15
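To make the "pipe as an infix function" idea concrete, here is a minimal sketch of a toy pipe. The name %then% is hypothetical, and this is not how magrittr is implemented: it only handles a bare function on the right, whereas magrittr rewrites the right-hand call so that extra arguments and the . placeholder work.

```r
# A toy pipe: call the function on the right with the value on the left.
# %then% is a made-up name for illustration only.
`%then%` <- function(lhs, f) f(lhs)

total <- 1:5 %then% sum

# User-defined %...% operators evaluate left to right, so they chain.
root_total <- c(1, 4, 9) %then% sqrt %then% sum

total
root_total
```

Because the operator is left-associative, the second example first takes the square roots and then sums them.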

How do pipes behave?

Magrittr pipes behave slightly differently depending on how the operation on the right of the pipe is set up.

By default, the magrittr pipe places the output from the left as the first argument of the function on the right.

filter(iris, Species == "setosa")

# is the same as...

iris %>% filter(Species == "setosa")

The output from the left can also be represented as a . on the right, so the above piped example is also equivalent to:

iris %>% filter(., Species == "setosa")

The . representation is handy, because it lets you specify the output on the left as the input to an argument on the right that isn’t the first argument.

lm(Petal.Width ~ Petal.Length, data=iris)

# is the same as...

iris %>% lm(Petal.Width ~ Petal.Length, data=.)

Be warned, though: if the . only appears inside a nested function call on the right, and not as a direct argument of the outer function, the magrittr pipe will still insert the left side as the first argument. (Ignore for now that nesting filter within lm is antithetical to piping.)

iris %>% lm(Petal.Width ~ Petal.Length, data=filter(., Species == "setosa"))

# The code above will throw an error, because the pipe is also adding iris as the first argument. It is evaluated as if you had written:

iris %>% lm(., Petal.Width ~ Petal.Length, data=filter(., Species == "setosa"))

You can get around this problem by surrounding the right side in curly braces (squiggly brackets).

iris %>% {lm(Petal.Width ~ Petal.Length, data=filter(., Species == "setosa"))}
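The same behavior is easier to see on a small vector. The sketch below assumes only that magrittr is installed; the variable names are illustrative.

```r
library(magrittr)

# The dot only appears inside rev(), so the pipe still inserts 1:3 as the
# first argument. This is evaluated as paste(1:3, rev(1:3)).
without_braces <- 1:3 %>% paste(rev(.))

# With braces, nothing is inserted automatically.
# This is evaluated as paste(rev(1:3)).
with_braces <- 1:3 %>% {paste(rev(.))}

without_braces
with_braces
```

The first result pairs each element with its reversed counterpart, while the second contains only the reversed vector converted to character.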

The squiggly brackets turn the right side of the pipe into what is almost an anonymous function, which gives you a lot of flexibility with the . notation.

1:5 %>% {(. + 1) / 2}
[1] 1.0 1.5 2.0 2.5 3.0

We’ll get back to this point later.
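One practical use of the braces is referring to the piped value more than once. A minimal sketch, assuming magrittr is installed and using the built-in iris dataset; r_piped and r_direct are illustrative names.

```r
library(magrittr)

# Inside the braces, the dot can be used as often as needed, and nothing is
# inserted as a first argument.
r_piped <- iris %>% {cor(.$Petal.Width, .$Petal.Length)}

# The same correlation without a pipe, for comparison.
r_direct <- cor(iris$Petal.Width, iris$Petal.Length)

all.equal(r_piped, r_direct)
```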

As a final note, if the piped input is the only argument to the function on the right, you don't even need the parentheses.

1:5 %>% sum
[1] 15

Advanced Pipe tips and tricks.

Here are various tips and tricks when using pipes. Let’s make a matrix that we’ll use for a few of these examples.

mat <- matrix(1:6, ncol=2)

> mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Replacement functions and Setters

An example of a replacement function (or a setter) is the colnames()<- function, which can be used to add column names to our matrix. Traditionally, this would take the following form.

colnames(mat) <- LETTERS[1:2]

R is a bit strange in that this function is actually called colnames<-. Similar to how we could wrap infix functions in backticks to turn them into prefix form, we can do the same with replacement/setter functions to turn them into traditional functions with two arguments.

mat <- matrix(1:6, ncol=2)
`colnames<-`(mat, LETTERS[1:2])

     A B
[1,] 1 4
[2,] 2 5
[3,] 3 6

This means that if we wanted to set the colnames and rownames within a pipe, we are free to do so.

mat %>%
  `colnames<-`(LETTERS[1:2]) %>%
  `rownames<-`(paste0("gene", 1:3))

      A B
gene1 1 4
gene2 2 5
gene3 3 6
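You can also write your own replacement function, which may make the mechanics clearer. The sketch below uses a hypothetical setter called second<- that replaces the second element of a vector; the only requirement I am relying on is that the last argument of a replacement function is named value.

```r
library(magrittr)

# A hypothetical replacement function: the name ends in <- and the final
# argument must be called `value`.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

# Traditional replacement syntax.
v <- c(1, 2, 3)
second(v) <- 10

# Prefix form with backticks.
w <- `second<-`(c(1, 2, 3), 10)

# Prefix form inside a pipe.
p <- c(1, 2, 3) %>% `second<-`(10)

v
```

All three forms call the same function, so v, w, and p end up holding the same vector.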

For popular replacement functions, magrittr also provides some convenience functions. For example, colnames<- is equivalent to magrittr::set_colnames. You can see a complete list in the help documentation for ?magrittr::set_names.
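As a small check, the backtick form and the magrittr aliases can be compared side by side. This sketch assumes magrittr is installed; with_backticks and with_aliases are illustrative names.

```r
library(magrittr)

mat <- matrix(1:6, ncol = 2)

# Replacement functions called in prefix form.
with_backticks <- mat %>%
  `colnames<-`(LETTERS[1:2]) %>%
  `rownames<-`(paste0("gene", 1:3))

# The equivalent magrittr aliases.
with_aliases <- mat %>%
  set_colnames(LETTERS[1:2]) %>%
  set_rownames(paste0("gene", 1:3))

identical(with_backticks, with_aliases)
```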

Infix functions.

Since backticks can be used to turn infix functions into prefix functions, you can use them within pipes.

mat %>% `+`(1)
     [,1] [,2]
[1,]    2    5
[2,]    3    6
[3,]    4    7

In R, even things such as [ and [<- are functions with their own special syntax (the second of which is a replacement function), so they can be used within a pipe in the same way.

mat[, 2] <- 1
> mat
     [,1] [,2]
[1,]    1    1
[2,]    2    1
[3,]    3    1

# is the same as...

mat <- matrix(1:6, ncol=2)
`[<-`(mat, , 2, value=1)

# is the same as...

mat %>% `[<-`(, 2, value=1)

Magrittr provides convenience aliases for many of these infix functions, which you can find in the help documentation for ?magrittr::set_names. For example, + would be add().
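A few of these aliases in action, as a minimal sketch assuming magrittr is installed. The result names are illustrative, and extract() is called with the magrittr:: prefix because tidyr exports a different function with the same name.

```r
library(magrittr)

mat <- matrix(1:6, ncol = 2)

# add() is an alias for `+`.
plus_one <- mat %>% add(1)

# multiply_by() is an alias for `*`, so this computes mat * 2 + 1.
scaled <- mat %>% multiply_by(2) %>% add(1)

# extract() is an alias for `[`; the empty first index selects all rows.
second_col <- mat %>% magrittr::extract(, 2)

scaled
```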

Anonymous functions.

We talked about it earlier, but remember that squiggly brackets are similar to anonymous functions.

mat %>% {m <- .; m <- m * 2; m <- m + 1} %>% print

     [,1] [,2]
[1,]    3    9
[2,]    5   11
[3,]    7   13

You can also wrap a function definition in parentheses to pipe into it as an anonymous function.

mat %>% (function(x) {x <- x * 2; x <- x + 1}) %>% print

Making functions with pipes.

A . on the leftmost part of a pipe chain will turn that pipe chain into a function, which you can assign to a name and reuse.

paster <- . %>% paste("cat") %>% paste("dog")

> paster("bob")
[1] "bob cat dog"

The above would be equivalent to:

paster <- function(x) {x <- paste(x, "cat"); paste(x, "dog")}

paster("bob")
[1] "bob cat dog"
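Because the result is an ordinary function, it can be passed to functions such as sapply. A minimal sketch, assuming magrittr is installed; mean_log2 and the input list are made up for illustration.

```r
library(magrittr)

# A pipe chain starting with a dot becomes a reusable function.
mean_log2 <- . %>% log2 %>% mean

# Call it directly...
single <- mean_log2(c(2, 8))

# ...or hand it to an apply-style function.
per_sample <- sapply(list(a = c(2, 8), b = c(1, 4, 16)), mean_log2)

single
per_sample
```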

Pipes in packages.

If you want to use pipes in an R package, you need to run usethis::use_pipe(), which will perform a few tasks to make piping available within your functions.
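For orientation, the helper essentially adds magrittr to the Imports field of your DESCRIPTION and sets up roxygen2 tags that import (and optionally re-export) the pipe. The fragment below is a hand-written sketch of that re-export pattern, not the exact file usethis generates; the file name in the comment is an assumption.

```r
# Run once from the package root (assumes usethis is installed):
# usethis::use_pipe()

# A hand-written equivalent, e.g. in R/utils-pipe.R (file name assumed).
# roxygen2 turns these tags into NAMESPACE entries when you document the
# package, and magrittr must also be listed under Imports in DESCRIPTION.

#' Pipe operator
#'
#' @importFrom magrittr %>%
#' @export
magrittr::`%>%`
```

If you only need the pipe internally and do not want to re-export it to your users, the @importFrom tag on its own is enough.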

Synthesizing your knowledge.

Now that you know how pipes work, using them should be easy! Let’s take our bare-bones matrix from before and make a nice little plot from it.

First, let’s give the matrix colnames (samples), and rownames (genes). We’ll then log2 transform all of the values.

mat <- matrix(1:6, ncol=2)

mat <- mat %>%
  magrittr::set_colnames(LETTERS[1:2]) %>%
  magrittr::set_rownames(paste0("gene", 1:3)) %>%
  log2

We’ll then convert the matrix into a tibble (a tidyverse data.frame), convert it to long format, and then make a bar plot using ggplot2.

mat %>%
  as_tibble(rownames="gene_name") %>%
  pivot_longer(!gene_name, names_to="sample", values_to="count") %>%
  ggplot(aes(x=gene_name, y=count, fill=sample)) +
    geom_col(position="dodge")

[Figure: bar plot of the log2-transformed count for each gene, with bars dodged and filled by sample.]

Closing statements

If you want to learn more about pipes, check out the Pipes chapter from R for Data Science. If you want to learn more about functions, such as infix and replacement functions, read the Functions chapter from Advanced R.

Although pipes are really handy, don’t make your entire script one giant pipe. Try to limit pipes to “task units” of 10 or fewer piped functions. Look for natural break points that separate your code into logical chunks based on the overall task of each chunk. For example, you might have one chunk of piped code to import a tsv and ensure proper formatting, another chunk to merge that tsv data into an existing data frame and compute some aggregate statistics, and a third chunk to format the data for plotting and then plot it.
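As a rough illustration of task units, the sketch below splits a small analysis of the built-in iris dataset into two short pipes instead of one long one. It assumes dplyr is installed; the variable names and the particular steps are made up for the example.

```r
library(dplyr)  # assumption: dplyr supplies %>%, select(), filter(), group_by(), summarise()

# Task unit 1: subset and clean the input.
petals <- iris %>%
  select(Species, Petal.Width, Petal.Length) %>%
  filter(Petal.Width > 0)

# Task unit 2: aggregate statistics per species.
petal_summary <- petals %>%
  group_by(Species) %>%
  summarise(
    mean_width = mean(Petal.Width),
    mean_length = mean(Petal.Length)
  )

petal_summary
```

Saving the result of each unit gives you a named checkpoint that you can inspect while debugging.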

As of publication of this tutorial, the magrittr pipe is the de facto standard for piping in R. R is actually planning on adding a native pipe |>, which is available in the development version of R.
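For reference, the native pipe was released in R 4.1.0, so the sketch below assumes that version or later and will not parse on the R 4.0.3 session used for this tutorial. Unlike the magrittr pipe, the right side must be written as a call with parentheses.

```r
# Requires R >= 4.1.0; no packages needed.
total <- 1:5 |> sum()

# Chaining works the same way as with %>%.
n_setosa <- iris |>
  subset(Species == "setosa") |>
  nrow()

total
n_setosa
```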

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.1  forcats_0.5.1   stringr_1.4.0   dplyr_1.0.5    
 [5] purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.0   
 [9] ggplot2_3.3.3   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       cellranger_1.1.0 pillar_1.5.1     compiler_4.0.3  
 [5] dbplyr_2.1.0     tools_4.0.3      jsonlite_1.7.2   lubridate_1.7.10
 [9] lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.10    
[13] reprex_1.0.0     cli_2.3.1        rstudioapi_0.13  DBI_1.1.1       
[17] haven_2.3.1      withr_2.4.1      xml2_1.3.2       httr_1.4.2      
[21] fs_1.5.0         generics_0.1.0   vctrs_0.3.6      hms_1.0.0       
[25] grid_4.0.3       tidyselect_1.1.0 glue_1.4.2       R6_2.5.0        
[29] fansi_0.4.2      readxl_1.3.1     modelr_0.1.8     ps_1.6.0        
[33] backports_1.2.1  scales_1.1.1     ellipsis_0.3.1   rvest_1.0.0     
[37] assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1       stringi_1.5.3   
[41] munsell_0.5.0    broom_0.7.5      crayon_1.4.1
