R packages, where they come from and where they go

r-packages
infrastructure
Author

Samuel Colin

Published

August 13, 2024

This is the first article of a series on R package installation.

I work in a relatively R-centric environment. The vast majority of our analyses and, maybe more surprisingly for some, of our ETL pipelines, are written in R.

Code development, testing, modularization and execution are done with a variety of tools such as:

The team I work in is in charge of maintaining and developing these tools. At the start, we simply used the standard R machinery for package management (install.packages() and user libraries), but over time we shifted toward more intentional management of the R execution environment.

The current challenge

Recently, we identified a pain point in our workflow: our container build time (for ETL and unit tests) is excessive, at about 40 minutes. The main reason is the time spent installing dependencies. A single package such as duckdb can take several minutes to install, and if you require several of these, long build times easily become the rule rather than the exception.

To speed this up, I started reading about R package installation. A very good entry point for my use case was the Rocker Project guide on extending images. There, and in various other places, I stumbled across tools, terms and concepts such as pak, r2u, bspm and install2.r. As a seasoned R user, I had already encountered some of these before, but without feeling the need to dive deeper.

Feeling a bit lost in all this, I wrote a Mastodon post, asking for a guide explaining these terms. Given the lack of success, I decided to write one myself :)

Some background

Before explaining what these words mean, I need to present several concepts around R package management. If you are new to R, you definitely don’t need to know all this. But as you work with the language and its ecosystem, you get to the point where you ask yourself questions such as “How can I make my analysis reproducible?”, “How can I have multiple package versions on my computer at the same time?” or “How do I ensure that all users use the same package versions across the entire organisation?”. At that point, a good understanding of these topics certainly helps.

Package distribution forms

R packages come in two forms 1:

  • Precompiled binary
  • Source code

Binary packages are compressed folders with OS-specific, efficient and sometimes compiled code (such as C code). Most of the time, they are the preferred form of package, as their installation is relatively quick and does not require much, apart from some platform-specific dependencies 2.

Source packages, by contrast, are archives that contain the “raw” R code. Installing them requires more work and dedicated tools, such as the r-base-dev system package on Ubuntu or Rtools on Windows. On the plus side, they can (in theory) run more efficiently and are universal (as long as they don’t rely on OS-specific dependencies). And sometimes, this is the only package format available 3.
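To make the distinction concrete, here is a minimal sketch of how the two forms can be requested through the type argument of install.packages(). The helper install_as() is hypothetical (it is not part of base R), and the curl package is only used as an illustration.

```r
# Sketch only: install_as() is a hypothetical helper, not part of base R.

# Distribution form this R installation asks for by default.
# Typically "source" on Linux; CRAN builds of R for Windows and macOS
# usually default to "both".
default_type <- getOption("pkgType")
cat("Default package type:", default_type, "\n")

install_as <- function(pkg, form = c("binary", "source"), ...) {
  form <- match.arg(form)
  # type = "binary" only works where binaries exist for the platform
  # (Windows, macOS); on Linux install.packages() refuses it.
  utils::install.packages(pkg, type = form, ...)
}

# Usage (not run here, as it needs network access):
# install_as("curl", "binary")  # quick, no compiler needed
# install_as("curl", "source")  # needs a build toolchain and the libcurl headers
```

Timing both calls on the same machine is a simple way to see how much of a build is spent compiling.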

Package repositories

Packages are accessible from different sources, which I will call repositories to avoid confusion with source code.

The best-known repository in the R world is CRAN. It is configured by default in R, and packages must pass a series of checks before being accepted. When you use install.packages() without further specification, this is the package repository you use. There are, however, other repositories, such as:

All these repositories offer almost all their packages in source form. CRAN typically also provides binaries for Windows and macOS, but not for Linux, while the Posit Package Manager (PPM) offers binaries for a selection of Linux distributions.
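As an illustration, the repositories that install.packages() consults are controlled by the repos option. The sketch below adds a PPM repository next to CRAN for the duration of a session. The URLs are assumptions for the example: the PPM one targets Ubuntu "jammy" and has to be adapted to your distribution. Depending on the setup, PPM may need further configuration (for instance of R's HTTP user agent) before it serves Linux binaries; that is not shown here.

```r
# Assumed example URLs; adjust the distribution codename ("jammy") to your system.
repos_ppm_first <- c(
  PPM  = "https://packagemanager.posit.co/cran/__linux__/jammy/latest",
  CRAN = "https://cloud.r-project.org"
)

# options() returns the previous values, so they can be restored afterwards.
old <- options(repos = repos_ppm_first)
print(getOption("repos"))

# With this setting, install.packages("duckdb") would look in both
# repositories (not run here, since it needs network access).

options(old)  # restore the previous repository configuration
```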

Libraries

Packages are installed to libraries, which are simply folders in which R looks when it wants to load a package. Calling .libPaths() in an R session tells you which libraries R currently considers.
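A short sketch of what this looks like in practice: the snippet lists the current libraries, then temporarily puts a throwaway library at the front of the search path. The folder name scratch-library is made up for the example.

```r
# Libraries R currently searches, in order of priority.
print(.libPaths())

# Create a throwaway library (hypothetical location inside the session's
# temporary directory) and put it at the front of the search path.
scratch_lib <- file.path(tempdir(), "scratch-library")
dir.create(scratch_lib, showWarnings = FALSE)
old_paths <- .libPaths()
.libPaths(c(scratch_lib, old_paths))

# The new library is now searched first. By default, install.packages()
# also installs into the first element of .libPaths().
first_lib <- .libPaths()[1]
print(first_lib)

# Restore the original search path.
.libPaths(old_paths)
```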

We can distinguish three kinds of (system) libraries 4:

  • The default library, where the base & recommended packages, such as Matrix, are installed. Example: /usr/lib/R/library
  • The user library, where the user-specific packages are installed. Example: /home/samuel/R/x86_64-pc-linux-gnu-library/4.2
  • A site library. This is a centralized place where packages that are neither from the base nor the recommended set can be installed & managed by an administrator and made available to all users. Example location: /usr/lib/R/site-library

On a single-user setup, there may be no site library, as there is no need for a shared multi-user library. On a shared machine, however, a site library can be a way to ensure that everyone uses exactly the same package versions.
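To see which of these libraries exist on a given machine, base R exposes them directly. The following is a minimal sketch; pkg_library() is a hypothetical helper introduced for the example, and the printed paths will differ from one system to the next.

```r
# The three kinds of system libraries, as seen by the current session.
cat("Default library:", .Library, "\n")
cat("Site libraries: ",
    if (length(.Library.site)) .Library.site else "<none>", "\n")
# The user library may be unset, or point to a folder that does not exist yet.
cat("User library:   ", Sys.getenv("R_LIBS_USER"), "\n")

# Hypothetical helper: the library (folder) a package would be loaded from.
pkg_library <- function(pkg) dirname(find.package(pkg))
print(pkg_library("stats"))
```

Calling pkg_library() on a package you installed yourself is a quick way to check whether it ended up in the user library or in a site library.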

Besides the system libraries, there is another type of library: project libraries. We will present them when we discuss the renv package and ignore them for now.

Footnotes

  1. Actually, there are more forms, but these are the two that are relevant for us. See the chapter on package structure of Hadley Wickham & Jennifer Bryan’s excellent book, R Packages, for more details: https://r-pkgs.org/Structure.html. This section is heavily inspired by that chapter.

  2. An example of such dependencies would be libcurl on Linux, which the curl package requires.

  3. In some rare cases, for instance with proprietary R packages (yes, that exists), a package may only be available as a binary. E.g. https://github.com/Teradata/tdplyr

  4. Thanks to Kevin Ushey for the nice explanation: https://rstudio.github.io/renv/articles/renv.html