<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
  <title>Posts on B101nfo</title>
  <link>https://llrs.dev/post/</link>
  <description>Recent content in Posts on B101nfo</description>
  <generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<copyright>If it is code, you can copy and reuse it (MIT); if it is text, please cite and reuse it (CC-BY {year}).</copyright><atom:link href="https://llrs.dev/post/index.xml" rel="self" type="application/rss+xml" />
<item>
  <title>Packaging R: getting in repositories</title>
  <link>https://llrs.dev/post/2024/05/05/packaging-r-getting-in/</link>
  <pubDate>Sun, 05 May 2024 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2024/05/05/packaging-r-getting-in/</guid>
  <description>


&lt;p&gt;After the previous post collecting information about repositories, I want to gather here my thoughts on adding a package to a repository and how repositories are recognized.
As in the previous post, this is built on the assumption that one already has one or more packages and wants to distribute them.&lt;/p&gt;
&lt;p&gt;This is meant as a reflection on what an R repository is, and it is not intended as a guide for R package developers.
However, their feedback is appreciated to help shape what an ideal repository would be.&lt;/p&gt;
&lt;div id=&#34;package-submission&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Package submission&lt;/h2&gt;
&lt;p&gt;An R repository will have a way to incorporate a package.
The CRAN submission process starts with &lt;a href=&#34;https://cran.r-project.org/submit.html&#34;&gt;a form&lt;/a&gt;, while Bioconductor’s is handled through a &lt;a href=&#34;https://github.com/Bioconductor/Contributions/issues/new&#34;&gt;GitHub issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The process will usually then start with an automated check.
Until that check has passed, probably no one will look into the package submission.
This reduces the hours a human must dedicate to managing submissions.
If a human is kept in the loop, one can appeal the automatic process by contacting them or, if the failure is spurious, by resubmitting the package.&lt;/p&gt;
&lt;div class=&#34;float&#34;&gt;
&lt;img src=&#34;images/submissions.png&#34; alt=&#34;Package submission checks: first the package itself is checked; if it is not new, the repository runs a dependency check; if all checks pass, the package is added to the repository.&#34; /&gt;
&lt;div class=&#34;figcaption&#34;&gt;&lt;strong&gt;Package submission checks&lt;/strong&gt;: first the package itself is checked; if it is not new, the repository runs a dependency check; if all checks pass, the package is added to the repository.&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Generally a package must first pass a package quality check before it is considered for further integration tests.
These integration tests usually check the new version of a package against the packages that depend on it, also known as its reverse dependencies.&lt;/p&gt;
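&lt;p&gt;As a sketch of what such a check involves, base R already provides tooling to check a package together with its reverse dependencies (the directory path below is a placeholder for a folder containing the submitted source tarball):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch with base R tools; &amp;quot;~/pkg-dir&amp;quot; is a placeholder directory
# holding the source tarball(s) to check.
tools::check_packages_in_dir(&amp;quot;~/pkg-dir&amp;quot;,
                             reverse = list(which = &amp;quot;most&amp;quot;))
tools::summarize_check_packages_in_dir_results(&amp;quot;~/pkg-dir&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Repositories run something along these lines, at scale, for every submission.&lt;/p&gt;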
&lt;/div&gt;
&lt;div id=&#34;package-maintenance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Package maintenance&lt;/h2&gt;
&lt;p&gt;Once a package is included in a repository it usually needs to be maintained.&lt;/p&gt;
&lt;p&gt;There are many moving pieces: chip architectures, operating systems, R itself, and other packages.
All this means that authors need to keep their packages in good shape if they want them to remain useful to users.
Of course, if one doesn’t want to do that, they do not need a repository to share their package.&lt;/p&gt;
&lt;div class=&#34;float&#34;&gt;
&lt;img src=&#34;images/checks.png&#34; alt=&#34;Graphic showing time and different R versions and checks. Repositories check the packages on them on multiple levels.&#34; /&gt;
&lt;div class=&#34;figcaption&#34;&gt;&lt;strong&gt;Graphic showing time and different R versions and checks.&lt;/strong&gt; Repositories check the packages on them on multiple levels.&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This means that at any given time point a given package is tested under several different conditions, as shown in image 2.
It also means a package can be archived, that is removed from the repository, for failing the checks in place.&lt;/p&gt;
&lt;p&gt;Repositories provide these checks as a service to their users.
They guarantee that the R packages in the repository work well together and (mostly) pass the same set of checks.
This is what builds their reputation and usage among users (and this is true beyond R: Debian, Ubuntu, …).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;closing-remarks&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Closing remarks&lt;/h2&gt;
&lt;p&gt;There are several official repositories; how package submission works when a package is submitted to one repository but is related, via dependencies, to other repositories is a matter for another post.&lt;/p&gt;
&lt;p&gt;There is some discussion on what an R repository is.
The importance of CRAN and Bioconductor has led to some confusion.
There are generally two meanings of what a CRAN-like repository is:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;One where &lt;code&gt;install.packages()&lt;/code&gt; works (this is defined by how the files and binaries are organized and will be a theme for another time).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One where all the checks described here are in place and &lt;code&gt;install.packages()&lt;/code&gt; works too.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
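&lt;p&gt;The first definition can be tried directly from R: any URL that serves the expected index and package files works with &lt;code&gt;install.packages()&lt;/code&gt;. The URL below is a placeholder, not a real repository:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;repo &amp;lt;- &amp;quot;https://example.com/cran-like&amp;quot;  # placeholder URL
# List what the repository offers:
available.packages(repos = repo)
# Install a package from it:
install.packages(&amp;quot;somepackage&amp;quot;, repos = repo)&lt;/code&gt;&lt;/pre&gt;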
&lt;p&gt;r-universe uses the first definition but could be used to generate repositories with checks that comply with the second definition.
Other repositories that use the first definition are the &lt;a href=&#34;https://packagemanager.posit.co/client/#/&#34;&gt;&lt;em&gt;Posit&lt;/em&gt; Public &lt;em&gt;Package Manager&lt;/em&gt;&lt;/a&gt; and the &lt;a href=&#34;https://r4pi.org/&#34;&gt;R4Pi repository&lt;/a&gt; (which provides binaries for Raspberry Pi OS).&lt;/p&gt;
&lt;p&gt;As the second definition is stricter, it is the one this post has focused on.&lt;/p&gt;
&lt;p&gt;PS: This post might be edited, as it has been sitting on my computer for several months.
I prefer to post it and improve it with feedback, so let me know if you have any additions.&lt;/p&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>Future FOSS contributions</title>
  <link>https://llrs.dev/post/2024/04/02/foss-contributions/</link>
  <pubDate>Tue, 02 Apr 2024 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2024/04/02/foss-contributions/</guid>
  <description>


&lt;p&gt;I write this post after several weeks/months of considering why contributing to the FOSS community, mostly the R community, has come to feel like a burden.
I want to make public some rules I set for myself, for future reference and in case they help others.&lt;/p&gt;
&lt;div id=&#34;how-i-end-up-here&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How I ended up here&lt;/h1&gt;
&lt;p&gt;I use R for my work, and I realized I could also use it for my own interests.
With the creative freedom of a programming language I know well, I could answer questions I had and help others in the process.&lt;/p&gt;
&lt;p&gt;As the questions I answered became more general or more community centered, I got involved in more meetings and working groups.
Some decisions/commitments were carefully considered and known to be temporary; others were/are more open ended, with an unclear end.&lt;/p&gt;
&lt;div class=&#34;float&#34;&gt;
&lt;img src=&#34;images/rataplan.JPG&#34; style=&#34;width:30.0%;height:30.0%&#34; alt=&#34;An image of Rataplan, the dog from Lucky Luke.&#34; /&gt;
&lt;div class=&#34;figcaption&#34;&gt;An image of Rataplan, the dog from Lucky Luke.&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;current-situation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Current situation&lt;/h1&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I often participate in the &lt;a href=&#34;https://contributor.r-project.org/working-group&#34; title=&#34;RCWG webpage&#34;&gt;R Contributors working group (RCWG)&lt;/a&gt;, not only in meetings but in other activities and tasks.&lt;/li&gt;
&lt;li&gt;I am a member of the &lt;a href=&#34;https://bioconductor.org&#34;&gt;Bioconductor&lt;/a&gt; community, helping in the forum from time to time&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I am a member of the Bioconductor code of conduct committee.&lt;/li&gt;
&lt;li&gt;I am a member of the &lt;a href=&#34;https://ropensci.org&#34;&gt;rOpenSci community&lt;/a&gt;, helping in the Slack and in the forum when the questions are related to the packages I have in their organization.&lt;/li&gt;
&lt;li&gt;I am a member of the &lt;a href=&#34;https://github.com/RConsortium/r-repositories-wg&#34; title=&#34;RRepoWG github repository&#34;&gt;R repository working group (RRepoWG) from the R Foundation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I started the Bioconductor Classes and Methods Working Group (although there hasn’t been much activity in the last year and a half).&lt;/li&gt;
&lt;li&gt;I co-organized my country’s local R conference in 2023.&lt;/li&gt;
&lt;li&gt;I co-organized my city’s local R user group last year (2-3 events), I’ve been trying to restart it this year (2024), and I have contributed to other RUGs.&lt;/li&gt;
&lt;li&gt;I maintain 3-4 packages on CRAN and Bioconductor&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I am sure there are several people doing more, but after reflection I came to the conclusion that this is not sustainable/worth it for me.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reasons-to-keep-giving&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reasons to keep giving&lt;/h1&gt;
&lt;p&gt;From now on when I contribute something there are 3 reasons I will keep in mind:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;It is part of my &lt;strong&gt;work&lt;/strong&gt; or related to it.&lt;/p&gt;
&lt;p&gt;When I go to these working groups I do not represent my employer or any community.
I am not paid for anything on the previous list, and I need to make up the hours when I have meetings during working hours.&lt;/p&gt;
&lt;p&gt;Here I also include contributions to something that might help my career.
This includes gig jobs or consulting, which I am open to doing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is &lt;strong&gt;fun&lt;/strong&gt;/rewarding.&lt;/p&gt;
&lt;p&gt;These contributions might interest me because I find them fun, for example plotting a simple tree with ggplot2.&lt;/p&gt;
&lt;p&gt;Or because it is rewarding to help someone solve their problems, such as helping a family member claim her wages for overtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I &lt;strong&gt;learn&lt;/strong&gt; something.&lt;/p&gt;
&lt;p&gt;I don’t like learning new things just for the sake of learning.
But I enjoy learning something that could be useful: a technology, a solution, a community or new data I have never analyzed.
This might be for my personal interest or work related: recently, having learned how to parse HTML/XML for a hobby project helped me do in 5 minutes a simple task at work that would have taken my boss half an hour or more.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If I see a project/proposal doesn’t fit any of these three, I will stop contributing/maintaining.
I’ll try to avoid commitments that I think should be done, and that I could step up and do, but that do not fit these three rules.
This includes contributing to books, mentoring, being the glue between different communities, or simply sending a PR.&lt;/p&gt;
&lt;p&gt;As my time is stretched thinner by commitments away from the keyboard, I feel torn between contributing more effectively and stopping.
Each hour I spend in a meeting that could have been an email is 2 hours or more that I lose: not only the opportunity cost, but also the motivation&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; and the time I spent preparing the meeting.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;future-contributions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Future contributions&lt;/h1&gt;
&lt;p&gt;Aside from these rules and some previous commitments, which I will see through, I will no longer prioritize what is good for the community over what is good for me.
If the two overlap, great; if they don’t, I’m sorry.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;images/ikigai.png&#34; style=&#34;width:30.0%;height:30.0%&#34; alt=&#34;ikigai scheme from Wikipedia&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you now know, you can appeal to any of the three motivations: ask me something I might find fun/rewarding, ask me to learn something I could use, or simply provide payment or a way forward for my career.&lt;/p&gt;
&lt;p&gt;I am open to consulting or developing something; you can contact me using the &lt;a href=&#34;mailto:lluis.revilla@gmail.com&#34; class=&#34;mailito&#34;&gt;email on the blog&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes footnotes-end-of-document&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;I also receive answers and help when I occasionally ask too.
I also benefit from many questions on online forums (although those I search for are now rarely about R).&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;There are many more only on GitHub or in some other communities.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;Which, frankly, has lately been very low.
I won’t get into details.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>New rtweet release: 2.0.0</title>
  <link>https://llrs.dev/post/2024/02/16/new-rtweet-release-2-0-0/</link>
  <pubDate>Fri, 16 Feb 2024 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2024/02/16/new-rtweet-release-2-0-0/</guid>
  <description>


&lt;p&gt;This is a brief announcement of rtweet version 2.0.0.
This major version change signals the move from API v1.1 to API v2.&lt;/p&gt;
&lt;p&gt;There haven’t been many changes since 1.2.1, but this release signals that API v1.1 is deprecated.&lt;/p&gt;
&lt;p&gt;The previous release was a bit rushed to meet the requirements of the CRAN maintainers to fix an error, and it wasn’t polished.
Some users complained that it was difficult to find out what worked.
In this release I focused mostly on making life easier for users:&lt;/p&gt;
&lt;p&gt;Now there is a document mapping the deprecated functions from API v1.1 to API v2: see &lt;code&gt;help(&#34;rtweet-deprecated&#34;, &#34;rtweet&#34;)&lt;/code&gt;.
I also made it easier for rtweet to work with API v2: the release of httr2 version 1.0.0 helped avoid some workarounds in the authentication process.&lt;/p&gt;
&lt;p&gt;I also focused on updating the vignettes to the most up-to-date recommendations.
I am not sure the streaming vignette is up to date (but keep reading to see why I left it as is).&lt;/p&gt;
&lt;p&gt;Last, following CRAN policy: if users create rtweet data they can now delete it with &lt;code&gt;client_clean()&lt;/code&gt; and &lt;code&gt;auth_clean()&lt;/code&gt;.&lt;/p&gt;
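&lt;p&gt;For instance, a user who previously saved rtweet credentials could remove them like this (a sketch; see the package documentation for the exact behaviour):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;rtweet&amp;quot;)
auth_clean()    # delete saved authentication tokens
client_clean()  # delete saved clients&lt;/code&gt;&lt;/pre&gt;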
&lt;div id=&#34;future-releases&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Future releases&lt;/h1&gt;
&lt;p&gt;For the last year I &lt;a href=&#34;https://github.com/ropensci/rtweet/issues/763&#34;&gt;asked the community&lt;/a&gt; for a co-maintainer with interest in the package.
Unfortunately, the people who showed some interest didn’t commit to it in the end.&lt;/p&gt;
&lt;p&gt;At the same time I &lt;a href=&#34;https://llrs.dev/post/2023/02/16/rtweet-future/&#34;&gt;also asked&lt;/a&gt; for &lt;a href=&#34;https://www.buymeacoffee.com/llrs&#34;&gt;donations&lt;/a&gt; to support API access.
It currently costs 100€ to access most endpoints, which is needed to test and develop the package.
However, this is more than half of what I spent on groceries last month.&lt;br /&gt;
Other packages like &lt;a href=&#34;https://cran.r-project.org/package=academictwitteR&#34;&gt;academictwitteR&lt;/a&gt; are also stopping development/support.
Although not archived from CRAN, it has a note in the README:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note this repo is now ARCHVIED due to changes to the Twitter API. The paid API means open-source development of this package is no longer feasible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Similarly, without financial help and community interest, I won’t invest more time in it.&lt;br /&gt;
This is the last version that I release.
I have other interests and I would like to focus on other projects.
My focus will be on updating and releasing some packages I have.
I also want to focus more on my own company to help the R community (and beyond).
I will write about the company shortly.&lt;/p&gt;
&lt;p&gt;There have been some discussions on social media about how to signal the deprecation of packages.
The only method available on CRAN that I know of is to declare a package ORPHANED.
I have requested that CRAN declare the package ORPHANED.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.4 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language en
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2024-02-24
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  xfun          0.42    2024-02-08 [1] CRAN (R 4.3.1)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>Submissions accepted on the first try</title>
  <link>https://llrs.dev/post/2024/01/10/submission-cran-first-try/</link>
  <pubDate>Wed, 10 Jan 2024 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2024/01/10/submission-cran-first-try/</guid>
  <description>


&lt;p&gt;Recently someone on social media was saying that they do not succeed in getting submissions to CRAN accepted on the first try.
In this post I’ll try to find out how common that is.&lt;/p&gt;
&lt;p&gt;First we need the data on submissions to CRAN.
We can download the last 3 years of CRAN submissions thanks to &lt;a href=&#34;https://r-hub.github.io/cransays/articles/dashboard.html&#34;&gt;cransays&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cdh &amp;lt;- cransays::download_history()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is the bulk of the analysis of the history of package submissions.
This is explained across different posts, but basically I keep only one package per snapshot, try to identify new submissions as opposed to changes to the same submission, and calculate some date-related variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;, warn.conflicts = FALSE)
library(&amp;quot;lubridate&amp;quot;, warn.conflicts = FALSE)
library(&amp;quot;tidyr&amp;quot;, warn.conflicts = FALSE)
diff0 &amp;lt;- structure(0, class = &amp;quot;difftime&amp;quot;, units = &amp;quot;hours&amp;quot;)
cran &amp;lt;- cdh |&amp;gt; 
  filter(!is.na(version)) |&amp;gt; 
  distinct() |&amp;gt; 
  arrange(package, snapshot_time) |&amp;gt; 
  group_by(package, snapshot_time) |&amp;gt; 
  # Remove some duplicated packages in different folders
  mutate(n = seq_len(n())) |&amp;gt; 
  filter(n == n()) |&amp;gt; 
  ungroup() |&amp;gt; 
  select(-n) |&amp;gt; 
  arrange(package, snapshot_time, version) |&amp;gt; 
  # Packages last seen in queue less than 24 ago are considered same submission 
  # (even if their version number differs)
  mutate(diff_time = difftime(snapshot_time, lag(snapshot_time), units = &amp;quot;hour&amp;quot;),
         diff_time = if_else(is.na(diff_time), diff0, diff_time), # Fill NAs
         diff_v = version != lag(version),
         diff_v = if_else(is.na(diff_v), TRUE, diff_v), # Fill NAs
         near_t = abs(diff_time) &amp;lt;= 24,
         resubmission = !near_t | diff_v, 
         resubmission = if_else(resubmission == FALSE &amp;amp; diff_time == 0, 
                               TRUE, resubmission),
         resubmission_n = cumsum(as.numeric(resubmission)),
         new_version = !near(diff_time, 1, tol = 24) &amp;amp; diff_v, 
         new_version = if_else(new_version == FALSE &amp;amp; diff_time == 0, 
                               TRUE, new_version),
         submission_n = cumsum(as.numeric(new_version)), .by = package) |&amp;gt; 
  select(-diff_time, -diff_v, -new_version, -near_t) |&amp;gt; 
  mutate(version = package_version(version, strict = FALSE),
         date = as_date(snapshot_time))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we need to compare with the CRAN archive to know if the submission were accepted.&lt;/p&gt;
&lt;p&gt;First we need to retrieve the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_archive &amp;lt;- tools:::CRAN_archive_db()
# When row binding, the data.frames that have only one row lose their row name:
# handle those cases to keep the version number:
archived &amp;lt;- vapply(cran_archive, NROW, numeric(1L))
names(cran_archive)[archived == 1L] &amp;lt;- vapply(cran_archive[archived == 1L], rownames, character(1L))
# Merge current and archive data
cran_dates &amp;lt;- do.call(rbind, cran_archive)
cran_dates$type &amp;lt;- &amp;quot;archived&amp;quot;
current &amp;lt;- tools:::CRAN_current_db()
current$type &amp;lt;- &amp;quot;available&amp;quot;
cran_h &amp;lt;- rbind(current, cran_dates)
# Keep minimal CRAN data archive
cran_h$pkg_v &amp;lt;- basename(rownames(cran_h))
rownames(cran_h) &amp;lt;- NULL
cda &amp;lt;- cran_h |&amp;gt; 
  mutate(strcapture(x = pkg_v, &amp;quot;^(.+)_([0-9]*.+).tar.gz$&amp;quot;, 
                    proto = data.frame(package = character(), version = character())),
         package = if_else(is.na(package), pkg_v, package)) |&amp;gt; 
  arrange(package, mtime) |&amp;gt; 
  mutate(acceptance_n = seq_len(n()), .by = package) |&amp;gt; 
  select(package, pkg_v, version, acceptance_n, date = mtime, uname, type) |&amp;gt; 
  mutate(date = as_date(date))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We use &lt;code&gt;tools:::CRAN_current_db&lt;/code&gt;, because &lt;code&gt;available.packages()&lt;/code&gt; will filter packages based on OS and other options (see the &lt;code&gt;filters&lt;/code&gt; argument).&lt;/p&gt;
&lt;p&gt;We can make a quick detour to plot the number of accepted packages and when they were first published:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
cdas &amp;lt;- cda |&amp;gt; 
  summarize(available = if_else(any(type == &amp;quot;available&amp;quot;), &amp;quot;available&amp;quot;, &amp;quot;archived&amp;quot;),
            published = min(date),
            n_published = max(acceptance_n),
            .by = package)

ggplot(cdas) + 
  geom_point(aes(published, n_published, col = available, shape = available)) +
  theme_minimal() +
  theme(legend.position = c(0.7, 0.8), legend.background = element_rect()) +
  labs(x = element_blank(), y = &amp;quot;Versions&amp;quot;, col = &amp;quot;Status&amp;quot;, shape = &amp;quot;Status&amp;quot;,
       title = &amp;quot;First publication of packages and versions published&amp;quot;) +
  scale_x_date(expand = expansion(), date_breaks = &amp;quot;2 years&amp;quot;, date_labels = &amp;quot;%Y&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/cran-published-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In summary, there are 6291 packages archived and 20304 available.
We can observe that there is a package that had more than 150 versions and was later archived.&lt;/p&gt;
&lt;p&gt;Now we can really compare the submission process with the CRAN archive:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_subm &amp;lt;- cran |&amp;gt; 
  summarise(
    resubmission_n = max(resubmission_n, na.rm = TRUE),
    submission_n = max(submission_n, na.rm = TRUE),
    # The number of submissions 
    submissions = resubmission_n - submission_n + 1,
    date = min(date),
    .by = c(&amp;quot;package&amp;quot;, &amp;quot;version&amp;quot;)) |&amp;gt; 
  arrange(package, version)
# Filter to those packages submitted in the period we have data
cda_acc &amp;lt;- cda |&amp;gt; 
  filter(date &amp;gt;= min(cran_subm$date)) |&amp;gt; 
  select(-pkg_v) |&amp;gt; 
  mutate(version = package_version(version, FALSE))

accepted_subm &amp;lt;- merge(cda_acc, cran_subm, by = c(&amp;quot;package&amp;quot;, &amp;quot;version&amp;quot;),
             suffixes = c(&amp;quot;.cran&amp;quot;, &amp;quot;.subm&amp;quot;), all = TRUE, sort = FALSE) |&amp;gt; 
  arrange(package, version, date.cran, date.subm) |&amp;gt; 
  mutate(submissions = if_else(is.na(submissions), 1, submissions),
         acceptance_n = if_else(is.na(acceptance_n), 0, acceptance_n))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can explore a little bit this data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lp &amp;lt;- scales::label_percent(accuracy = 0.1)
accepted_subm |&amp;gt; 
  summarize(cransays = sum(!is.na(date.subm)),
            cran = sum(!is.na(date.cran)),
            missed_submissions = cran - cransays,
            percentaged_missed = lp(missed_submissions/cran))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;cransays&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;cran&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;missed_submissions&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentaged_missed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;46525&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50413&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3888&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This means that &lt;a href=&#34;https://r-hub.github.io/cransays/articles/dashboard.html&#34;&gt;cransays&lt;/a&gt;, the package used to archive this data, misses ~8% of submissions, probably because they are handled in less than an hour!
Another explanation might be that for some periods the cransays bot didn’t work well…&lt;/p&gt;
&lt;p&gt;On the other hand, we can look at how long it takes for a version to be published on CRAN:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted_subm |&amp;gt; 
  filter(!is.na(date.cran)) |&amp;gt; 
  mutate(time_diff = difftime(date.cran, date.subm, units = &amp;quot;weeks&amp;quot;)) |&amp;gt;
  # Calculate the number of accepted packages since the recording of submissions
  mutate(accepted_n = acceptance_n - min(acceptance_n[acceptance_n != 0L], na.rm = TRUE) + 1, .by = package) |&amp;gt; 
  filter(time_diff &amp;gt;= 0) |&amp;gt; 
  ggplot() + 
  geom_point(aes(date.cran, time_diff, col = accepted_n)) +
  theme_minimal() +
  theme(legend.position = c(0.2, 0.8), legend.background = element_rect()) +
  labs(x = &amp;quot;Published on CRAN&amp;quot;, title = &amp;quot;Time since submitted to CRAN&amp;quot;, 
       y = &amp;quot;Weeks&amp;quot;, col = &amp;quot;Accepted&amp;quot;)
## Don&amp;#39;t know how to automatically pick scale for object of type &amp;lt;difftime&amp;gt;.
## Defaulting to continuous.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/accepted_subm-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I explored some of those outliers, and there is a package that was submitted in 2021 and, two years later, was submitted again with the same version.
In other cases the submission was done with more than 1 hour of tolerance (see the “new_version” variable creation in the second code chunk).&lt;/p&gt;
&lt;p&gt;This means that the path to CRAN might be long and that developers do not always change the version number on each submission.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This section is new after detecting problems with the way it was initially published.&lt;/p&gt;
&lt;p&gt;In the following function I calculate the number of submissions and similar information for each package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;count_submissions &amp;lt;- function(x) {
  x |&amp;gt; 
    mutate(submission_in_period = seq_len(n()),
           date.mix = pmin(date.cran, date.subm, na.rm = TRUE),
           .by = package, .after = acceptance_n) |&amp;gt; 
    summarise(
      # Number of accepted packages on CRAN
      total_accepted = sum(!is.na(date.cran), 0, na.rm = TRUE),
      # At minimum 0 through {cransays}
      through_cransays = sum(!is.na(date.subm), 0, na.rm = TRUE), 
      # In case same version number is submitted at different timepoints
      resubmissions = ifelse(any(!is.na(resubmission_n)), 
                              max(resubmission_n, na.rm = TRUE) - min(resubmission_n, na.rm = TRUE) - through_cransays, 0),
      resubmissions = if_else(resubmissions &amp;lt; 0L, 0L, resubmissions),
      # All submission + those that were duplicated on the submission system
      total_submissions = max(submission_in_period, na.rm = TRUE) + resubmissions,
      # The submissions that were not successful
      total_attempts = total_submissions - total_accepted,
      percentage_failed_submissions = lp(total_attempts/total_accepted), 
      .by = package)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I created a function so that I can apply the same logic to whatever group I want to analyse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Another relevant edit was that the selection criteria changed, as I had missed some packages in some analyses and included others that shouldn’t be there.
Now we are ready to apply it to those packages whose first version got on CRAN:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;first_submissions &amp;lt;- accepted_subm |&amp;gt; 
  group_by(package) |&amp;gt; 
  # Keep submissions that were eventually accepted
  filter(length(acceptance_n != 0L) &amp;gt; 1L &amp;amp;&amp;amp; any(acceptance_n[acceptance_n != 0L] == 1)) |&amp;gt; 
  # Keep submissions until the first acceptance but not after
  filter(cumsum(acceptance_n) &amp;lt;= 1L &amp;amp; seq_len(n()) &amp;lt;= which(acceptance_n == 1L)) |&amp;gt; 
  ungroup()
ffs &amp;lt;- first_submissions |&amp;gt;   
  count_submissions() |&amp;gt; 
  count(total_attempts, sort = TRUE,  name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages, na.rm = TRUE)))
ffs&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;total_attempts&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3390&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;65.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1141&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;21.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;425&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;138&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;72&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;23&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This means that close to 35.0% of first-time submissions are rejected,
including those that are not yet (perhaps never?) on CRAN (~1000).&lt;/p&gt;
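As a sanity check, the 35% figure can be recomputed from the table above with a few lines of base R:

```r
# Package counts per number of rejected first-time attempts (from the table)
attempts = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 16)
packages = c(3390, 1141, 425, 138, 72, 23, 12, 4, 3, 2, 1, 1)

# Share of packages with at least one rejected attempt
rejected_share = sum(packages[attempts != 0]) / sum(packages)
round(100 * rejected_share, 1)  # close to 35%
```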
&lt;p&gt;This points to a problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;developers need to resubmit their packages and spend more time fixing them.&lt;/li&gt;
&lt;li&gt;reviewers need to spend more time (approximately 50% of submissions are at one point or another handled by a human).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After this exercise we might wonder whether this only affects new packages.&lt;br /&gt;
If we look at the submissions that are not the first version of a package, we find the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;submissions_with_accepted &amp;lt;- accepted_subm |&amp;gt; 
  # Filter those that were included on CRAN (not all submissions rejected)
  filter(any(acceptance_n &amp;gt;= 1), .by = package) |&amp;gt; 
  mutate(date.mix = pmin(date.cran, date.subm, na.rm = TRUE)) |&amp;gt; 
  group_by(package) |&amp;gt; 
  arrange(date.mix) |&amp;gt; 
  filter(
    # Those that start at 0 but whose next acceptance is 1 or higher
     cumsum(acceptance_n) &amp;gt;= 1L | 
       min(acceptance_n[acceptance_n != 0L], na.rm = TRUE) &amp;gt;= 2) |&amp;gt; 
  ungroup() 
fs_exp &amp;lt;- count_submissions(submissions_with_accepted)
fs_exp |&amp;gt; 
  count(more_failed = total_accepted &amp;gt; total_attempts, 
            sort = TRUE, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;more_failed&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15337&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;96.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;600&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most packages have more accepted versions than rejected attempts in the period analysed.
Failing the checks on CRAN is normal, but how many extra attempts do submissions to CRAN take?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggrepel&amp;quot;)
ggplot(fs_exp) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_count(aes(total_accepted, total_attempts)) +
  geom_label_repel(aes(total_accepted, total_attempts, label = package), data = . %&amp;gt;% filter(total_attempts &amp;gt;= 10)) +
  labs(x = &amp;quot;CRAN versions&amp;quot;, y = &amp;quot;Rejected submissions&amp;quot;,  size = &amp;quot;Packages&amp;quot;,
       title = &amp;quot;Packages after the first version&amp;quot;, subtitle = &amp;quot;for the period analyzed&amp;quot;) +
  scale_size(trans = &amp;quot;log10&amp;quot;) +
  theme_minimal() +
  theme(legend.position = c(0.8, 0.7), legend.background = element_rect())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/failed-exp-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that there are packages with more than 30 versions on CRAN in these 3 years that never had a rejected submission.
Congratulations!!&lt;/p&gt;
&lt;p&gt;Others have a high number of submissions rejected, and very few versions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fs_exp |&amp;gt; 
  count(total_attempts &amp;gt; total_accepted, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;total_attempts &amp;gt; total_accepted&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15792&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;99.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;145&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Close to 1% of packages need more than two submissions per accepted version.&lt;/p&gt;
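The relation behind that statement can be made explicit with a hypothetical example in base R (the numbers are invented): since attempts are submissions minus accepted versions, attempts exceed accepted versions exactly when submissions are more than double the accepted versions.

```r
# Hypothetical package: 3 accepted versions out of 7 total submissions
total_accepted = 3
total_submissions = 7
total_attempts = total_submissions - total_accepted  # 4 rejected submissions

# The two conditions are equivalent
total_attempts > total_accepted          # TRUE
total_submissions > 2 * total_accepted   # also TRUE
```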
&lt;p&gt;Last we can see the overall experience for developers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fs &amp;lt;- count_submissions(accepted_subm)

ggplot(fs) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_count(aes(total_accepted, total_attempts)) +
  geom_label_repel(aes(total_accepted, total_attempts, label = package), 
                   data = . %&amp;gt;% filter(total_attempts &amp;gt;= 12)) +
  labs(x = &amp;quot;CRAN versions&amp;quot;, y = &amp;quot;Rejected submissions&amp;quot;,  size = &amp;quot;Packages&amp;quot;,
       title = &amp;quot;All packages submissions&amp;quot;, subtitle = &amp;quot;for the period analyzed ~174 weeks&amp;quot;) +
  theme_minimal() +
  scale_size(trans = &amp;quot;log10&amp;quot;) +
  theme(legend.position = c(0.8, 0.7), legend.background = element_rect())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/plot-failed-submissions-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The picture doesn’t change much with experience.
Note that this only adds the packages that were never approved and the submissions made before the first acceptance.
So the changes should only be visible in the bottom left corner of the plot.&lt;/p&gt;
&lt;p&gt;Overall, 14.5% of the attempts end up being rejected.&lt;/p&gt;
&lt;div id=&#34;main-take-away&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Main take away&lt;/h2&gt;
&lt;p&gt;Submitting to CRAN is not easy on the first try, and it usually requires 2 submissions for each accepted version.&lt;br /&gt;
While the &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-devel/R-exts.html&#34;&gt;Writing R Extensions&lt;/a&gt; document is clear, it might be too extensive for many cases.&lt;br /&gt;
The &lt;a href=&#34;https://cran.r-project.org/web/packages/policies.html&#34;&gt;CRAN policy&lt;/a&gt; is short but might not be clear enough for new maintainers.&lt;br /&gt;
A document in the middle might be &lt;a href=&#34;https://r-pkgs.org/&#34;&gt;R Packages&lt;/a&gt;, but it is still extensive and focused on a small, opinionated set of packages.&lt;br /&gt;
A CRAN Task View or some training might be a good way to reduce the overall problem.&lt;br /&gt;
For the maintainers who struggle, clearer technical or editorial decisions might help.&lt;/p&gt;
&lt;p&gt;In addition, the packages having more problems with their submissions are not the new ones: experienced maintainers also have trouble getting their packages accepted.&lt;br /&gt;
This might hint at trouble replicating the CRAN checks or environments, or at the scale of the checks (dependency checks).&lt;br /&gt;
Maybe focusing on helping those packages’ maintainers would be a good way to reduce the load on the CRAN team.&lt;/p&gt;
&lt;p&gt;I also want to note that this analysis could be improved if we knew whether each rejection was automatic or manual.&lt;br /&gt;
This would show the burden on CRAN volunteers and perhaps help define the problem better and propose better solutions.&lt;br /&gt;
It could be attempted by looking at the last folder a package reaches in the submission process, but it would still not be clear what the most frequent problem is.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bonus&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bonus&lt;/h2&gt;
&lt;p&gt;Of all the new packages, more than half of the first versions are already archived (either superseded by newer versions or archived entirely):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted_subm |&amp;gt; 
  filter(acceptance_n == 1L) |&amp;gt; 
  count(status = type, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;status&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;archived&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4763&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;65.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;available&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2517&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;34.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Of them, how many packages have all their versions archived?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fully_archived &amp;lt;- accepted_subm |&amp;gt;
  filter(acceptance_n != 0L) |&amp;gt; 
  filter(any(acceptance_n == 1L), .by = package) |&amp;gt; 
  summarize(archived = all(type == &amp;quot;archived&amp;quot;), .by = package) |&amp;gt; 
  count(archived, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))
fully_archived&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;archived&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6783&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;93.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;497&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only 6.8% of packages were fully archived at the end of the period analysed (2020-09-12 to 2024-01-20).&lt;/p&gt;
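As a quick sanity check, the 6.8% figure follows directly from the counts in the table above:

```r
# Counts from the table above
fully_archived = 497
not_fully_archived = 6783

# Share of packages with all versions archived
share = fully_archived / (fully_archived + not_fully_archived)
round(100 * share, 1)  # about 6.8
```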
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2024-01-20
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
##  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
##  farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.1)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
##  ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
##  ggrepel     * 0.9.5   2024-01-10 [1] CRAN (R 4.3.1)
##  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
##  gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.1)
##  highr         0.10    2022-12-22 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr       * 1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.1)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
##  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
##  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.1)
##  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
##  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.1)
##  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.2)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
##  withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.2)
##  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>Packaging R: repositories</title>
  <link>https://llrs.dev/post/2023/12/09/packaging-r-repositories/</link>
  <pubDate>Sat, 09 Dec 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/12/09/packaging-r-repositories/</guid>
  <description>


&lt;p&gt;In this post I want to collect some thoughts about R repositories.
In R we have multiple repositories that store packages for users.
In this post I want to write about the purpose, functionality, benefits and drawbacks of R repositories and how packages are managed.
The goal is to summarize what I’ve learnt these last years about them.
I’ll also collect some information about them from various sources to make it easier for myself to find it later on.&lt;/p&gt;
&lt;p&gt;I am writing this because I am worried about the future of CRAN and R.
Due to multiple circumstances, the current position is not sustainable as is.
I hope that this post will help me understand the past and the present, and define some concrete steps to take.&lt;/p&gt;
&lt;div id=&#34;history&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;History&lt;/h1&gt;
&lt;p&gt;I was not there, but the first repository started around April 1997.
This repository is CRAN: the Comprehensive R Archive Network.
The &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-devel/1997-April/017026.html&#34;&gt;first mention&lt;/a&gt; I found is already about changes to it, but it was not until the end of the month that &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-announce/1997/000001.html&#34;&gt;it was announced&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;CRAN was created by a few volunteers, some of whom are still maintaining it 25 years later.
The current team is listed on &lt;a href=&#34;https://cran.r-project.org/CRAN_team.htm&#34;&gt;their website&lt;/a&gt;.
From the beginning it was “a collection of sites which carry identical material, consisting of the R&amp;amp;R R distribution(s), the contributed extensions, documentation for R, and binaries.”&lt;/p&gt;
&lt;p&gt;Omegahat was another repository created &lt;a href=&#34;https://omegahat.net/&#34;&gt;shortly after CRAN&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Omega project began in July, 1998, with discussions among designers responsible for three current statistical languages (S, R, and Lisp-Stat), with the idea of working together on new directions with special emphasis on web-based software, Java, the Java virtual machine, and distributed computing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Many developers of Omegahat were in the R Core or CRAN team.
It was available as a repository from the R source code but was definitively removed in R version 4.1, in 2021&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Bioconductor was the next major repository to appear.
It was founded by Robert Gentleman and others in 2004 (when its mailing list started).
A paper describing it &lt;a href=&#34;https://doi.org/10.1186/gb-2004-5-10-r80&#34;&gt;appeared in late 2004&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;an initiative for the collaborative creation of extensible software for computational biology and bioinformatics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Throughout their history, repositories have evolved with R, and R with them.
For example, R was released twice a year at the beginning, and Bioconductor was too.
But when R moved to one release per year (in 2013, with version 3.0), Bioconductor kept its two releases a year.
This introduced some problems when installing packages from Bioconductor, since a single R release can be compatible with two Bioconductor releases&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In other cases, the checks have evolved.
For instance, &lt;a href=&#34;https://en.wikipedia.org/wiki/Oracle_Solaris&#34;&gt;Solaris&lt;/a&gt; was used to test packages on CRAN until 2021, if I recall correctly, because it allowed testing with a proprietary C or C++ compiler.
This led to the discovery of bugs, but also to more distress among R package developers, who had difficulty checking their packages in that environment.&lt;/p&gt;
&lt;p&gt;Other checks evolve with R, becoming stricter with time: in the early versions of R the use of a NAMESPACE was not regulated.
But since R version 2.15 (2012) it has been compulsory, even for data-only packages&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.
This was synchronized with the repositories’ checks.&lt;/p&gt;
&lt;p&gt;Last, some goals or desires of CRAN are not fulfilled (or were abandoned).
For example, from the start CRAN aimed to have packages authenticated (see the bottom of &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-announce/1997/000001.html&#34;&gt;the announcement&lt;/a&gt;).
This might be due to a lack of time or resources, or because the plans are still in progress and require (volunteer) time.&lt;/p&gt;
&lt;p&gt;With time, different repositories arose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MRAN, which was available from September 17th, 2014 until July 1st, 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Rstudio Public Package Manager later renamed &lt;a href=&#34;https://packagemanager.posit.co/&#34;&gt;Posit Public Package Manager&lt;/a&gt; has &lt;a href=&#34;https://posit.co/blog/the-road-to-building-ten-million-binaries/&#34;&gt;binaries for several OS&lt;/a&gt; since 2019.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is the &lt;a href=&#34;https://pkgs.r4pi.org/&#34;&gt;R4pi repository&lt;/a&gt; with binaries for Raspberry Pi.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I also remember a proteomics repository being available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;rOpenSci started its own repository which later evolved into the &lt;a href=&#34;https://r-universe.org&#34;&gt;r-universe&lt;/a&gt;.
The r-universe currently can provide binaries of packages that are hosted in a git repository.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;literature&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Literature&lt;/h1&gt;
&lt;p&gt;The role and prominence of the repositories has led to many articles being written about them.
I want to link and collect some of them here for easier retrieval.&lt;/p&gt;
&lt;p&gt;I was wondering how CRAN is described by the volunteers that built it.
From the announcing email:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;CRAN is a collection of sites which carry identical material, consisting of the R&amp;amp;R R distribution(s), the contributed extensions, documentation for R, and binaries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From the &lt;a href=&#34;https://cran.r-project.org&#34;&gt;website&lt;/a&gt; (at 2023/12/09):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Initially there was R NEWS, with an article dedicated to CRAN and one to Omegahat too.
These articles usually describe new package additions but sometimes they also provide information about changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2001-1-cran&#34;&gt;CRAN-2001-1&lt;/a&gt;: It lists new packages, &lt;a href=&#34;https://journal.r-project.org/news/RN-2001-2-cran&#34;&gt;CRAN-2001-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2001-3-cran&#34;&gt;CRAN-2001-3&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/articles/RN-2001-008/&#34;&gt;Omegahat-2001-3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2002-1-cran&#34;&gt;CRAN-2002-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2002-2-cran/&#34;&gt;CRAN-2002-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2002-3-cran/&#34;&gt;CRAN-2002-3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2003-1-cran/&#34;&gt;CRAN-2003-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2003-2-cran/&#34;&gt;CRAN-2003-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2003-3-cran/&#34;&gt;CRAN-2003-3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2004-1-cran/&#34;&gt;CRAN-2004-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2004-2-cran/&#34;&gt;CRAN-2004-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2005-1-cran/&#34;&gt;CRAN-2005-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2005-2-cran/&#34;&gt;CRAN-2005-2&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since 2006 there is also an article about Bioconductor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2006-2-cran/&#34;&gt;CRAN-2006-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2006-2-bioc&#34;&gt;Bioc-2006-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2007-1-cran/&#34;&gt;CRAN-2007-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2007-2-cran/&#34;&gt;CRAN-2007-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2007-2-bioc&#34;&gt;Bioc-2007-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2007-3-cran/&#34;&gt;CRAN-2007-3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RN-2008-1-cran/&#34;&gt;CRAN-2008-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2008-1-bioc&#34;&gt;Bioc-2008-1&lt;/a&gt; &lt;a href=&#34;https://journal.r-project.org/news/RN-2008-2-cran/&#34;&gt;CRAN-2008-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RN-2008-2-bioc&#34;&gt;Bioc-2008-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Later it became the &lt;a href=&#34;https://journal.r-project.org/&#34;&gt;R Journal&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/issues/2009-1/RJ-2009-1.pdf&#34;&gt;CRAN-2009-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/issues/2009-2/RJ-2009-2.pdf&#34;&gt;CRAN-2009-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/issues/2010-1/RJ-2010-1.pdf&#34;&gt;CRAN-2010-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/issues/2010-2/RJ-2010-2.pdf&#34;&gt;CRAN-2010-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/issues/2011-1/RJ-2011-1.pdf&#34;&gt;CRAN-2011-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/issues/2011-2/RJ-2011-2.pdf&#34;&gt;CRAN and Bioconductor 2011-2&lt;/a&gt;.
In the Bioconductor section it mentions conferences and important directions for the Bioconductor core.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/issues/2012-1/RJ-2012-1.pdf&#34;&gt;CRAN-2012-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/issues/2012-2/RJ-2012-2.pdf&#34;&gt;CRAN and Bioconductor 2012-2&lt;/a&gt;: Mentions &lt;code&gt;biocLite()&lt;/code&gt; to install packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2013-1-cran&#34;&gt;CRAN-2013-1&lt;/a&gt; &lt;a href=&#34;https://journal.r-project.org/news/RJ-2013-1-bioconductor/&#34;&gt;Bioc-2013-1&lt;/a&gt;: mentions better integration of parallel evaluation.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2013-2-cran/&#34;&gt;CRAN-2013-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2013-2-bioconductor/&#34;&gt;Bioc-2013-2&lt;/a&gt;: Mentions again AnnotationHub&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2014-1-cran/&#34;&gt;CRAN-2014-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2014-1-bioconductor/&#34;&gt;Bioc-2014-1&lt;/a&gt;: Mentions the git-svn bridge to synchronize git and svn repository.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2014-2-cran/&#34;&gt;CRAN-2014-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2014-2-bioconductor/&#34;&gt;Bioc-2014-2&lt;/a&gt;: Bioconductor 3.0 release; besides the packages, an Amazon Machine Image is offered, as well as Docker images.
Packages are required to pass BiocCheck, checks in a separate package specific to Bioconductor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2015-1-cran/&#34;&gt;CRAN-2015-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2015-1-bioconductor/&#34;&gt;Bioc-2015-1&lt;/a&gt;: Same mentions as the previous one, plus encouragement to follow the guidelines on package submission.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2015-2-cran/&#34;&gt;CRAN-2015-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2015-2-bioconductor/&#34;&gt;Bioc-2015-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2016-1-cran/&#34;&gt;CRAN-2016-1&lt;/a&gt;: this article has a plot of the number of CRAN packages over time, and no longer lists all the new packages.
It explicitly mentions that the CRAN team asked for help processing package submissions and some people stepped up.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2016-1-bioconductor/&#34;&gt;Bioc-2016-1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2016-2-cran/&#34;&gt;CRAN-2016-2&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2016-2-bioc/&#34;&gt;Bioc-2016-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2017-1-cran/&#34;&gt;CRAN-2017-1&lt;/a&gt;: mentions changes in CRAN checks, adding new memory access and static code analysis checks.
It mentions that the submission has moved to a more automated one.
It also mentions changes in the CRAN Repository Policy.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2017-1-bioc/&#34;&gt;Bioc-2017-1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2018-1-cran/&#34;&gt;CRAN-2018-1&lt;/a&gt;: checks with alternative BLAS/LAPACK implementations; the submission pipeline is defined.
For the first time the number of actions taken by CRAN reviewers is listed, in two categories: automatic and manual.
Changes in the repository policy are listed.
Changes in the location of the package repository archive, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2018-1-bioc/&#34;&gt;Bioc-2018-1&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2018-2-cran/&#34;&gt;CRAN-2018-2&lt;/a&gt;: Changes in policy: packages should not give a check warning or error.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2018-2-bioc/&#34;&gt;Bioc-2018-2&lt;/a&gt;: Moved to BiocManager to install packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2019-1-cran/&#34;&gt;CRAN-2019-1&lt;/a&gt;: More mentions to CRAN mirror security.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2019-2-cran/&#34;&gt;CRAN-2019-2&lt;/a&gt;: Updates in checklist for CRAN submissions, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2019-2-bioc/&#34;&gt;Bioc-2019-2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2020-1-cran/&#34;&gt;CRAN-2020-1&lt;/a&gt;: Many changes in CRAN policies.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2020-2-cran/&#34;&gt;CRAN-2020-2&lt;/a&gt;: Many changes to CRAN policies.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2020-2-bioc/&#34;&gt;Bioc-2020-2&lt;/a&gt;: Announces the Technical and Community advisory boards (as well as the project-wide Code of Conduct).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2021-1-cran/&#34;&gt;CRAN-2021-1&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2021-1-bioc/&#34;&gt;Bioc-2021-1&lt;/a&gt;: Mentions conferences that will be virtual.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2021-2-cran/&#34;&gt;CRAN-2021-2&lt;/a&gt;: Shows an &lt;a href=&#34;https://cran.r-project.org/incoming/&#34;&gt;incoming&lt;/a&gt; path [see &lt;a href=&#34;https://r-hub.github.io/cransays/articles/dashboard.html&#34;&gt;this friendly viewer&lt;/a&gt;], &lt;a href=&#34;https://journal.r-project.org/news/RJ-2021-2-bioc/&#34;&gt;Bioc-2021-2&lt;/a&gt;: Mentions AnVIL and two online workshops to develop workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-1-cran/&#34;&gt;CRAN-2022-1&lt;/a&gt;: Lists a change in CRAN policy and the CRAN Task View Initiative.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-2-cran/&#34;&gt;CRAN-2022-2&lt;/a&gt;: Lists some more repository policies.
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-2-bioconductor/&#34;&gt;Bioc-2022-2&lt;/a&gt;: Lists infrastructure updates (and its funding), changes in the core team and new initiatives.&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-3-cran/&#34;&gt;CRAN-2022-3&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-3-bioconductor/&#34;&gt;Bioc-2022-3&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-4-cran/&#34;&gt;CRAN-2022-4&lt;/a&gt;, &lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-4-bioconductor/&#34;&gt;Bioc-2022-4&lt;/a&gt;: default branch renaming, partnership with Outreachy and blog are featured.
Several working groups provide updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://journal.r-project.org/news/RJ-2023-1-cran/&#34;&gt;CRAN-2023-1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, several articles and blog posts have appeared.
Of those I found, the following are worth mentioning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://doi.org/10.17713/ajs.v41i1.188&#34;&gt;Are There Too Many R Packages?&lt;/a&gt; and &lt;a href=&#34;https://www.r-bloggers.com/2014/04/does-r-have-too-many-packages/&#34;&gt;derived posts&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.jumpingrivers.com/blog/security-r-hacking-bioconductor/&#34;&gt;Hacking Bioconductor&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And my own posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2021/12/07/reasons-cran-archivals/&#34;&gt;Reasons CRAN packages are archived&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2022/07/23/cran-files-1/&#34;&gt;CRAN files part 1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2022/07/28/cran-files-2/&#34;&gt;CRAN files part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/&#34;&gt;CRAN maintained packages&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2021/01/31/cran-review/&#34;&gt;CRAN review&lt;/a&gt; (and the &lt;a href=&#34;https://llrs.dev/talk/user-2021/&#34;&gt;talk at useRs 2021&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2020/07/31/bioconductor-submissions-reviews/&#34;&gt;Bioconductor review&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://llrs.dev/post/2020/09/02/ropensci-submissions/&#34;&gt;rOpenSci&lt;/a&gt; reviews&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The article &lt;a href=&#34;https://journal.r-project.org/articles/RJ-2009-014/&#34;&gt;“Aspects of the Social Organization and Trajectory of the R Project”&lt;/a&gt;, from the R Journal 2009, also has a section about CRAN, noting that it “is challenged by its own success”.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;characteristics&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Characteristics&lt;/h1&gt;
&lt;p&gt;The predominance of CRAN and its role as primary and default R repository has led to some special treatment of the repository.&lt;/p&gt;
&lt;p&gt;CRAN checks are in the R source code itself, while other repositories implement their checks in separate tools.
In addition, the CRAN environmental variables used are documented in the &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-ints.html&#34;&gt;R-internals&lt;/a&gt; (they are more or less accessible in the &lt;a href=&#34;https://svn.r-project.org/R-dev-web/trunk/CRAN/&#34;&gt;svn repository&lt;/a&gt; too).&lt;/p&gt;
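&lt;p&gt;These same checks can be triggered locally; a hedged sketch (the tarball name is a placeholder), using the documented &lt;code&gt;--as-cran&lt;/code&gt; flag and one of the environment variables from R-internals:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Run the CRAN-specific checks on a package tarball
R CMD check --as-cran mypkg_1.0.0.tar.gz
# Individual checks can be toggled with the variables documented in R-internals,
# e.g. _R_CHECK_CRAN_INCOMING_=TRUE&lt;/code&gt;&lt;/pre&gt;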
&lt;p&gt;Others who know more have also stated the benefits of CRAN. The following text is copied from Henrik Bengtsson in the &lt;a href=&#34;https://community-bioc.slack.com/archives/CLF37V6C8/p1698869264884649?thread_ts=1698804037.467439&amp;amp;cid=CLF37V6C8&#34; title=&#34;Link to the thread&#34;&gt;Bioconductor Slack&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;FOREVER ARCHIVE:&lt;/p&gt;
&lt;p&gt;The first one is that it publishes packages and versions of them until the end of time.
When a package has been published on CRAN, it takes a lot for it to be removed from there.
I don’t know if it ever happened, but I can imagine a package can be fully removed if it was illegally published in the first place (e.g. copyright, illegal content, ...) or malicious.&lt;/p&gt;
&lt;p&gt;INSTALLATION SERVICE:&lt;/p&gt;
&lt;p&gt;Then CRAN also provides a R package repository service for installing packages on CRAN using built-in R functions.
The set of packages in the package repo is a subset of all packages on CRAN.
The CRAN package repo makes a promise that all packages listed in PACKAGES can be installed.
If they cannot make that promise, they’ll archive the package (=remove it from PACKAGES).
I should also say, install.packages(url) can be used to install from the set of packages that are archived.
Technically, old package versions are always archived.&lt;/p&gt;
&lt;p&gt;CHECK SERVICE:&lt;/p&gt;
&lt;p&gt;The content of the R package repository is guided by the CRAN package checks that run on R-oldrel, R-release, and R-devel across multiple platforms.
The minimal requirement is that no package should remain in the package repository if the checks detects ERRORs (and those errors are not due to recently introduced bugs in R-devel).
WARNINGs can also cause a package to be archived, but that process often takes longer.
AFAIK, NOTEs are not a cause for a package being archived (but I could be wrong).
The CRAN incoming checks, which you have to pass when you submit a new package, or an updated version, will make sure that the published package pass with all OKs.
(It’s possible to argue for NOTEs being false positives, or for them not to be fixed, but that requires a manual approval by the CRAN Team).&lt;/p&gt;
&lt;/blockquote&gt;
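&lt;p&gt;The &lt;code&gt;install.packages(url)&lt;/code&gt; route mentioned above can be sketched like this (the package version in the URL is illustrative; check the Archive directory for the versions actually available):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install an archived version directly from the CRAN Archive
url &amp;lt;- &amp;quot;https://cran.r-project.org/src/contrib/Archive/XML/XML_3.98-1.20.tar.gz&amp;quot;
install.packages(url, repos = NULL, type = &amp;quot;source&amp;quot;)&lt;/code&gt;&lt;/pre&gt;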
&lt;p&gt;There are surely many more resources discussing R repositories.
If you know of more, I’ll be happy to update this post.&lt;/p&gt;
&lt;p&gt;Rather than drag on, I’ll post this now and collect any articles I might have missed later.&lt;/p&gt;
&lt;p&gt;Last, Uwe Ligges presented about &lt;a href=&#34;https://www.youtube.com/watch?v=-vX-CDiiZKI&#34;&gt;CRAN at useR! 2017&lt;/a&gt;; thanks to Tim Taylor for &lt;a href=&#34;https://fosstodon.org/@_TimTaylor/111612010185631808&#34;&gt;sharing it&lt;/a&gt;. The video also explains why the Solaris OS was used.&lt;/p&gt;
&lt;p&gt;It has come to my attention that there is an article, by G. Brooke Anderson and Dirk Eddelbuettel, about the structure of R package repositories (among other things): &lt;a href=&#34;https://journal.r-project.org/archive/2017/RJ-2017-026/RJ-2017-026.pdf&#34;&gt;Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2024-01-15
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes footnotes-end-of-document&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;In version 3.1.2 &lt;a href=&#34;https://cran.r-project.org/doc/manuals/NEWS.3&#34;&gt;Omegahat didn’t provide&lt;/a&gt; Windows binaries, and in 4.1 it was dropped from the default repositories (See 4.1 in &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/NEWS.html&#34;&gt;NEWS(.4)&lt;/a&gt;).&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;This led to the need for a special function to install packages from Bioconductor:
initially the function &lt;code&gt;biocLite&lt;/code&gt; and later the &lt;a href=&#34;https://cran.r-project.org/package=BiocManager&#34;&gt;BiocManager package&lt;/a&gt;.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://cran.r-project.org/doc/manuals/NEWS.2&#34;&gt;NEWS in 2.15 section&lt;/a&gt;&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>BaseSet 0.9.0</title>
  <link>https://llrs.dev/post/2023/08/23/baseset-0-9-0/</link>
  <pubDate>Wed, 23 Aug 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/08/23/baseset-0-9-0/</guid>
  <description>


&lt;p&gt;I’m excited to announce a new release of &lt;a href=&#34;https://cran.r-project.org/package=BaseSet&#34;&gt;BaseSet&lt;/a&gt;, the package implementing a class and methods to work with (fuzzy) sets.&lt;/p&gt;
&lt;p&gt;This release focused on making the package easier to work with.&lt;/p&gt;
&lt;p&gt;From the beginning it was engineered towards the tidyverse; this time I focused on base R methods like &lt;code&gt;$&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;:&lt;/p&gt;
&lt;div id=&#34;new-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;New methods&lt;/h2&gt;
&lt;p&gt;First we can create a TidySet or TS for short:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;BaseSet&amp;quot;, warn.conflicts = FALSE)
packageVersion(&amp;quot;BaseSet&amp;quot;)
## [1] &amp;#39;0.9.0&amp;#39;
l &amp;lt;- list(A = &amp;quot;1&amp;quot;,
     B = c(&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;),
     C = c(&amp;quot;2&amp;quot;, &amp;quot;3&amp;quot;, &amp;quot;4&amp;quot;),
     D = c(&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;3&amp;quot;, &amp;quot;4&amp;quot;)
)
TS &amp;lt;- tidySet(l)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Until now there was no compatibility with base R methods, only with the tidyverse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TSa &amp;lt;- TS[[&amp;quot;A&amp;quot;]]
TSb &amp;lt;- TS[[&amp;quot;B&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Maybe this doesn’t look like much, but previously it wasn’t possible to subset the class.
Initially I thought that working with a single class per session would be enough.
Later I realized that maybe people would have good reasons to split or combine multiple objects:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TSab &amp;lt;- c(TSa, TSb)
TSab
##   elements sets fuzzy
## 1        1    A     1
## 2        1    B     1
## 3        2    B     1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that subsetting by sets does not produce the same object, as elements are kept:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(TSab)
##  Elements Relations      Sets 
##         2         3         2
dim(TS[1:2, &amp;quot;sets&amp;quot;])
##  Elements Relations      Sets 
##         4         3         2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You’ll need to drop the elements:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(droplevels(TS[1:2, &amp;quot;sets&amp;quot;]))
##  Elements Relations      Sets 
##         2         3         2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can include more information like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TSab[1:2, &amp;quot;relations&amp;quot;, &amp;quot;type&amp;quot;] &amp;lt;- c(&amp;quot;new&amp;quot;, &amp;quot;addition&amp;quot;)
TSab[1:2, &amp;quot;sets&amp;quot;, &amp;quot;origin&amp;quot;] &amp;lt;- c(&amp;quot;fake&amp;quot;, &amp;quot;real&amp;quot;)
TSab
##   elements sets fuzzy     type origin
## 1        1    A     1      new   fake
## 2        1    B     1 addition   real
## 3        2    B     1     &amp;lt;NA&amp;gt;   real&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this release is easier to access the columns of the TidySet:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TSab$type
## [1] &amp;quot;new&amp;quot;      &amp;quot;addition&amp;quot; NA
TSab$origin
## [1] &amp;quot;fake&amp;quot; &amp;quot;real&amp;quot;
TS$sets
##  [1] &amp;quot;A&amp;quot; &amp;quot;B&amp;quot; &amp;quot;B&amp;quot; &amp;quot;C&amp;quot; &amp;quot;C&amp;quot; &amp;quot;C&amp;quot; &amp;quot;D&amp;quot; &amp;quot;D&amp;quot; &amp;quot;D&amp;quot; &amp;quot;D&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you pay attention you’ll notice that it looks up the minimum information required.
If the column is present in both the relations slot and the elements or sets slot, it picks the first one.&lt;/p&gt;
&lt;p&gt;You can use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TS[, &amp;quot;sets&amp;quot;, &amp;quot;new&amp;quot;] &amp;lt;- &amp;quot;a&amp;quot;
TS[, &amp;quot;sets&amp;quot;, &amp;quot;new&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I recommend reading carefully the help page of &lt;code&gt;?`extract-TidySet`&lt;/code&gt; and make some tests based on the examples.
I might have created some bugs or friction points with the extraction operations, let me know and I’ll address them (That’s the reason why I kept it below a 1.0 release).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;more-usable&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;More usable&lt;/h1&gt;
&lt;p&gt;Another usability addition to the class is the possibility to autocomplete.&lt;/p&gt;
&lt;p&gt;Now if you type &lt;code&gt;TS$ty&lt;/code&gt; and press TAB, it should complete to &lt;code&gt;TS$type&lt;/code&gt; because there is a column called type. This makes the &lt;code&gt;$&lt;/code&gt; operator easier to use.&lt;/p&gt;
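&lt;p&gt;Completion after &lt;code&gt;$&lt;/code&gt; is typically provided by a method for the &lt;code&gt;utils::.DollarNames&lt;/code&gt; generic; a minimal S3 sketch of the mechanism (a hypothetical class, not BaseSet’s actual implementation) could look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical class to illustrate how $ autocompletion works
.DollarNames.myclass &amp;lt;- function(x, pattern = &amp;quot;&amp;quot;) {
  # Offer the column names matching what the user has typed so far
  grep(pattern, attr(x, &amp;quot;columns&amp;quot;), value = TRUE)
}&lt;/code&gt;&lt;/pre&gt;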
&lt;p&gt;With this release, we can now check the number of sets and the number of relations each set has:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(TS)
## [1] 4
lengths(TS)
## A B C D 
## 1 2 3 4&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;new-function&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;New function&lt;/h2&gt;
&lt;p&gt;The new function &lt;code&gt;union_closed&lt;/code&gt; checks if the combinations of sets produce already existing sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;union_closed(TS, sets = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;))
## [1] FALSE
union_closed(TS)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;next-steps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Next steps&lt;/h1&gt;
&lt;p&gt;I hope this makes it even easier to work with the class: combining different objects and manipulating them more intuitively.&lt;/p&gt;
&lt;p&gt;While creating this document I realized it has some friction points.&lt;br /&gt;
In the next release it will be possible to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Subset the object by element or set name, if only querying elements and sets slots.
For example &lt;code&gt;TS[c(&#34;3&#34;, &#34;4&#34;), &#34;elements&#34;, &#34;NEWS&#34;] &amp;lt;- TRUE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;names&lt;/code&gt; and &lt;code&gt;dimnames&lt;/code&gt; to discover which data is in the object.&lt;/li&gt;
&lt;li&gt;Fix some bugs in these methods.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Enjoy!&lt;/p&gt;
&lt;p&gt;I would also appreciate some feedback on how you are using the package.
It will help me direct the development and maintenance of the package where it is most useful.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2023-12-18
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  BaseSet     * 0.9.0   2023-08-23 [1] local
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
##  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  rlang         1.1.2   2023-11-04 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
##  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
##  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.2)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
##  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>CRAN maintained packages</title>
  <link>https://llrs.dev/post/2023/05/03/cran-maintained-packages/</link>
  <pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/05/03/cran-maintained-packages/</guid>
  <description>


&lt;p&gt;The role of package managers in software is paramount for developers.
In R, the CRAN team provides a platform to test and host packages.
This means ensuring that R dependencies are up to date and that software required by some packages is also available on CRAN.&lt;/p&gt;
&lt;p&gt;This helps testing ~20000 packages frequently (daily for most packages) in several architectures and R versions.
In addition, they test updates for compatibility with the dependencies and test and review new packages.&lt;/p&gt;
&lt;p&gt;Most of the work with packages is automated, but it often requires human intervention (&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-4-cran/#cran-package-submissions&#34;&gt;50% of the submissions&lt;/a&gt;).
Another time-consuming activity is maintaining packages abandoned by their original maintainers.&lt;/p&gt;
&lt;p&gt;While newer packages are often &lt;a href=&#34;https://llrs.dev/post/2021/12/07/reasons-cran-archivals/&#34;&gt;archived from CRAN&lt;/a&gt;, some old packages were adopted by CRAN.
The &lt;a href=&#34;https://cran.r-project.org/CRAN_team.htm&#34;&gt;CRAN team&lt;/a&gt; is &lt;a href=&#34;https://mastodon.social/@henrikbengtsson/110186925898457474&#34;&gt;looking for help&lt;/a&gt; maintaining those.&lt;/p&gt;
&lt;p&gt;In this post I’ll explore the packages maintained by CRAN.&lt;/p&gt;
&lt;div id=&#34;cran-in-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;CRAN in packages&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;packages_db &amp;lt;- as.data.frame(tools::CRAN_package_db())
cran_author &amp;lt;- grep(&amp;quot;CRAN Team&amp;quot;, x = packages_db$Author, ignore.case = TRUE)
cran_authorsR &amp;lt;- grep(&amp;quot;CRAN Team&amp;quot;, x = packages_db$`Authors@R`, ignore.case = TRUE)
CRAN_TEAM_mentioned &amp;lt;- union(cran_author, cran_authorsR)
unique(packages_db$Package[CRAN_TEAM_mentioned])
## [1] &amp;quot;fBasics&amp;quot;   &amp;quot;fMultivar&amp;quot; &amp;quot;geiger&amp;quot;    &amp;quot;plotrix&amp;quot;   &amp;quot;RCurl&amp;quot;     &amp;quot;RJSONIO&amp;quot;  
## [7] &amp;quot;udunits2&amp;quot;  &amp;quot;XML&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In some of these packages the CRAN team appears as a contributor because they provided help or code to fix bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=geiger&#34;&gt;geiger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=fMultivar&#34;&gt;fMultivar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=fBasics&#34;&gt;fBasics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=udunits2&#34;&gt;udunits2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In others they are the maintainers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=XML&#34;&gt;XML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=RCurl&#34;&gt;RCurl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=RJSONIO&#34;&gt;RJSONIO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of these three packages, RJSONIO is the newest (first released in 2010) and requires fewer updates (lately 1 or 2 a year).
However, in 2022 RCurl and XML required 4 and 5 updates, respectively.
I will focus on these packages, as they are the ones for which new maintainers are sought.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rcurl-and-xml&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;RCurl and XML&lt;/h1&gt;
&lt;div id=&#34;circular-dependency&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Circular dependency&lt;/h2&gt;
&lt;p&gt;Both XML and RCurl depend on each other.&lt;/p&gt;
&lt;p&gt;We can see that each package is a direct dependency of one of its own direct dependencies!
How can that be?
If we go to the &lt;a href=&#34;https://cran.r-project.org/package=RCurl&#34;&gt;RCurl&lt;/a&gt; website we see “Suggests: XML”, and on the &lt;a href=&#34;https://cran.r-project.org/package=XML&#34;&gt;XML&lt;/a&gt; website RCurl is there too.
This circular dependency is allowed because they have each other in Suggests.&lt;/p&gt;
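&lt;p&gt;The circular dependency can be confirmed with &lt;code&gt;tools::package_dependencies()&lt;/code&gt; (the output depends on the current state of CRAN):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db &amp;lt;- available.packages()
# Each package lists the other one in its Suggests field
tools::package_dependencies(&amp;quot;RCurl&amp;quot;, db = db, which = &amp;quot;Suggests&amp;quot;)
tools::package_dependencies(&amp;quot;XML&amp;quot;, db = db, which = &amp;quot;Suggests&amp;quot;)&lt;/code&gt;&lt;/pre&gt;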
&lt;p&gt;A first step to reduce any possible problems would be to decouple them.
This would make it easier to understand which package is worth prioritizing, and possible missteps would have less impact.&lt;/p&gt;
&lt;p&gt;If we look at &lt;a href=&#34;https://github.com/search?q=repo%3Acran%2FXML%20RCurl&amp;amp;type=code&#34;&gt;XML source code for RCurl we find&lt;/a&gt; some code in the &lt;code&gt;inst/&lt;/code&gt; folder.
If these two cases were removed, the package could drop its dependency on RCurl.&lt;/p&gt;
&lt;p&gt;Similarly, if we look at &lt;a href=&#34;https://github.com/search?q=repo%3Acran%2FRCurl%20XML&amp;amp;type=code&#34;&gt;RCurl source code for XML we find&lt;/a&gt; some code in the &lt;code&gt;inst/&lt;/code&gt; folder and in some examples.
If these three cases were removed, the package could drop its dependency on XML.&lt;/p&gt;
&lt;p&gt;RCurl has been &lt;a href=&#34;https://diffify.com/R/RCurl/1.95-4.9/1.98-1.12&#34;&gt;more stable&lt;/a&gt; than XML, which has seen &lt;a href=&#34;https://diffify.com/R/XML/3.98-1.7/3.99-0.14&#34;&gt;new functions added and one removed&lt;/a&gt; since CRAN took over its maintenance.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;relevant-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Relevant data&lt;/h2&gt;
&lt;p&gt;We will look at 4 sets of data for each package: &lt;a href=&#34;#dependencies&#34;&gt;dependencies&lt;/a&gt;, &lt;a href=&#34;#releases&#34;&gt;releases&lt;/a&gt;, &lt;a href=&#34;#maintainers&#34;&gt;maintainers&lt;/a&gt; and &lt;a href=&#34;#downloads&#34;&gt;downloads&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;dependencies&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Dependencies&lt;/h3&gt;
&lt;p&gt;Both packages have some system dependencies, which might make the maintenance harder.
In addition, they have a large number of reverse dependencies.
We can gather them from CRAN and Bioconductor software packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tools&amp;quot;)
# Look up only software dependencies in Bioconductor
options(repos = BiocManager::repositories()[c(&amp;quot;BioCsoft&amp;quot;, &amp;quot;CRAN&amp;quot;)])
ap &amp;lt;- available.packages()
all_deps &amp;lt;- package_dependencies(c(&amp;quot;RCurl&amp;quot;, &amp;quot;XML&amp;quot;), 
                                 reverse = TRUE, db = ap, which = &amp;quot;all&amp;quot;)
all_unique_deps &amp;lt;- unique(unlist(all_deps, FALSE, FALSE))
first_deps &amp;lt;- package_dependencies(all_unique_deps, db = ap, which = &amp;quot;all&amp;quot;)
first_deps_strong &amp;lt;- package_dependencies(all_unique_deps, db = ap, which = &amp;quot;strong&amp;quot;)
strong &amp;lt;- sapply(first_deps_strong, function(x){any(c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;) %in% x)})
deps_strong &amp;lt;- package_dependencies(all_unique_deps, recursive = TRUE, 
                                 db = ap, which = &amp;quot;strong&amp;quot;)
first_rdeps &amp;lt;- package_dependencies(all_unique_deps, 
                                   reverse = TRUE, db = ap, which = &amp;quot;all&amp;quot;)
deps_all &amp;lt;- package_dependencies(all_unique_deps, recursive = TRUE, 
                                 db = ap, which = &amp;quot;all&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They have 495 direct reverse dependencies (and 8 more in annotation packages in Bioconductor: recount3, ENCODExplorerData, UCSCRepeatMasker, gDNAinRNAseqData, qdap, qdapTools, metaboliteIDmapping and curatedBreastData).&lt;/p&gt;
&lt;p&gt;These two packages with their dependencies are used one way or another by around 20000 packages (about 90% of CRAN and Bioconductor)!
If these packages fail the impact on the community will be huge.&lt;/p&gt;
&lt;p&gt;To reduce the impact of the dependencies, we should look at the direct dependencies.
But we also looked at the reverse dependencies to assess the impact of the packages on other packages.&lt;/p&gt;
&lt;p&gt;Knowing which these are, and who maintains them, will help decide the best course of action.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;releases&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Releases&lt;/h3&gt;
&lt;p&gt;A first approach is looking into the number of releases and their dates to assess whether the package has an active maintainer:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;archive &amp;lt;- tools:::CRAN_archive_db()[all_unique_deps]
packages &amp;lt;- tools::CRAN_package_db()
library(&amp;quot;dplyr&amp;quot;)
library(&amp;quot;BiocPkgTools&amp;quot;)
fr &amp;lt;- vapply(archive, function(x) {
  if (is.null(x)) {
    return(NA)
  }
  as.Date(x$mtime[1])
}, FUN.VALUE = Sys.Date())
fr_bioc &amp;lt;- biocDownloadStats(&amp;quot;software&amp;quot;) |&amp;gt; 
  filter(Package %in% all_unique_deps) |&amp;gt; 
  firstInBioc() |&amp;gt; 
  pull(Date, name = Package)
first_release &amp;lt;- c(as.Date(fr[!is.na(fr)]), as.Date(fr_bioc))[all_unique_deps]
last_update &amp;lt;- packages$Published[match(all_unique_deps, packages$Package)]
releases &amp;lt;- vapply(archive, NROW, numeric(1L)) + 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We only have this information for CRAN packages:&lt;br /&gt;
Bioconductor has two releases every year, and while maintainers can release patched versions of packages between them, that information is not stored (or at least not easily retrieved; the versions are still available in the &lt;a href=&#34;https://code.bioconductor.org&#34;&gt;git server&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Even if Bioconductor maintainers didn’t modify the package, the version number increases with each release.
And a version update in git doesn’t propagate to users automatically unless the checks pass.
For all these reasons it doesn’t make sense to count releases of Bioconductor packages.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;maintainers&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Maintainers&lt;/h3&gt;
&lt;p&gt;Now that we know which packages are more active, we can look up the people behind them.
This way we can prioritize working with maintainers that are known to be active&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maintainers &amp;lt;- packages$Maintainer[match(all_unique_deps, packages$Package)]
maintainers &amp;lt;- trimws(gsub(&amp;quot;&amp;lt;.+&amp;gt;&amp;quot;, &amp;quot;&amp;quot;, maintainers))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, the Bioconductor repository doesn’t provide a file to gather this kind of data.&lt;/p&gt;
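&lt;p&gt;A possible workaround, assuming &lt;code&gt;BiocPkgTools&lt;/code&gt; is available, is to take the maintainers from the package listing it builds rather than from a repository file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# biocPkgList() returns a table with a Maintainer column
bioc_pkgs &amp;lt;- BiocPkgTools::biocPkgList()
bioc_maintainers &amp;lt;- trimws(gsub(&amp;quot;&amp;lt;.+&amp;gt;&amp;quot;, &amp;quot;&amp;quot;, bioc_pkgs$Maintainer))&lt;/code&gt;&lt;/pre&gt;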
&lt;/div&gt;
&lt;div id=&#34;downloads&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Downloads&lt;/h3&gt;
&lt;p&gt;Another variable we can use is the number of downloads of said packages.
Packages that are downloaded more are probably used more, and a breaking change in them will have an impact on more people than one in other packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;cranlogs&amp;quot;)
acd &amp;lt;- cran_downloads(intersect(all_unique_deps, packages$Package), 
                      when = &amp;quot;last-month&amp;quot;)
cran_pkg &amp;lt;- summarise(acd, downloads = sum(count), .by = package)
loc &amp;lt;- Sys.setlocale(locale = &amp;quot;C&amp;quot;)
bioc_d &amp;lt;- vapply(setdiff(all_unique_deps, packages$Package), function(x){
  pkg &amp;lt;- pkgDownloadStats(x)
  tail(pkg$Nb_of_downloads, 1)
  }, numeric(1L))
bioc_pkg &amp;lt;- data.frame(package = names(bioc_d), downloads = bioc_d)
downloads &amp;lt;- rbind(bioc_pkg, cran_pkg)
rownames(downloads) &amp;lt;- downloads$package
dwn &amp;lt;- downloads[all_unique_deps, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The logs are provided by the global mirror of CRAN (sponsored by RStudio).&lt;br /&gt;
The Bioconductor infrastructure provides the total number of downloads and the number of downloads from distinct IPs&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;We collected the data that might be relevant.
Now, we can start looking at all the data gathered:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;repo &amp;lt;- vector(&amp;quot;character&amp;quot;, length(all_unique_deps))
ap_deps &amp;lt;- ap[all_unique_deps, ]
repo[startsWith(ap_deps[, &amp;quot;Repository&amp;quot;], &amp;quot;https://bioc&amp;quot;)] &amp;lt;- &amp;quot;Bioconductor&amp;quot;
repo[!startsWith(ap_deps[, &amp;quot;Repository&amp;quot;], &amp;quot;https://bioc&amp;quot;)] &amp;lt;- &amp;quot;CRAN&amp;quot;
deps &amp;lt;- data.frame(package = all_unique_deps,
                   direct_dep_XML = all_unique_deps %in% all_deps$XML,
                   direct_dep_RCurl = all_unique_deps %in% all_deps$RCurl,
                   first_deps_n = lengths(first_deps),
                   deps_all_n = lengths(deps_all),
                   first_rdeps_n = lengths(first_rdeps),
                   first_deps_strong_n = lengths(first_deps_strong), 
                   deps_strong_n = lengths(deps_strong),
                   direct_strong = strong, 
                   releases = releases,
                   strong = strong, 
                   first_release = first_release,
                   last_release = last_update,
                   maintainer = maintainers,
                   downloads = dwn$downloads,
                   repository = repo) |&amp;gt; 
  mutate(type = case_when(direct_dep_XML &amp;amp; direct_dep_RCurl ~ &amp;quot;both&amp;quot;,
                          direct_dep_XML ~ &amp;quot;XML&amp;quot;,
                          direct_dep_RCurl ~ &amp;quot;RCurl&amp;quot;))
rownames(deps) &amp;lt;- NULL
head(deps)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;package&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_XML&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_RCurl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_deps_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps_all_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_rdeps_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_deps_strong_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps_strong_n&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_strong&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;releases&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;strong&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;first_release&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;last_release&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;maintainer&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;downloads&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;repository&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;AnnotationForge&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2012-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;AnnotationHubData&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;136&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2015-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6619&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;autonomics&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2499&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;104&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;91&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BaseSpaceR&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2013-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;218&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BayesSpace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2459&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;161&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;221&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BgeeDB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2457&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;238&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I added some numbers and logical values that might help explore this data.&lt;/p&gt;
&lt;p&gt;We will look at the &lt;a href=&#34;#distribution-dependencies&#34;&gt;distribution of dependencies between RCurl and XML&lt;/a&gt; and at some plots to get a &lt;a href=&#34;#overview&#34;&gt;quick view&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;distribution-dependencies&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Distribution dependencies&lt;/h3&gt;
&lt;p&gt;Let’s see how many packages depend on each of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps |&amp;gt; 
  summarise(Packages = n(), deps = sum(first_deps_n),
            q25 = quantile(deps_all_n, probs = 0.25),
            mean_all = mean(deps_all_n),
            q75 = quantile(deps_all_n, probs = 0.75),
            .by = c(direct_dep_XML, direct_dep_RCurl)) |&amp;gt; 
  arrange(-Packages)&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;22%&#34; /&gt;
&lt;col width=&#34;25%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_XML&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_RCurl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Packages&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;q25&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean_all&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;q75&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;235&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3584&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2365.596&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2458.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3187&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2320.855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2460.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;67&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1216&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2423.119&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2457.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There are ~40 more packages depending on XML than on RCurl, and just 67 depend on both of them.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;overview&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;We can plot some variables to get a quick overview of the packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
library(&amp;quot;ggrepel&amp;quot;)
deps_wo &amp;lt;- filter(deps, !package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;))
deps_wo |&amp;gt; 
  ggplot() +
  geom_point(aes(first_deps_n, downloads, shape = type)) +
  geom_text_repel(aes(first_deps_n, downloads, label = package),
                  data = filter(deps_wo, first_deps_n &amp;gt; 40 | downloads &amp;gt; 10^5)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Packages and downloads&amp;quot;, 
       x = &amp;quot;Direct dependencies&amp;quot;, y = &amp;quot;Downloads&amp;quot;, size = &amp;quot;Packages&amp;quot;)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot1-1.png&#34; alt=&#34;Direct dependencies vs downloads. Many packages depend on up to 50 packages and most have below 1000 downloads in a month.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Direct dependencies vs downloads. Many packages depend on up to 50 packages and most have below 1000 downloads in a month.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There is an outlier in Figure &lt;a href=&#34;#fig:plot1&#34;&gt;1&lt;/a&gt;: the mlr package has more than 10k downloads and close to 120 direct dependencies, but fewer than 15 strong dependencies!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps_wo |&amp;gt; 
  ggplot() +
  geom_point(aes(first_deps_n, first_rdeps_n, shape = type)) +
  geom_text_repel(aes(first_deps_n, first_rdeps_n, label = package),
                  data = filter(deps_wo, first_deps_n &amp;gt; 60 | first_rdeps_n &amp;gt; 50)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Few dependencies but lots of dependents&amp;quot;,
    x = &amp;quot;Direct dependencies&amp;quot;, y = &amp;quot;Depend on them&amp;quot;, size = &amp;quot;Packages&amp;quot;)
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot2&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot2-1.png&#34; alt=&#34;Dependencies vs packages that depend on them. &#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Dependencies vs packages that depend on them.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In general, though, the packages that more packages depend on have fewer direct dependencies themselves.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
library(&amp;quot;ggrepel&amp;quot;)
deps_wo &amp;lt;- filter(deps, !package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;))
deps_wo |&amp;gt; 
  ggplot() +
  geom_vline(xintercept = 20, linetype = 2) +
  geom_point(aes(first_deps_strong_n, downloads, shape = repository)) +
  geom_text_repel(aes(first_deps_strong_n, downloads, label = package),
                  data = filter(deps_wo, first_deps_strong_n &amp;gt; 20 | downloads &amp;gt; 10^5)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Packages and downloads&amp;quot;, 
       x = &amp;quot;Direct strong dependencies&amp;quot;, y = &amp;quot;Downloads&amp;quot;, shape = &amp;quot;Repository&amp;quot;)
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot3&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot3-1.png&#34; alt=&#34;Direct strong dependencies vs downloads. Many packages have more than 20 direct imports.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Direct strong dependencies vs downloads. Many packages have more than 20 direct imports.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;One observable effect is that many packages do not comply with the current CRAN limit of 20 strong dependencies (as &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#index-_005fR_005fCHECK_005fEXCESSIVE_005fIMPORTS_005f&#34;&gt;described in R Internals&lt;/a&gt;).
This suggests that these CRAN packages predate the limit or that it is not checked on package updates.&lt;/p&gt;
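&lt;p&gt;This check is controlled by an environment variable, so we could run it locally too; a minimal sketch (the variable is documented in R Internals, the threshold value is our choice):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Ask R CMD check to flag packages importing from more than 20 namespaces
Sys.setenv(&amp;quot;_R_CHECK_EXCESSIVE_IMPORTS_&amp;quot; = &amp;quot;20&amp;quot;)
# then run R CMD check on the package from the same session&lt;/code&gt;&lt;/pre&gt;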
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_maintainers &amp;lt;- deps_wo |&amp;gt; 
  filter(!is.na(maintainer)) |&amp;gt; 
  summarize(n = n(), downloads = sum(downloads), .by = maintainer)
data_maintainers |&amp;gt; 
  ggplot() +
  geom_point(aes(n, downloads)) +
  geom_text_repel(aes(n, downloads, label = maintainer),
                  data = filter(data_maintainers, n &amp;gt; 2 | downloads &amp;gt; 10^4)) +
  scale_y_log10(labels = scales::label_log()) +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  theme_minimal() +
  labs(title = &amp;quot;CRAN maintainers that depend on XML and RCurl&amp;quot;,
       x = &amp;quot;Packages&amp;quot;, y = &amp;quot;Downloads&amp;quot;)
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot-maintainers&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot-maintainers-1.png&#34; alt=&#34;Looking at maintainers and the number of downloads they have.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Looking at maintainers and the number of downloads they have.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most maintainers have few packages, some of them highly used, but some maintainers have many packages that are relatively highly used.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-important-packages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Finding important packages&lt;/h3&gt;
&lt;p&gt;We can use a PCA to find which packages are most important.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cols_pca &amp;lt;-  c(4:7, 15)
pca_all &amp;lt;- prcomp(deps_wo[, cols_pca], scale. = TRUE, center = TRUE)
summary(pca_all)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.386 1.2478 0.9458 0.65380 0.44846
## Proportion of Variance 0.384 0.3114 0.1789 0.08549 0.04022
## Cumulative Proportion  0.384 0.6954 0.8743 0.95978 1.00000
pca_data &amp;lt;- cbind(pca_all$x, deps_wo)
ggplot(pca_data) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = repository, shape = repository)) +
  geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;PCA of the numeric variables&amp;quot;)
## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-all&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-all-1.png&#34; alt=&#34;PCA of all packages.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 5: PCA of all packages.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The first principal component separates packages that depend on many packages, directly or in total.
The second one separates packages with many downloads and/or many reverse dependencies, as shown by &lt;code&gt;rotation&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_all$rotation[, 1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_deps_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6521642&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.1528947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;deps_all_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.3304698&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0549046&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_rdeps_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1235972&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6948659&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_deps_strong_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6606765&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0750116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;downloads&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1170554&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6965223&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More important are the packages named in Figure &lt;a href=&#34;#fig:pca-all&#34;&gt;5&lt;/a&gt;: RUnit, markdown and rgeos have a high number of downloads, and many packages depend on them one way or another.&lt;/p&gt;
&lt;p&gt;However, we can focus on the packages that wouldn’t work without RCurl or XML:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_strong &amp;lt;- prcomp(deps_wo[deps_wo$strong, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_strong)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4198 1.3005 0.9373 0.49421 0.41258
## Proportion of Variance 0.4032 0.3382 0.1757 0.04885 0.03404
## Cumulative Proportion  0.4032 0.7414 0.9171 0.96596 1.00000
pca_data_strong &amp;lt;- cbind(pca_strong$x, deps_wo[deps_wo$strong, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = repository, shape = repository)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Important packages depending on XML and RCurl&amp;quot;, 
       subtitle = &amp;quot;PCA of numeric variables of strong dependencies&amp;quot;,
       col = &amp;quot;Repository&amp;quot;, shape = &amp;quot;Repository&amp;quot;)
## Warning: ggrepel: 42 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-strong&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-strong-1.png&#34; alt=&#34;PCA of packages with strong dependency to XML or RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 6: PCA of packages with strong dependency to XML or RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The main packages that depend on XML and RCurl are from Bioconductor, followed by mlr and rlist.
rlist has XML as a dependency but only uses 3 functions from it.
mlr uses 5 different functions from XML.&lt;/p&gt;
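&lt;p&gt;One way to check how much a package relies on another, assuming it is installed locally, is to inspect its declared namespace imports (this only counts &lt;code&gt;importFrom()&lt;/code&gt; entries, not &lt;code&gt;XML::&lt;/code&gt; calls inside the code):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Functions a package declares to import from XML in its NAMESPACE
imp &amp;lt;- getNamespaceImports(&amp;quot;rlist&amp;quot;)
unique(unlist(imp[names(imp) == &amp;quot;XML&amp;quot;]))&lt;/code&gt;&lt;/pre&gt;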
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_weak &amp;lt;- prcomp(deps_wo[!deps_wo$strong, cols_pca], 
                   scale. = TRUE, center = TRUE)
summary(pca_weak)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4500 1.1578 0.9901 0.63980 0.40895
## Proportion of Variance 0.4205 0.2681 0.1960 0.08187 0.03345
## Cumulative Proportion  0.4205 0.6886 0.8847 0.96655 1.00000
pca_data_weak &amp;lt;- cbind(pca_weak$x, deps_wo[!deps_wo$strong, ])
ggplot(pca_data_weak) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
  geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_weak, abs(PC1)&amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;PCA of packages in CRAN&amp;quot;, col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-weak&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-weak-1.png&#34; alt=&#34;Packages with weak dependency to XML or RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 7: Packages with weak dependency to XML or RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;keep &amp;lt;- deps_wo$repository == &amp;quot;CRAN&amp;quot; &amp;amp; deps_wo$strong
pca_cran &amp;lt;- prcomp(deps_wo[keep, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_cran)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4174 1.3060 0.9244 0.51813 0.40278
## Proportion of Variance 0.4018 0.3412 0.1709 0.05369 0.03245
## Cumulative Proportion  0.4018 0.7430 0.9139 0.96755 1.00000
pca_data_strong &amp;lt;- cbind(pca_cran$x, deps_wo[keep, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Packages in CRAN&amp;quot;, 
       col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-cran&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-cran-1.png&#34; alt=&#34;PCA of packages on CRAN.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 8: PCA of packages on CRAN.
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;keep &amp;lt;- deps_wo$repository == &amp;quot;Bioconductor&amp;quot;  &amp;amp; deps_wo$strong
pca_bioc &amp;lt;- prcomp(deps_wo[keep, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_bioc)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4913 1.3703 0.8495 0.33584 0.25281
## Proportion of Variance 0.4448 0.3755 0.1443 0.02256 0.01278
## Cumulative Proportion  0.4448 0.8203 0.9647 0.98722 1.00000
pca_data_strong &amp;lt;- cbind(pca_bioc$x, deps_wo[keep, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Packages in Bioconductor&amp;quot;, 
       subtitle = &amp;quot;PCA of numeric variables of strong dependencies&amp;quot;,
       col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-bioc&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-bioc-1.png&#34; alt=&#34;PCA of packages on Bioconductor.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 9: PCA of packages on Bioconductor.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;GenomeInfoDb is the package that seems most important, and it only uses the &lt;code&gt;RCurl::getURL&lt;/code&gt; function.&lt;/p&gt;
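&lt;p&gt;Such a narrow dependency is also cheap to drop; a minimal sketch, assuming the call is only used to fetch the content of a URL, of replacing &lt;code&gt;RCurl::getURL&lt;/code&gt; with base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# RCurl::getURL(url) returns the body as a single string;
# base R can do the same without the extra dependency
get_url &amp;lt;- function(url) {
  paste(readLines(url, warn = FALSE), collapse = &amp;quot;\n&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;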
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;outro&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Outro&lt;/h2&gt;
&lt;p&gt;I wanted to explore a bit how these packages got into this position &lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps |&amp;gt; 
  filter(strong) |&amp;gt; 
  ggplot() +
  geom_vline(xintercept = as.Date(&amp;quot;2013-06-15&amp;quot;), linetype = 2) +
  geom_point(aes(first_release, downloads, col = type, shape = type, 
                 size = first_deps_strong_n)) +
  geom_label(aes(first_release, downloads, label = package),
             data = filter(deps, package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;)), show.legend = FALSE) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  annotate(&amp;quot;text&amp;quot;, x = as.Date(&amp;quot;2014-6-15&amp;quot;), y = 5*10^5, 
           label = &amp;quot;CRAN maintained&amp;quot;, hjust = 0) +
  labs(x = &amp;quot;Release date&amp;quot;, y = &amp;quot;Downloads&amp;quot;, 
       title = &amp;quot;More packages added after CRAN maintenance than before&amp;quot;,
       subtitle = &amp;quot;Release date and downloads&amp;quot;,
       col = &amp;quot;Depends on&amp;quot;, shape = &amp;quot;Depends on&amp;quot;, size = &amp;quot;Direct strong dependencies&amp;quot;) 
## Warning: Removed 34 rows containing missing values (`geom_point()`).&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:deps-time&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/deps-time-1.png&#34; alt=&#34;First release of packages in relation to the maintenance by CRAN of XML and RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 10: First release of packages in relation to the maintenance by CRAN of XML and RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The CRAN team has now been maintaining these packages for almost as long as the previous maintainer(s?) did.&lt;/p&gt;
&lt;p&gt;Next, we look at the dependencies added after CRAN started maintaining them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summarize(deps_wo,
          before = sum(first_release &amp;lt;= as.Date(&amp;quot;2013-06-15&amp;quot;), na.rm = TRUE), 
          later = sum(first_release &amp;gt; as.Date(&amp;quot;2013-06-15&amp;quot;), na.rm = TRUE),
          .by = type)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;before&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;later&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;XML&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;63&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;156&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More dependent packages have been released after CRAN took over maintenance than before.
Maybe package authors trusted the CRAN team with their dependencies, or there was no other alternative for the functionality.
This might also be explained by the expansion of CRAN (and Bioconductor), with more packages being added each day.
However, this places further pressure on the CRAN team to maintain those packages. Removing this burden might free up more of their time to dedicate to CRAN.&lt;/p&gt;
&lt;p&gt;A replacement for XML could be &lt;a href=&#34;https://cran.r-project.org/package=xml2&#34;&gt;xml2&lt;/a&gt;, first released in 2015 (which uses the same system dependency libxml2).&lt;br /&gt;
A replacement for RCurl could be &lt;a href=&#34;https://cran.r-project.org/package=curl&#34;&gt;curl&lt;/a&gt;, first released at the end of 2014 (which uses the same system dependency libcurl).&lt;/p&gt;
&lt;p&gt;Until their release there was no replacement for these packages (if there are other packages, please let me know).
It is not clear to me whether xml2 and curl could already replace XML and RCurl at their first release.&lt;/p&gt;
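&lt;p&gt;As a rough sketch of what a migration could look like (a hypothetical example; the exact equivalences would need to be evaluated case by case):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical migration sketch; the equivalences need checking case by case
# Parsing an XML document: XML vs xml2 (both rely on libxml2)
doc_old &amp;lt;- XML::xmlParse(&amp;quot;file.xml&amp;quot;)
doc_new &amp;lt;- xml2::read_xml(&amp;quot;file.xml&amp;quot;)
# Fetching a URL: RCurl vs curl (both rely on libcurl)
txt_old &amp;lt;- RCurl::getURL(&amp;quot;https://cran.r-project.org&amp;quot;)
txt_new &amp;lt;- rawToChar(curl::curl_fetch_memory(&amp;quot;https://cran.r-project.org&amp;quot;)$content)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Simple calls like these map fairly directly; the harder question is the long tail of options and edge cases each package supports.&lt;/p&gt;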
&lt;p&gt;This highlights the importance of properly replacing packages in the community.
A recent example is the effort taken by the &lt;a href=&#34;https://r-spatial.org/&#34;&gt;spatial community&lt;/a&gt;, led by Roger Bivand and Edzer Pebesma,
where packages have been carefully designed and planned to replace older packages that will soon be retired.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;recomendations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Recommendations&lt;/h1&gt;
&lt;p&gt;As final recommendations, I think one should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Disentangle the XML and RCurl circular dependency.&lt;/li&gt;
&lt;li&gt;Evaluate whether the xml2 and curl packages provide enough functionality to replace XML and RCurl respectively.
If not, see what should be added to them, or how to develop alternative packages to fill the gap.&lt;br /&gt;
A migration guide from XML and RCurl to the alternatives could be written to ease the transition and to evaluate whether the functionality is covered by these packages.&lt;/li&gt;
&lt;li&gt;Contact package maintainers so they replace the functionality for which they currently depend on XML and RCurl, as seen in figure &lt;a href=&#34;#fig:plot-maintainers&#34;&gt;4&lt;/a&gt;, as well as the maintainers of the packages seen in figures &lt;a href=&#34;#fig:pca-all&#34;&gt;5&lt;/a&gt;, &lt;a href=&#34;#fig:pca-strong&#34;&gt;6&lt;/a&gt;, &lt;a href=&#34;#fig:pca-cran&#34;&gt;8&lt;/a&gt;, and &lt;a href=&#34;#fig:pca-bioc&#34;&gt;9&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Set deprecation warnings on the XML and RCurl packages.&lt;/li&gt;
&lt;li&gt;Archive XML and RCurl packages in CRAN.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This might take years of moving packages around, but I am confident that once the word is out, package developers will avoid XML and RCurl, and current maintainers that depend on them will replace them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;On 2024/01/22 the &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-package-devel/2024q1/010359.html&#34;&gt;CRAN team asked for a maintainer of XML&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  C
##  ctype    C
##  tz       Europe/Madrid
##  date     2024-01-22
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package       * version     date (UTC) lib source
##  Biobase         2.62.0      2023-10-24 [1] Bioconductor
##  BiocFileCache   2.10.1      2023-10-26 [1] Bioconductor
##  BiocGenerics    0.48.1      2023-11-01 [1] Bioconductor
##  BiocManager     1.30.22     2023-08-08 [1] CRAN (R 4.3.1)
##  BiocPkgTools  * 1.20.0      2023-10-24 [1] Bioconductor
##  biocViews       1.70.0      2023-10-24 [1] Bioconductor
##  bit             4.0.5       2022-11-15 [1] CRAN (R 4.3.1)
##  bit64           4.0.5       2020-08-30 [1] CRAN (R 4.3.1)
##  bitops          1.0-7       2021-04-24 [1] CRAN (R 4.3.1)
##  blob            1.2.4       2023-03-17 [1] CRAN (R 4.3.1)
##  blogdown        1.18        2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown        0.37        2023-12-01 [1] CRAN (R 4.3.1)
##  bslib           0.6.1       2023-11-28 [1] CRAN (R 4.3.1)
##  cachem          1.0.8       2023-05-01 [1] CRAN (R 4.3.1)
##  cli             3.6.2       2023-12-11 [1] CRAN (R 4.3.1)
##  codetools       0.2-19      2023-02-01 [2] CRAN (R 4.3.1)
##  colorspace      2.1-0       2023-01-23 [1] CRAN (R 4.3.1)
##  cranlogs      * 2.1.1       2019-04-29 [1] CRAN (R 4.3.1)
##  crul            1.4.0       2023-05-17 [1] CRAN (R 4.3.1)
##  curl            5.2.0       2023-12-08 [1] CRAN (R 4.3.1)
##  DBI             1.2.1       2024-01-12 [1] CRAN (R 4.3.1)
##  dbplyr          2.4.0       2023-10-26 [1] CRAN (R 4.3.2)
##  digest          0.6.34      2024-01-11 [1] CRAN (R 4.3.1)
##  dplyr         * 1.1.4       2023-11-17 [1] CRAN (R 4.3.1)
##  DT              0.31        2023-12-09 [1] CRAN (R 4.3.1)
##  evaluate        0.23        2023-11-01 [1] CRAN (R 4.3.2)
##  fansi           1.0.6       2023-12-08 [1] CRAN (R 4.3.1)
##  farver          2.1.1       2022-07-06 [1] CRAN (R 4.3.1)
##  fastmap         1.1.1       2023-02-24 [1] CRAN (R 4.3.1)
##  fauxpas         0.5.2       2023-05-03 [1] CRAN (R 4.3.1)
##  filelock        1.0.3       2023-12-11 [1] CRAN (R 4.3.1)
##  generics        0.1.3       2022-07-05 [1] CRAN (R 4.3.1)
##  ggplot2       * 3.4.4       2023-10-12 [1] CRAN (R 4.3.1)
##  ggrepel       * 0.9.5       2024-01-10 [1] CRAN (R 4.3.1)
##  gh              1.4.0       2023-02-22 [1] CRAN (R 4.3.1)
##  glue            1.7.0       2024-01-09 [1] CRAN (R 4.3.1)
##  graph           1.80.0      2023-10-24 [1] Bioconductor
##  gtable          0.3.4       2023-08-21 [1] CRAN (R 4.3.1)
##  highr           0.10        2022-12-22 [1] CRAN (R 4.3.1)
##  hms             1.1.3       2023-03-21 [1] CRAN (R 4.3.1)
##  htmltools       0.5.7       2023-11-03 [1] CRAN (R 4.3.2)
##  htmlwidgets   * 1.6.4       2023-12-06 [1] CRAN (R 4.3.1)
##  httpcode        0.3.0       2020-04-10 [1] CRAN (R 4.3.1)
##  httr            1.4.7       2023-08-15 [1] CRAN (R 4.3.1)
##  igraph          1.6.0       2023-12-11 [1] CRAN (R 4.3.1)
##  jquerylib       0.1.4       2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite        1.8.8       2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         * 1.45        2023-10-30 [1] CRAN (R 4.3.2)
##  labeling        0.4.3       2023-08-29 [1] CRAN (R 4.3.2)
##  lifecycle       1.0.4       2023-11-07 [1] CRAN (R 4.3.2)
##  magrittr        2.0.3       2022-03-30 [1] CRAN (R 4.3.1)
##  memoise         2.0.1       2021-11-26 [1] CRAN (R 4.3.1)
##  munsell         0.5.0       2018-06-12 [1] CRAN (R 4.3.1)
##  pillar          1.9.0       2023-03-22 [1] CRAN (R 4.3.1)
##  pkgconfig       2.0.3       2019-09-22 [1] CRAN (R 4.3.1)
##  purrr           1.0.2       2023-08-10 [1] CRAN (R 4.3.1)
##  R6              2.5.1       2021-08-19 [1] CRAN (R 4.3.1)
##  RBGL            1.78.0      2023-10-24 [1] Bioconductor
##  Rcpp            1.0.12      2024-01-09 [1] CRAN (R 4.3.1)
##  RCurl           1.98-1.14   2024-01-09 [1] CRAN (R 4.3.1)
##  readr           2.1.5       2024-01-10 [1] CRAN (R 4.3.1)
##  rlang           1.1.3       2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown       2.25        2023-09-18 [1] CRAN (R 4.3.1)
##  rorcid          0.7.0       2021-01-20 [1] CRAN (R 4.3.1)
##  RSQLite         2.3.5       2024-01-21 [1] CRAN (R 4.3.1)
##  rstudioapi      0.15.0      2023-07-07 [1] CRAN (R 4.3.1)
##  RUnit           0.4.32      2018-05-18 [1] CRAN (R 4.3.1)
##  rvest           1.0.3       2022-08-19 [1] CRAN (R 4.3.1)
##  sass            0.4.8       2023-12-06 [1] CRAN (R 4.3.1)
##  scales          1.3.0       2023-11-28 [1] CRAN (R 4.3.1)
##  sessioninfo     1.2.2       2021-12-06 [1] CRAN (R 4.3.1)
##  stringi         1.8.3       2023-12-11 [1] CRAN (R 4.3.1)
##  stringr         1.5.1       2023-11-14 [1] CRAN (R 4.3.1)
##  tibble          3.2.1       2023-03-20 [1] CRAN (R 4.3.1)
##  tidyselect      1.2.0       2022-10-10 [1] CRAN (R 4.3.1)
##  tzdb            0.4.0       2023-05-12 [1] CRAN (R 4.3.1)
##  utf8            1.2.4       2023-10-22 [1] CRAN (R 4.3.2)
##  vctrs           0.6.5       2023-12-01 [1] CRAN (R 4.3.1)
##  whisker         0.4.1       2022-12-05 [1] CRAN (R 4.3.1)
##  withr           3.0.0       2024-01-16 [1] CRAN (R 4.3.1)
##  xfun            0.41        2023-11-01 [1] CRAN (R 4.3.2)
##  XML             3.99-0.16.1 2024-01-22 [1] CRAN (R 4.3.1)
##  xml2            1.3.6       2023-12-04 [1] CRAN (R 4.3.1)
##  yaml            2.3.8       2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes footnotes-end-of-document&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;the &lt;code&gt;maintainer&lt;/code&gt; function only works for installed packages, and I don’t have all these packages installed.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Both logs only count those of their repository and not from other mirrors or approaches (RSPM, bspm, r2u, ….).&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;I recently found this word used as the opposite of introduction/intro.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>experDesign: follow up</title>
  <link>https://llrs.dev/post/2023/04/09/experdesign-follow-up/</link>
  <pubDate>Sun, 09 Apr 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/04/09/experdesign-follow-up/</guid>
  <description>


&lt;p&gt;I am happy to announce a new release of experDesign.
Install it from CRAN with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;experDesign&amp;quot;)
library(&amp;quot;experDesign&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This new release focuses on trickier aspects of designing an experiment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Checking the samples of your experiment.&lt;/li&gt;
&lt;li&gt;How to continue stratifying your conditions after some initial batch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These functions should be used once you have collected your samples, before carrying out anything else.
They help you make an informed decision about what might happen with your experiment.&lt;/p&gt;
&lt;div id=&#34;checking-your-samples&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Checking your samples&lt;/h1&gt;
&lt;p&gt;The new function &lt;code&gt;check_data()&lt;/code&gt; will warn you if it finds some known issues with your data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;experDesign&amp;quot;)
library(&amp;quot;MASS&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we take the survey dataset from the MASS package we can see that it has some issues:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(survey, package = &amp;quot;MASS&amp;quot;)
check_data(survey)
## Warning: Two categorical variables don&amp;#39;t have all combinations.
## Warning: Some values are missing.
## Warning: There is a combination of categories with no replicates; i.e. just one
## sample.
## [1] FALSE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we fabricate our own dataset, we might also realize we have a problem:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rdata &amp;lt;- expand.grid(sex = c(&amp;quot;M&amp;quot;, &amp;quot;F&amp;quot;), class = c(&amp;quot;lower&amp;quot;, &amp;quot;median&amp;quot;, &amp;quot;high&amp;quot;))
stopifnot(&amp;quot;Same samples/rows as combinations of classes&amp;quot; = nrow(rdata) == 2*3)
check_data(rdata)
## Warning: There is a combination of categories with no replicates; i.e. just one
## sample.
## [1] FALSE
# We create some new samples with the same conditions
rdata2 &amp;lt;- rbind(rdata, rdata)
check_data(rdata2)
## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One might decide to go ahead with what is available, use only some of those samples, or wait to collect more samples for the experiment.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;follow-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Follow up&lt;/h1&gt;
&lt;p&gt;Imagine you have 100 samples that you distribute in 4 batches of 25 samples each.
Later, you collect 80 more samples to analyze.
You want these new samples to be analyzed together with those previous 100 samples.
Will it be possible? How should you distribute your new samples in groups of 25?&lt;/p&gt;
&lt;p&gt;Using the same dataset from &lt;code&gt;MASS&lt;/code&gt;, imagine we first collected 118 observations and later 119 more:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;survey1 &amp;lt;- survey[1:118, ]
survey2 &amp;lt;- survey[119:nrow(survey), ]
# Using a low number of iterations to speed up the process;
# you should use an even higher number than the default
fu &amp;lt;- follow_up(survey1, survey2, size_subset = 50, iterations = 10)
## Warning: There are some problems with the data.
## Warning: There are some problems with the new samples and the batches.
## Warning: There are some problems with the new data.
## Warning: There are some problems with the old data.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with the previous new function, it reports whether there are problems with the observations.
One can check each collection with &lt;code&gt;check_data&lt;/code&gt; to learn more about the problems found.&lt;/p&gt;
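&lt;p&gt;For instance, each collection can be inspected separately (a minimal sketch reusing the objects defined above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Check each collection of observations on its own
check_data(survey1)
check_data(survey2)&lt;/code&gt;&lt;/pre&gt;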
&lt;p&gt;If you have already performed the experiment on your observations you can also check the distribution:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create the first batch
variables &amp;lt;- c(&amp;quot;Sex&amp;quot;, &amp;quot;Smoke&amp;quot;, &amp;quot;Age&amp;quot;)
survey1 &amp;lt;- survey1[, variables]
index1 &amp;lt;- design(survey1, size_subset = 50, iterations = 10)
## Warning: There might be some problems with the data use check_data().
r_survey &amp;lt;- inspect(index1, survey1)
# Create the second batch with &amp;quot;new&amp;quot; students
survey2 &amp;lt;- survey2[, variables]
survey2$batch &amp;lt;- NA
# Prepare the follow up
all_classroom &amp;lt;- rbind(r_survey, survey2)
fu2 &amp;lt;- follow_up2(all_classroom, size_subset = 50, iterations = 10)
## Warning: There are some problems with the data.
## Warning: There are some problems with the new samples and the batches.
## Warning: There are some problems with the new data.
## Warning: There are some problems with the old data.
tail(fu2)
## [1] &amp;quot;NewSubset2&amp;quot; &amp;quot;NewSubset2&amp;quot; &amp;quot;NewSubset2&amp;quot; &amp;quot;NewSubset2&amp;quot; &amp;quot;NewSubset2&amp;quot;
## [6] &amp;quot;NewSubset3&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this function helps decide which new observations go into which new batches.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;closing-remarks&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Closing remarks&lt;/h1&gt;
&lt;p&gt;The famous quote from Fisher goes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“To consult the statistician after an experiment is finished is often merely to ask him to conduct a &lt;em&gt;post mortem&lt;/em&gt; examination. He can perhaps say what the experiment died of.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This emphasizes the importance of involving a statistician early on in the experimental design process.&lt;br /&gt;
Unfortunately, in some cases it may be too late to involve a statistician, or unforeseen circumstances may have messed up the design of your carefully planned experiment.&lt;/p&gt;
&lt;p&gt;My aim with this package is to provide practical tools for statisticians, bioinformaticians, and anyone who works with data.
These tools are designed to be easy to use and can be used to analyze data in a variety of contexts.
Let me know if it is helpful in your case.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.2 (2022-10-31)
##  os       Ubuntu 22.04.2 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2023-04-09
##  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version  date (UTC) lib source
##  blogdown      1.16     2022-12-13 [1] CRAN (R 4.2.2)
##  bookdown      0.33     2023-03-06 [1] CRAN (R 4.2.2)
##  bslib         0.4.2    2022-12-16 [1] CRAN (R 4.2.2)
##  cachem        1.0.7    2023-02-24 [1] CRAN (R 4.2.2)
##  cli           3.6.1    2023-03-23 [1] CRAN (R 4.2.2)
##  digest        0.6.31   2022-12-11 [1] CRAN (R 4.2.2)
##  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.2)
##  experDesign * 0.2.0    2023-04-05 [1] CRAN (R 4.2.2)
##  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.2.2)
##  htmltools     0.5.4    2022-12-07 [1] CRAN (R 4.2.2)
##  jquerylib     0.1.4    2021-04-26 [1] CRAN (R 4.2.2)
##  jsonlite      1.8.4    2022-12-06 [1] CRAN (R 4.2.2)
##  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.2)
##  MASS        * 7.3-58.1 2022-08-03 [2] CRAN (R 4.2.2)
##  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.2)
##  rlang         1.1.0    2023-03-14 [1] CRAN (R 4.2.2)
##  rmarkdown     2.20     2023-01-19 [1] CRAN (R 4.2.2)
##  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.2)
##  sass          0.4.5    2023-01-24 [1] CRAN (R 4.2.2)
##  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.2.2)
##  xfun          0.37     2023-01-31 [1] CRAN (R 4.2.2)
##  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.2)
## 
##  [1] /home/lluis/bin/R/4.2.2
##  [2] /opt/R/4.2.2/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>Releasing rtweet 1.2.0</title>
  <link>https://llrs.dev/post/2023/03/20/rtweet-starts-using-api-v2/</link>
  <pubDate>Mon, 20 Mar 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/03/20/rtweet-starts-using-api-v2/</guid>
  <description>


&lt;p&gt;I’m very excited to announce that rtweet 1.2.0 is now available on GitHub! Install it by running:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::install_github(&amp;quot;ropensci/rtweet&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then load it in a fresh session with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rtweet)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;new-features&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;New features&lt;/h1&gt;
&lt;p&gt;This version adds many new endpoints to retrieve data from Twitter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;From lists&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From tweets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;About users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, about statistics of your own content.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can read about them in the &lt;a href=&#34;https://docs.ropensci.org/rtweet/news/index.html&#34;&gt;NEWS&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;authentication&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Authentication&lt;/h1&gt;
&lt;p&gt;Besides fixing a problem preventing new users from using &lt;code&gt;auth_setup_default()&lt;/code&gt;, this release includes a new authentication mechanism.&lt;/p&gt;
&lt;p&gt;Some endpoints require a new authentication method not previously used by rtweet.
This authentication mechanism requires setting up a client.&lt;br /&gt;
To support it, I have added some functions to create, save, and use it, modelled on the &lt;code&gt;auth_*&lt;/code&gt; functions.
There is now one client provided by rtweet if you don’t want to configure your own:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;client_setup_default()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, I briefly expanded the authentication vignette (&lt;code&gt;vignette(&#34;auth&#34;, &#34;rtweet&#34;)&lt;/code&gt;) to include a section about how to obtain the required credentials.
Once you get them, it is pretty straightforward:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auth_oauth2 &amp;lt;- rtweet_oauth2(app = &amp;quot;my_awesome_app&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This mechanism is required by some functions which are of special interest: &lt;code&gt;user_self()&lt;/code&gt;, &lt;code&gt;tweet_bookmarked()&lt;/code&gt;, &lt;code&gt;user_blocked()&lt;/code&gt;, and &lt;code&gt;user_timeline()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note that due to upstream reasons, the authentication is only valid for 2 hours.
You will be asked to approve the client again after the 2 hours (and save it again!).&lt;/p&gt;
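&lt;p&gt;A minimal sketch of saving the token so it can be reused within its 2-hour lifetime (the &lt;code&gt;auth_save()&lt;/code&gt; call and the name used here are assumptions; check the auth vignette for the exact workflow):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Save the OAuth2 token while it is still valid
auth_save(auth_oauth2, &amp;quot;my_oauth2&amp;quot;)
# In a later session, load it again by name
auth_as(&amp;quot;my_oauth2&amp;quot;)&lt;/code&gt;&lt;/pre&gt;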
&lt;p&gt;We can set the authentication as we usually do:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auth_as(auth_oauth2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And start retrieving our data from Twitter!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;new-functions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;New functions&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;me &amp;lt;- user_self()
bookmarked &amp;lt;- user_bookmarks(me$id, n = 120)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;rtweet will make as many requests as needed and automatically paginate the results.
However, if you try this, you might realize that the queries are slow.
This is due to the rate limits imposed by Twitter.&lt;/p&gt;
&lt;p&gt;If you want to keep track of the progress of your query, you can use &lt;code&gt;verbose = TRUE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;blocked &amp;lt;- user_blocked(me$id, n = Inf, verbose = TRUE)
timeline &amp;lt;- user_timeline(me$id, n = 800, verbose = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It will also store the data of the requests in a temporary file, so if you lose the connection you can still recover it.&lt;/p&gt;
&lt;p&gt;Some endpoints have a length limit on the accepted input:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bioconductor &amp;lt;- user_by_username(&amp;quot;Bioconductor&amp;quot;)
bioconductor_followers &amp;lt;- user_followers(bioconductor$id, n = 200)
us &amp;lt;- user_search(ids = bioconductor_followers$id)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Errors are, in principle, easier to understand in these new functions, thanks to the messages provided:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user_blocked(bioconductor$id, n = Inf, verbose = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;other&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other&lt;/h1&gt;
&lt;p&gt;These new endpoints provide access to a lot of data, of which only the default information is converted to a nice table.
If you request more data via expansions and fields (replies, information about the author of a tweet, …), you will have to wait for the next release to get it parsed.&lt;br /&gt;
You can already retrieve it with &lt;code&gt;parse = FALSE&lt;/code&gt;.
My intention was to provide more parsing support in this release, but I think it is better to make releases more often.&lt;/p&gt;
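&lt;p&gt;A minimal sketch of retrieving the unparsed data (the &lt;code&gt;n&lt;/code&gt; value here is illustrative; &lt;code&gt;parse = FALSE&lt;/code&gt; returns the raw response for you to explore):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Request bookmarks without converting them to a table
raw &amp;lt;- tweet_bookmarked(n = 10, parse = FALSE)
str(raw, max.level = 1)&lt;/code&gt;&lt;/pre&gt;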
&lt;p&gt;The API also provides an endpoint to check whether stored data is compliant with the Terms of Service.
I started working on these endpoints right after the streaming endpoints because they are important.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;side-story&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Side story&lt;/h1&gt;
&lt;p&gt;With the deprecation of the streaming endpoints and the function &lt;code&gt;stream_tweets&lt;/code&gt; I implemented the first three functions using Twitter’s API v2.
They use a bearer token as authentication mechanism.&lt;/p&gt;
&lt;p&gt;Many endpoints of API v2 also use this authentication mechanism and made it easy to support them.&lt;/p&gt;
&lt;p&gt;But there was a request to retrieve bookmarked tweets.
That endpoint required the OAuth2 mechanism and didn’t allow the use of the bearer token.
It is relevant because bookmarks are not included in the data dump you can request from Twitter.
This endpoint is the only automated way to retrieve them if you used them!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;final&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Final&lt;/h1&gt;
&lt;p&gt;This is the last update of rtweet.
The &lt;a href=&#34;https://twitter.com/TwitterDev/status/1641222782594990080&#34;&gt;new API plans&lt;/a&gt; make it impossible to continue developing and testing software like rtweet without substantial financial investment (at least USD$100/month).&lt;/p&gt;
&lt;p&gt;More importantly, this will restrict who can use the package.
I think the few users still using rtweet might be able to afford to pay for support or for the development of new features.
If you are one of them, you can &lt;a href=&#34;https://www.buymeacoffee.com/llrs&#34;&gt;sponsor my work&lt;/a&gt; on rtweet.&lt;/p&gt;
&lt;p&gt;I will remove the package from CRAN one month after the new API comes into effect (~ 1st July).
Farewell Twitter.&lt;/p&gt;
&lt;/div&gt;
</description>
  </item>
  
<item>
  <title>rtweet future</title>
  <link>https://llrs.dev/post/2023/02/16/rtweet-future/</link>
  <pubDate>Thu, 16 Feb 2023 00:00:00 +0000</pubDate>
  
<guid>https://llrs.dev/post/2023/02/16/rtweet-future/</guid>
  <description>


&lt;div id=&#34;background-how-i-became-the-maintainer-of-rtweet&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background: how I became the maintainer of rtweet&lt;/h2&gt;
&lt;p&gt;I didn’t want to maintain rtweet.
It might sound strange coming from its maintainer, but I didn’t want the responsibility of writing software that 12k people install monthly.
My offer was always to help with it so that the users could benefit from improvements and bug fixes to the package.
I initially thought that having permissions to close issues and label them would help the community and the maintainer.&lt;/p&gt;
&lt;p&gt;I was not totally altruistic.
As &lt;a href=&#34;https://ropensci.org/blog/2022/10/17/maintain-or-co-maintain-an-ropensci-package/&#34;&gt;recently recommended by Maëlle and Steffi&lt;/a&gt;, I had some interest in the package.
I had a bot posting some plots daily.
I wanted to have alt text in the tweets.
There was a pending PR for just that.
I could fork and use my own version or help with the package.&lt;/p&gt;
&lt;p&gt;I got in touch with rOpenSci about the package.
After some time waiting to hear back from the author of the package (many thanks, &lt;a href=&#34;https://mikewk.com/&#34;&gt;Michael W. Kearney&lt;/a&gt;!), I was given edit permissions to the repository (under the helping eye of Scott Chamberlain).
When I got &lt;a href=&#34;https://github.com/ropensci/rtweet/issues/471&#34;&gt;permission from rOpenSci&lt;/a&gt;, there were 167 issues and PRs open.&lt;/p&gt;
&lt;p&gt;As I started going through them and changing the source code, people showed up to help (thanks!).
New contributors helped close issues via new PRs, or simply advised about naming functions or &lt;a href=&#34;https://github.com/ropensci/rtweet/issues/445#issuecomment-790042423&#34;&gt;asked about its future&lt;/a&gt;.
Another developer gained access to the repository and contributed their expertise (resulting in some &lt;a href=&#34;https://twitter.com/hadleywickham/status/1365661250269683713&#34;&gt;funny moments&lt;/a&gt;).
Despite all the (breaking) changes, we tried to make the package easier to maintain.
After waiting some time for further community feedback, I released a new version through CRAN.&lt;/p&gt;
&lt;p&gt;Shortly after, for some reason my bot got fewer views and less engagement.
It made me realize that it might be an opportunity to start a professional project around the content this bot produced.
In addition, although I learn a lot from online communities, I decided to connect more with the local community.
I am currently involved with re-launching the &lt;a href=&#34;https://barcelonar.org&#34;&gt;Barcelona R user group&lt;/a&gt; and organizing the &lt;a href=&#34;http://r-es.org/2023/02/09/las-xiii-jornadas-congreso-r-hispano-2023-seran-en-barcelona/&#34;&gt;Spanish R conference&lt;/a&gt;.
These activities take time and energy away from package maintenance (not just rtweet).
However, &lt;strong&gt;my support for the users of rtweet remains&lt;/strong&gt;.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/allison_community_support.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The community support I received was great. I hope to pass it on. CC-BY-4.0 &lt;a href=&#34;https://twitter.com/allison_horst&#34;&gt;artwork by &lt;span class=&#34;citation&#34;&gt;@allison_horst&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;current-situation-supporting-the-v1-and-adding-support-to-the-v2-api&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Current situation: supporting the v1 and adding support to the v2 API&lt;/h2&gt;
&lt;p&gt;To maintain rtweet I needed to understand and adapt to Twitter’s API (v1) documentation.
The documentation didn’t always match the actual behavior.
rtweet follows the API tightly, so any change might break the package.
For instance, the addition of the edit field broke the parsing of tweets.&lt;/p&gt;
&lt;p&gt;At the same time, one of the oldest issues in rtweet is moving to the new API v2.
With the 1.0 release, I decided it was time to stop adding features relying on the v1 API.
It was clear that Twitter was moving to deprecate and replace it with the newer API.
The new API has many benefits for developers and users.
But I realized that I should still fix bugs related to the v1 endpoints.&lt;/p&gt;
&lt;p&gt;The 1.1 release has recently provided some bug fixes and support for the v2 API.
There is also an outstanding issue preventing new users from connecting to Twitter (fixed on GitHub).
The streaming endpoint in v1 stopped working before the announced date, so the new release included support for the replacement endpoint in v2.&lt;/p&gt;
&lt;p&gt;As you may have guessed, between releases I had been thinking about and working on support for API v2.
The foundations to support the new endpoints were already set and easily expanded to new ones.
In the development version of the package (in the devel branch), it is possible to retrieve bookmarks, retrieve the archive if you have academic access, and use other endpoints.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;going-forward-supporting-rtweet-users-of-v2-api&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Going forward: supporting rtweet users of v2 API&lt;/h2&gt;
&lt;p&gt;Recently (~3 weeks ago), there have been &lt;a href=&#34;https://twitter.com/TwitterDev&#34;&gt;announcements&lt;/a&gt; of future changes to who will be able to access the API and how.
Simultaneously, there have been unannounced restrictions affecting other tools using the API.
These changes and announcements are driven by the need for money to sustain Twitter.
Maintaining rtweet with its current functionality might therefore require funds.&lt;/p&gt;
&lt;p&gt;I will continue supporting rtweet and the freely accessible endpoints.
But given that I will have less time and energy for rtweet, I am looking for a &lt;strong&gt;&lt;em&gt;co&lt;/em&gt;-maintainer&lt;/strong&gt; to help me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support new endpoints, use httr2, test an API in CI, …&lt;/li&gt;
&lt;li&gt;Review changes to avoid new bugs.&lt;/li&gt;
&lt;li&gt;Help with the issues and questions that the transition to API v2 and the current uncertainty bring.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Get in touch with me via &lt;a href=&#34;https://github.com/ropensci/rtweet/issues/763&#34;&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;From time to time, I receive bug reports and requests from premium API users.
Currently, premium users get access to more data, elevated request rates and other endpoints.
Helping them is usually challenging, and developing support for these endpoints will be difficult.
To continue supporting these users I’ve set up a &lt;a href=&#34;https://www.buymeacoffee.com/llrs&#34;&gt;☕ buymeacoffee&lt;/a&gt;, and my co-maintainer and I will be open to consulting work about rtweet.
With these jobs and funding we will be able to support premium users and implement new endpoints for everyone.&lt;/p&gt;
&lt;p&gt;There is scarce information about the API changes and prices.
Changes might come suddenly.
We’ll do our best to keep up and inform all users.
I hope this will be a good way to &lt;strong&gt;continue supporting the package and the community&lt;/strong&gt; of users.
Let me know if you have other suggestions.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/decorated_r.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Decorating/contributing to R. CC-BY-4.0 &lt;a href=&#34;https://twitter.com/allison_horst&#34;&gt;artwork by &lt;span class=&#34;citation&#34;&gt;@allison_horst&lt;/span&gt;.&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Thanks Maëlle for your support and reviewing the post.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
  </item>
  
</channel>
  </rss>
