<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>CRAN | B101nfo</title>
    <link>https://llrs.dev/tags/cran/</link>
      <atom:link href="https://llrs.dev/tags/cran/index.xml" rel="self" type="application/rss+xml" />
    <description>CRAN</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>If it is code you can copy and reuse (MIT) if it is text, please cite and reuse CC-BY 2024.</copyright><lastBuildDate>Wed, 10 Jan 2024 00:00:00 +0000</lastBuildDate>
    <image>
      <url>img/map[gravatar:%!s(bool=false) shape:circle]</url>
      <title>CRAN</title>
      <link>https://llrs.dev/tags/cran/</link>
    </image>
    
    <item>
      <title>Submissions accepted on the first try</title>
      <link>https://llrs.dev/post/2024/01/10/submission-cran-first-try/</link>
      <pubDate>Wed, 10 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2024/01/10/submission-cran-first-try/</guid>
      <description>


&lt;p&gt;Recently someone on social media said that their submissions to CRAN never succeed on the first try.
In this post I’ll try to find out how common that is.&lt;/p&gt;
&lt;p&gt;First we need data on submissions to CRAN.
We can download the last 3 years of CRAN submissions thanks to &lt;a href=&#34;https://r-hub.github.io/cransays/articles/dashboard.html&#34;&gt;cransays&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cdh &amp;lt;- cransays::download_history()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is the bulk of the analysis of the history of package submissions.
This is explained in other posts, but basically I keep only one entry per package per snapshot, try to distinguish new submissions from changes to the same submission, and calculate some date-related variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;, warn.conflicts = FALSE)
library(&amp;quot;lubridate&amp;quot;, warn.conflicts = FALSE)
library(&amp;quot;tidyr&amp;quot;, warn.conflicts = FALSE)
diff0 &amp;lt;- structure(0, class = &amp;quot;difftime&amp;quot;, units = &amp;quot;hours&amp;quot;)
cran &amp;lt;- cdh |&amp;gt; 
  filter(!is.na(version)) |&amp;gt; 
  distinct() |&amp;gt; 
  arrange(package, snapshot_time) |&amp;gt; 
  group_by(package, snapshot_time) |&amp;gt; 
  # Remove some duplicated packages in different folders
  mutate(n = seq_len(n())) |&amp;gt; 
  filter(n == n()) |&amp;gt; 
  ungroup() |&amp;gt; 
  select(-n) |&amp;gt; 
  arrange(package, snapshot_time, version) |&amp;gt; 
  # Packages last seen in the queue less than 24 hours ago are considered the same submission
  # (even if their version number differs)
  mutate(diff_time = difftime(snapshot_time, lag(snapshot_time), units = &amp;quot;hour&amp;quot;),
         diff_time = if_else(is.na(diff_time), diff0, diff_time), # Fill NAs
         diff_v = version != lag(version),
         diff_v = if_else(is.na(diff_v), TRUE, diff_v), # Fill NAs
         near_t = abs(diff_time) &amp;lt;= 24,
         resubmission = !near_t | diff_v, 
         resubmission = if_else(resubmission == FALSE &amp;amp; diff_time == 0, 
                               TRUE, resubmission),
         resubmission_n = cumsum(as.numeric(resubmission)),
         new_version = !near(diff_time, 1, tol = 24) &amp;amp; diff_v, 
         new_version = if_else(new_version == FALSE &amp;amp; diff_time == 0, 
                               TRUE, new_version),
         submission_n = cumsum(as.numeric(new_version)), .by = package) |&amp;gt; 
  select(-diff_time, -diff_v, -new_version, -near_t) |&amp;gt; 
  mutate(version = package_version(version, strict = FALSE),
         date = as_date(snapshot_time))&lt;/code&gt;&lt;/pre&gt;
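&lt;p&gt;As a small illustration of the 24-hour heuristic above (a sketch with made-up snapshot times and versions), rows closer than 24 hours apart with the same version are treated as one submission:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;toy &amp;lt;- data.frame(
  package = &amp;quot;pkg&amp;quot;,
  snapshot_time = as.POSIXct(c(&amp;quot;2023-01-01 10:00&amp;quot;, &amp;quot;2023-01-01 20:00&amp;quot;,
                               &amp;quot;2023-01-05 10:00&amp;quot;), tz = &amp;quot;UTC&amp;quot;),
  version = c(&amp;quot;1.0&amp;quot;, &amp;quot;1.0&amp;quot;, &amp;quot;1.1&amp;quot;))
toy |&amp;gt; 
  dplyr::mutate(diff_time = difftime(snapshot_time, dplyr::lag(snapshot_time),
                                     units = &amp;quot;hour&amp;quot;),
                same_subm = abs(diff_time) &amp;lt;= 24 &amp;amp; version == dplyr::lag(version))
# The second snapshot is within 24 hours with the same version (same submission);
# the third is a new submission.&lt;/code&gt;&lt;/pre&gt;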
&lt;p&gt;Now we need to compare with the CRAN archive to know whether the submissions were accepted.&lt;/p&gt;
&lt;p&gt;First we need to retrieve the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_archive &amp;lt;- tools:::CRAN_archive_db()
# When row binding, data.frames that have only one row lose their row name:
# handle those cases to keep the version number:
archived &amp;lt;- vapply(cran_archive, NROW, numeric(1L))
names(cran_archive)[archived == 1L] &amp;lt;- vapply(cran_archive[archived == 1L], rownames, character(1L))
# Merge current and archive data
cran_dates &amp;lt;- do.call(rbind, cran_archive)
cran_dates$type &amp;lt;- &amp;quot;archived&amp;quot;
current &amp;lt;- tools:::CRAN_current_db()
current$type &amp;lt;- &amp;quot;available&amp;quot;
cran_h &amp;lt;- rbind(current, cran_dates)
# Keep minimal CRAN data archive
cran_h$pkg_v &amp;lt;- basename(rownames(cran_h))
rownames(cran_h) &amp;lt;- NULL
cda &amp;lt;- cran_h |&amp;gt; 
  mutate(strcapture(x = pkg_v, &amp;quot;^(.+)_([0-9]*.+).tar.gz$&amp;quot;, 
                    proto = data.frame(package = character(), version = character())),
         package = if_else(is.na(package), pkg_v, package)) |&amp;gt; 
  arrange(package, mtime) |&amp;gt; 
  mutate(acceptance_n = seq_len(n()), .by = package) |&amp;gt; 
  select(package, pkg_v, version, acceptance_n, date = mtime, uname, type) |&amp;gt; 
  mutate(date = as_date(date))&lt;/code&gt;&lt;/pre&gt;
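&lt;p&gt;The row-name handling above deserves a small, made-up illustration: when &lt;code&gt;rbind()&lt;/code&gt; combines data frames, one-row data frames can end up without their original row name, hence the renaming of the one-row entries beforehand:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;one &amp;lt;- data.frame(x = 1, row.names = &amp;quot;pkg_1.0.tar.gz&amp;quot;)
two &amp;lt;- data.frame(x = 2:3, row.names = c(&amp;quot;pkg_1.1.tar.gz&amp;quot;, &amp;quot;pkg_1.2.tar.gz&amp;quot;))
# Inspect the resulting row names to see which ones survive the row bind:
rownames(do.call(rbind, list(a = one, b = two)))&lt;/code&gt;&lt;/pre&gt;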
&lt;p&gt;We use &lt;code&gt;tools:::CRAN_current_db&lt;/code&gt; because &lt;code&gt;available.packages&lt;/code&gt; filters packages based on the OS and other options (see its &lt;code&gt;filters&lt;/code&gt; argument).&lt;/p&gt;
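&lt;p&gt;To see the difference (a sketch; the exact counts depend on your platform and options), compare the default filters with filtering disabled:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;repo &amp;lt;- &amp;quot;https://cloud.r-project.org&amp;quot;
# Default filters: &amp;quot;R_version&amp;quot;, &amp;quot;OS_type&amp;quot;, &amp;quot;subarch&amp;quot;, &amp;quot;duplicates&amp;quot;
ap_filtered &amp;lt;- available.packages(repos = repo)
# An empty filter list disables the filtering
ap_all &amp;lt;- available.packages(repos = repo, filters = list())
nrow(ap_all) - nrow(ap_filtered)&lt;/code&gt;&lt;/pre&gt;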
&lt;p&gt;We can make a quick detour to plot the number of accepted versions per package and when each package was first published:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
cdas &amp;lt;- cda |&amp;gt; 
  summarize(available = if_else(any(type == &amp;quot;available&amp;quot;), &amp;quot;available&amp;quot;, &amp;quot;archived&amp;quot;),
            published = min(date),
            n_published = max(acceptance_n),
            .by = package)

ggplot(cdas) + 
  geom_point(aes(published, n_published, col = available, shape = available)) +
  theme_minimal() +
  theme(legend.position = c(0.7, 0.8), legend.background = element_rect()) +
  labs(x = element_blank(), y = &amp;quot;Versions&amp;quot;, col = &amp;quot;Status&amp;quot;, shape = &amp;quot;Status&amp;quot;,
       title = &amp;quot;First publication of packages and versions published&amp;quot;) +
  scale_x_date(expand = expansion(), date_breaks = &amp;quot;2 years&amp;quot;, date_labels = &amp;quot;%Y&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/cran-published-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In summary, there are 6291 packages archived and 20304 available.
We can also spot a package with more than 150 versions that was later archived.&lt;/p&gt;
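&lt;p&gt;To identify that outlier (reusing the &lt;code&gt;cdas&lt;/code&gt; summary from the previous chunk), one can filter the packages with the most published versions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cdas |&amp;gt; 
  filter(n_published &amp;gt; 150) |&amp;gt; 
  arrange(desc(n_published))&lt;/code&gt;&lt;/pre&gt;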
&lt;p&gt;Now we can really compare the submission process with the CRAN archive:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_subm &amp;lt;- cran |&amp;gt; 
  summarise(
    resubmission_n = max(resubmission_n, na.rm = TRUE),
    submission_n = max(submission_n, na.rm = TRUE),
    # The number of submissions 
    submissions = resubmission_n - submission_n + 1,
    date = min(date),
    .by = c(&amp;quot;package&amp;quot;, &amp;quot;version&amp;quot;)) |&amp;gt; 
  arrange(package, version)
# Filter to those packages submitted in the period we have data
cda_acc &amp;lt;- cda |&amp;gt; 
  filter(date &amp;gt;= min(cran_subm$date)) |&amp;gt; 
  select(-pkg_v) |&amp;gt; 
  mutate(version = package_version(version, FALSE))

accepted_subm &amp;lt;- merge(cda_acc, cran_subm, by = c(&amp;quot;package&amp;quot;, &amp;quot;version&amp;quot;),
             suffixes = c(&amp;quot;.cran&amp;quot;, &amp;quot;.subm&amp;quot;), all = TRUE, sort = FALSE) |&amp;gt; 
  arrange(package, version, date.cran, date.subm) |&amp;gt; 
  mutate(submissions = if_else(is.na(submissions), 1, submissions),
         acceptance_n = if_else(is.na(acceptance_n), 0, acceptance_n))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can explore this data a little:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lp &amp;lt;- scales::label_percent(accuracy = 0.1)
accepted_subm |&amp;gt; 
  summarize(cransays = sum(!is.na(date.subm)),
            cran = sum(!is.na(date.cran)),
            missed_submissions = cran - cransays,
            percentaged_missed = lp(missed_submissions/cran))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;cransays&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;cran&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;missed_submissions&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentaged_missed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;46525&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50413&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3888&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This means that &lt;a href=&#34;https://r-hub.github.io/cransays/articles/dashboard.html&#34;&gt;cransays&lt;/a&gt;, the package used to archive this data, misses ~8% of submissions, probably because they were handled in less than an hour!
Another explanation is that for some periods the cransays bot didn’t work well…&lt;/p&gt;
&lt;p&gt;On the other hand, we can look at how long it takes for a version to be published on CRAN:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted_subm |&amp;gt; 
  filter(!is.na(date.cran)) |&amp;gt; 
  mutate(time_diff = difftime(date.cran, date.subm, units = &amp;quot;weeks&amp;quot;)) |&amp;gt;
  # Calculate the number of accepted packages since the recording of submissions
  mutate(accepted_n = acceptance_n - min(acceptance_n[acceptance_n != 0L], na.rm = TRUE) + 1, .by = package) |&amp;gt; 
  filter(time_diff &amp;gt;= 0) |&amp;gt; 
  ggplot() + 
  geom_point(aes(date.cran, time_diff, col = accepted_n)) +
  theme_minimal() +
  theme(legend.position = c(0.2, 0.8), legend.background = element_rect()) +
  labs(x = &amp;quot;Published on CRAN&amp;quot;, title = &amp;quot;Time since submitted to CRAN&amp;quot;, 
       y = &amp;quot;Weeks&amp;quot;, col = &amp;quot;Accepted&amp;quot;)
## Don&amp;#39;t know how to automatically pick scale for object of type &amp;lt;difftime&amp;gt;.
## Defaulting to continuous.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/accepted_subm-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I explored some of those outliers: one package was submitted in 2021 and resubmitted two years later with the same version number.
In other cases the submission was made with more than 1 hour of tolerance (see the “new_version” variable creation in the second code chunk).&lt;/p&gt;
&lt;p&gt;This means that the path to CRAN can be long and that developers do not always change the version number on each submission.&lt;/p&gt;
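&lt;p&gt;Those outliers can be inspected, for instance, by reusing &lt;code&gt;accepted_subm&lt;/code&gt; and keeping the versions that took more than a year from submission to publication (the 52-week cut-off is arbitrary):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted_subm |&amp;gt; 
  filter(!is.na(date.cran), !is.na(date.subm)) |&amp;gt; 
  mutate(time_diff = difftime(date.cran, date.subm, units = &amp;quot;weeks&amp;quot;)) |&amp;gt; 
  filter(time_diff &amp;gt; 52) |&amp;gt; 
  arrange(desc(time_diff)) |&amp;gt; 
  select(package, version, date.subm, date.cran, time_diff)&lt;/code&gt;&lt;/pre&gt;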
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This section is new after detecting problems with the way it was initially published.&lt;/p&gt;
&lt;p&gt;In the following function I calculate the number of submissions and similar information for each package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;count_submissions &amp;lt;- function(x) {
  x |&amp;gt; 
    mutate(submission_in_period = seq_len(n()),
           date.mix = pmin(date.cran, date.subm, na.rm = TRUE),
           .by = package, .after = acceptance_n) |&amp;gt; 
    summarise(
      # Number of accepted packages on CRAN
      total_accepted = sum(!is.na(date.cran), 0, na.rm = TRUE),
      # At minimum 0 through {cransays}
      through_cransays = sum(!is.na(date.subm), 0, na.rm = TRUE), 
      # In case same version number is submitted at different timepoints
      resubmissions = ifelse(any(!is.na(resubmission_n)), 
                              max(resubmission_n, na.rm = TRUE) - min(resubmission_n, na.rm = TRUE) - through_cransays, 0),
      resubmissions = if_else(resubmissions &amp;lt; 0L, 0L, resubmissions),
      # All submission + those that were duplicated on the submission system
      total_submissions = max(submission_in_period, na.rm = TRUE) + resubmissions,
      # The submissions that were not successful
      total_attempts = total_submissions - total_accepted,
      percentage_failed_submissions = lp(total_attempts/total_accepted), 
      .by = package)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I created a function so I can apply the same logic to whatever group I want to analyse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Another relevant edit is that the selection criteria changed, as I had missed some packages in some analyses and included others that shouldn’t have been.
Now we are ready to apply it to those packages whose first version got on CRAN:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;first_submissions &amp;lt;- accepted_subm |&amp;gt; 
  group_by(package) |&amp;gt; 
  # Keep submissions that were eventually accepted
  filter(length(acceptance_n != 0L) &amp;gt; 1L &amp;amp;&amp;amp; any(acceptance_n[acceptance_n != 0L] == 1)) |&amp;gt; 
  # Keep submissions until the first acceptance but not after
  filter(cumsum(acceptance_n) &amp;lt;= 1L &amp;amp; seq_len(n()) &amp;lt;= which(acceptance_n == 1L)) |&amp;gt; 
  ungroup()
ffs &amp;lt;- first_submissions |&amp;gt;   
  count_submissions() |&amp;gt; 
  count(total_attempts, sort = TRUE,  name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages, na.rm = TRUE)))
ffs&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;total_attempts&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3390&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;65.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1141&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;21.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;425&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;138&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;72&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;23&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This means that close to 35% of first-time submissions are rejected.
This includes packages that are not yet (or never will be?) on CRAN (~1000).&lt;/p&gt;
&lt;p&gt;This points to a problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;developers need to resubmit their packages and fix more issues.&lt;/li&gt;
&lt;li&gt;reviewers need to spend more time (approximately 50% of submissions are at one point or another handled by a human).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After this exercise we might wonder whether this happens only with new packages.&lt;br /&gt;
If we look at those submissions that are not the first version of a package, we find the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;submissions_with_accepted &amp;lt;- accepted_subm |&amp;gt; 
  # Filter those that were included on CRAN (not all submissions rejected)
  filter(any(acceptance_n &amp;gt;= 1), .by = package) |&amp;gt; 
  mutate(date.mix = pmin(date.cran, date.subm, na.rm = TRUE)) |&amp;gt; 
  group_by(package) |&amp;gt; 
  arrange(date.mix) |&amp;gt; 
  filter(
    # Those that start by 0 but next acceptance is 1 or higher
     cumsum(acceptance_n) &amp;gt;= 1L | 
       min(acceptance_n[acceptance_n != 0L], na.rm = TRUE) &amp;gt;= 2) |&amp;gt; 
  ungroup() 
fs_exp &amp;lt;- count_submissions(submissions_with_accepted)
fs_exp |&amp;gt; 
  count(more_failed = total_accepted &amp;gt; total_attempts, 
            sort = TRUE, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;more_failed&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15337&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;96.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;600&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Still, the majority of packages have more versions released than rejected submissions in the period analysed.
Failing the checks on CRAN is normal, but how many attempts does each version take?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggrepel&amp;quot;)
ggplot(fs_exp) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_count(aes(total_accepted, total_attempts)) +
  geom_label_repel(aes(total_accepted, total_attempts, label = package), data = . %&amp;gt;% filter(total_attempts &amp;gt;= 10)) +
  labs(x = &amp;quot;CRAN versions&amp;quot;, y = &amp;quot;Rejected submissions&amp;quot;,  size = &amp;quot;Packages&amp;quot;,
       title = &amp;quot;Packages after the first version&amp;quot;, subtitle = &amp;quot;for the period analyzed&amp;quot;) +
  scale_size(trans = &amp;quot;log10&amp;quot;) +
  theme_minimal() +
  theme(legend.position = c(0.8, 0.7), legend.background = element_rect())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/failed-exp-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that there are packages with more than 30 versions on CRAN in these 3 years that never had a rejected submission.
Congratulations!&lt;/p&gt;
&lt;p&gt;Others have a high number of rejected submissions and very few versions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fs_exp |&amp;gt; 
  count(total_attempts &amp;gt; total_accepted, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;total_attempts &amp;gt; total_accepted&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15792&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;99.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;145&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Close to 1% of packages required more rejected submissions than accepted versions, i.e. more than two submissions per version on average.&lt;/p&gt;
&lt;p&gt;Lastly, we can look at the overall experience for developers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fs &amp;lt;- count_submissions(accepted_subm)

ggplot(fs) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_count(aes(total_accepted, total_attempts)) +
  geom_label_repel(aes(total_accepted, total_attempts, label = package), 
                   data = . %&amp;gt;% filter(total_attempts &amp;gt;= 12)) +
  labs(x = &amp;quot;CRAN versions&amp;quot;, y = &amp;quot;Rejected submissions&amp;quot;,  size = &amp;quot;Packages&amp;quot;,
       title = &amp;quot;All packages submissions&amp;quot;, subtitle = &amp;quot;for the period analyzed ~174 weeks&amp;quot;) +
  theme_minimal() +
  scale_size(trans = &amp;quot;log10&amp;quot;) +
  theme(legend.position = c(0.8, 0.7), legend.background = element_rect())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2024/01/10/submission-cran-first-try/index.en_files/figure-html/plot-failed-submissions-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It doesn’t change much compared with the plot for experienced maintainers.
Note that this only adds the packages that were never approved and the submissions made before the first acceptance.
So the changes should only be visible in the bottom-left corner of the plot.&lt;/p&gt;
&lt;p&gt;Overall, 14.5% of the attempts end up being rejected.&lt;/p&gt;
&lt;div id=&#34;main-take-away&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Main take away&lt;/h2&gt;
&lt;p&gt;Submitting to CRAN is not easy on the first try, and it usually requires 2 submissions for each accepted version.&lt;br /&gt;
While the &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-devel/R-exts.html&#34;&gt;Writing R Extensions&lt;/a&gt; document is clear, it might be too extensive for many cases.&lt;br /&gt;
The &lt;a href=&#34;https://cran.r-project.org/web/packages/policies.html&#34;&gt;CRAN policy&lt;/a&gt; is short but might not be clear enough for new maintainers.&lt;br /&gt;
A document in between might be &lt;a href=&#34;https://r-pkgs.org/&#34;&gt;R packages&lt;/a&gt;, but it is still extensive and focused on a small, opinionated set of packages.&lt;br /&gt;
A CRAN Task View or some training might be a good way to reduce the overall problem.&lt;br /&gt;
For those maintainers struggling, clearer technical or editorial decisions might help.&lt;/p&gt;
&lt;p&gt;In addition, it seems that the packages having more problems with submissions are not new ones: experienced maintainers also have trouble getting their packages accepted.&lt;br /&gt;
This might hint at difficulties replicating the CRAN checks or environments, or at the scale of the checks (dependency checks).&lt;br /&gt;
Focusing on helping those packages’ maintainers might be a good way to reduce the load on the CRAN team.&lt;/p&gt;
&lt;p&gt;I also want to note that this analysis could be improved if we knew whether each rejection was automatic or manual.&lt;br /&gt;
This would let us see the burden on CRAN volunteers and perhaps define the problem better and propose better solutions.&lt;br /&gt;
It could be attempted by looking at the last folder a package was in during the submission process, but it would still not be clear what the most frequent problem is.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bonus&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bonus&lt;/h2&gt;
&lt;p&gt;Of all the new packages, more than half of the first versions are already archived (either replaced by newer versions or archived entirely):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted_subm |&amp;gt; 
  filter(acceptance_n == 1L) |&amp;gt; 
  count(status = type, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;status&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;archived&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4763&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;65.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;available&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2517&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;34.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fully_archived &amp;lt;- accepted_subm |&amp;gt;
  filter(acceptance_n != 0L) |&amp;gt; 
  filter(any(acceptance_n == 1L), .by = package) |&amp;gt; 
  summarize(archived = all(type == &amp;quot;archived&amp;quot;), .by = package) |&amp;gt; 
  count(archived, name = &amp;quot;packages&amp;quot;) |&amp;gt; 
  mutate(percentage = lp(packages/sum(packages)))
fully_archived&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;archived&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;packages&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6783&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;93.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;497&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only 6.8% of packages were fully archived at the end of this period (2020-09-12 to 2024-01-20).&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2024-01-20
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
##  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
##  farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.1)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
##  ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
##  ggrepel     * 0.9.5   2024-01-10 [1] CRAN (R 4.3.1)
##  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
##  gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.1)
##  highr         0.10    2022-12-22 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr       * 1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.1)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
##  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
##  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.1)
##  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
##  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.1)
##  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.2)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
##  withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.2)
##  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>CRAN maintained packages</title>
      <link>https://llrs.dev/post/2023/05/03/cran-maintained-packages/</link>
      <pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2023/05/03/cran-maintained-packages/</guid>
      <description>


&lt;p&gt;The role of package managers in software is paramount for developers.
In R, the CRAN team provides a platform to test and host packages.
This means ensuring that R dependencies are up to date and that software required by some packages is also available on CRAN.&lt;/p&gt;
&lt;p&gt;This involves testing ~20000 packages frequently (daily for most packages) on several architectures and R versions.
In addition, they test updates for compatibility with the dependencies and test and review new packages.&lt;/p&gt;
&lt;p&gt;Most of the work with packages is automated but often requires human intervention (&lt;a href=&#34;https://journal.r-project.org/news/RJ-2022-4-cran/#cran-package-submissions&#34;&gt;50% of the submissions&lt;/a&gt;).
Another time-consuming activity is keeping up packages abandoned by their original maintainers.&lt;/p&gt;
&lt;p&gt;While newer packages are &lt;a href=&#34;https://llrs.dev/post/2021/12/07/reasons-cran-archivals/&#34;&gt;often archived from CRAN&lt;/a&gt;, some old packages were adopted by CRAN.
The &lt;a href=&#34;https://cran.r-project.org/CRAN_team.htm&#34;&gt;CRAN team&lt;/a&gt; is &lt;a href=&#34;https://mastodon.social/@henrikbengtsson/110186925898457474&#34;&gt;looking for help&lt;/a&gt; maintaining those.&lt;/p&gt;
&lt;p&gt;In this post I’ll explore the packages maintained by CRAN.&lt;/p&gt;
&lt;div id=&#34;cran-in-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;CRAN in packages&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;packages_db &amp;lt;- as.data.frame(tools::CRAN_package_db())
cran_author &amp;lt;- grep(&amp;quot;CRAN Team&amp;quot;, x = packages_db$Author, ignore.case = TRUE)
cran_authorsR &amp;lt;- grep(&amp;quot;CRAN Team&amp;quot;, x = packages_db$`Authors@R`, ignore.case = TRUE)
CRAN_TEAM_mentioned &amp;lt;- union(cran_author, cran_authorsR)
unique(packages_db$Package[CRAN_TEAM_mentioned])
## [1] &amp;quot;fBasics&amp;quot;   &amp;quot;fMultivar&amp;quot; &amp;quot;geiger&amp;quot;    &amp;quot;plotrix&amp;quot;   &amp;quot;RCurl&amp;quot;     &amp;quot;RJSONIO&amp;quot;  
## [7] &amp;quot;udunits2&amp;quot;  &amp;quot;XML&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In some of these packages the CRAN team appears as a contributor because they provided help or code to fix bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=geiger&#34;&gt;geiger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=fMultivar&#34;&gt;fMultivar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=fBasics&#34;&gt;fBasics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=udunits2&#34;&gt;udunits2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In others they are the maintainers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=XML&#34;&gt;XML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=RCurl&#34;&gt;RCurl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cran.r-project.org/package=RJSONIO&#34;&gt;RJSONIO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of these three packages, RJSONIO is the newest (first release in 2010) and requires fewer updates (lately one or two a year).
However, in 2022 RCurl and XML required 4 and 5 updates respectively.
I will focus on these packages, as they are the ones for which CRAN is looking for new maintainers.&lt;/p&gt;
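&lt;p&gt;These update counts can be derived from the CRAN archive timestamps. A minimal sketch on made-up dates (the real dates come from the archive database used later in this post):&lt;/p&gt;

```r
# Hypothetical release dates for a package, as recorded in the CRAN archive.
releases = as.Date(c("2021-03-01", "2022-01-10", "2022-04-02",
                     "2022-07-19", "2022-11-30", "2023-02-14"))

# Count the updates per year: here 1 in 2021, 4 in 2022 and 1 in 2023.
table(format(releases, "%Y"))
```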
&lt;/div&gt;
&lt;div id=&#34;rcurl-and-xml&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;RCurl and XML&lt;/h1&gt;
&lt;div id=&#34;circular-dependency&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Circular dependency&lt;/h2&gt;
&lt;p&gt;Both XML and RCurl depend on each other.&lt;/p&gt;
&lt;p&gt;We can see that each package is a direct dependency of one of its own direct dependencies!
How can that be?
If we go to the &lt;a href=&#34;https://cran.r-project.org/package=RCurl&#34;&gt;RCurl&lt;/a&gt; website we see “Suggests: XML”, and on the &lt;a href=&#34;https://cran.r-project.org/package=XML&#34;&gt;XML&lt;/a&gt; website RCurl is listed too.
This circular dependency is allowed because they have each other in Suggests.&lt;/p&gt;
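&lt;p&gt;A mutual pair like this can be detected mechanically from a dependency list. A sketch on a toy list (same shape as the output of &lt;code&gt;tools::package_dependencies()&lt;/code&gt;, not the real CRAN database):&lt;/p&gt;

```r
# Toy "Suggests" list: names are packages, values the packages they suggest.
suggests = list(
  RCurl = "XML",
  XML = "RCurl",
  jsonlite = character(0)
)

# A package is part of a circular pair if one of the packages it suggests
# also suggests it back.
is_circular = function(pkg, deps) {
  any(vapply(deps[[pkg]], function(d) pkg %in% deps[[d]], logical(1)))
}

vapply(names(suggests), is_circular, logical(1), deps = suggests)
# RCurl and XML are flagged TRUE; jsonlite is not.
```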
&lt;p&gt;A first step to reduce any possible problem would be to separate them.
This would make it easier to understand which package is worth prioritizing, and possible missteps would have less impact.&lt;/p&gt;
&lt;p&gt;If we look at the &lt;a href=&#34;https://github.com/search?q=repo%3Acran%2FXML%20RCurl&amp;amp;type=code&#34;&gt;XML source code for RCurl we find&lt;/a&gt; some code in the &lt;code&gt;inst/&lt;/code&gt; folder.
If these two cases were removed, the package could drop its dependency on RCurl.&lt;/p&gt;
&lt;p&gt;Similarly, if we look at the &lt;a href=&#34;https://github.com/search?q=repo%3Acran%2FRCurl%20XML&amp;amp;type=code&#34;&gt;RCurl source code for XML we find&lt;/a&gt; some code in the &lt;code&gt;inst/&lt;/code&gt; folder and in some examples.
If these three cases were removed, the package could drop its dependency on XML.&lt;/p&gt;
&lt;p&gt;RCurl has been &lt;a href=&#34;https://diffify.com/R/RCurl/1.95-4.9/1.98-1.12&#34;&gt;more stable&lt;/a&gt; than XML, which has seen &lt;a href=&#34;https://diffify.com/R/XML/3.98-1.7/3.99-0.14&#34;&gt;new functions added and one removed&lt;/a&gt; since CRAN started maintaining it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;relevant-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Relevant data&lt;/h2&gt;
&lt;p&gt;We will look at 4 sets of data for each package: &lt;a href=&#34;#dependencies&#34;&gt;dependencies&lt;/a&gt;, &lt;a href=&#34;#releases&#34;&gt;releases&lt;/a&gt;, &lt;a href=&#34;#maintainers&#34;&gt;maintainers&lt;/a&gt; and &lt;a href=&#34;#downloads&#34;&gt;downloads&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;dependencies&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Dependencies&lt;/h3&gt;
&lt;p&gt;Both packages have some system dependencies, which might make the maintenance harder.
In addition they have a large number of reverse dependencies.
We can gather the packages that depend on them in CRAN and Bioconductor software packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tools&amp;quot;)
# Look up only software dependencies in Bioconductor
options(repos = BiocManager::repositories()[c(&amp;quot;BioCsoft&amp;quot;, &amp;quot;CRAN&amp;quot;)])
ap &amp;lt;- available.packages()
all_deps &amp;lt;- package_dependencies(c(&amp;quot;RCurl&amp;quot;, &amp;quot;XML&amp;quot;), 
                                 reverse = TRUE, db = ap, which = &amp;quot;all&amp;quot;)
all_unique_deps &amp;lt;- unique(unlist(all_deps, FALSE, FALSE))
first_deps &amp;lt;- package_dependencies(all_unique_deps, db = ap, which = &amp;quot;all&amp;quot;)
first_deps_strong &amp;lt;- package_dependencies(all_unique_deps, db = ap, which = &amp;quot;strong&amp;quot;)
strong &amp;lt;- sapply(first_deps_strong, function(x){any(c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;) %in% x)})
deps_strong &amp;lt;- package_dependencies(all_unique_deps, recursive = TRUE, 
                                 db = ap, which = &amp;quot;strong&amp;quot;)
first_rdeps &amp;lt;- package_dependencies(all_unique_deps, 
                                   reverse = TRUE, db = ap, which = &amp;quot;all&amp;quot;)
deps_all &amp;lt;- package_dependencies(all_unique_deps, recursive = TRUE, 
                                 db = ap, which = &amp;quot;all&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They have 495 direct dependencies (and 8 more in annotation packages in Bioconductor: recount3, ENCODExplorerData, UCSCRepeatMasker, gDNAinRNAseqData, qdap, qdapTools, metaboliteIDmapping and curatedBreastData).&lt;/p&gt;
&lt;p&gt;These two packages, with their dependencies, are used one way or another by around 20000 packages (about 90% of CRAN and Bioconductor)!
If these packages fail, the impact on the community would be huge.&lt;/p&gt;
&lt;p&gt;To reduce the impact of the dependencies we should look at the direct dependencies.
But we also looked at the reverse dependencies to assess the impact of the package on the other packages.&lt;/p&gt;
&lt;p&gt;Knowing which packages these are, and who maintains them, will help decide the best course of action.&lt;/p&gt;
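&lt;p&gt;The recursive reverse dependency count is what drives the ~20000 figure. On a toy forward-dependency graph the same closure can be computed by hand (hypothetical package names; the real computation uses &lt;code&gt;package_dependencies(..., reverse = TRUE, recursive = TRUE)&lt;/code&gt;):&lt;/p&gt;

```r
# Toy forward graph: each package maps to the packages it depends on.
deps = list(A = character(0), B = "A", C = "B", D = "A")

# Collect every package that depends on `pkg`, directly or transitively.
# (Assumes an acyclic graph, as strong dependencies on CRAN must be.)
reverse_closure = function(pkg, deps) {
  direct = names(deps)[vapply(deps, function(d) pkg %in% d, logical(1))]
  unique(c(direct, unlist(lapply(direct, reverse_closure, deps = deps))))
}

reverse_closure("A", deps)  # B, D and C: all of them rely on A
```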
&lt;/div&gt;
&lt;div id=&#34;releases&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Releases&lt;/h3&gt;
&lt;p&gt;A first approach is looking into the number of releases and their dates to assess whether the package has an active maintainer:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;archive &amp;lt;- tools:::CRAN_archive_db()[all_unique_deps]
packages &amp;lt;- tools::CRAN_package_db()
library(&amp;quot;dplyr&amp;quot;)
library(&amp;quot;BiocPkgTools&amp;quot;)
fr &amp;lt;- vapply(archive, function(x) {
  if (is.null(x)) {
    return(NA)
  }
  as.Date(x$mtime[1])
}, FUN.VALUE = Sys.Date())
fr_bioc &amp;lt;- biocDownloadStats(&amp;quot;software&amp;quot;) |&amp;gt; 
  filter(Package %in% all_unique_deps) |&amp;gt; 
  firstInBioc() |&amp;gt; 
  pull(Date, name = Package)
first_release &amp;lt;- c(as.Date(fr[!is.na(fr)]), as.Date(fr_bioc))[all_unique_deps]
last_update &amp;lt;- packages$Published[match(all_unique_deps, packages$Package)]
releases &amp;lt;- vapply(archive, NROW, numeric(1L)) + 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We only have this information for CRAN packages:&lt;br /&gt;
Bioconductor has two releases every year, and while maintainers can release patched versions of packages between them, that information is not stored (or not easily retrieved; the versions are still available on the &lt;a href=&#34;https://code.bioconductor.org&#34;&gt;git server&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Even if Bioconductor maintainers didn’t modify the package, the version number increases with each release.
And a version update in git doesn’t propagate to users automatically unless the checks pass.
For all these reasons it doesn’t make sense to count releases of Bioconductor packages.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;maintainers&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Maintainers&lt;/h3&gt;
&lt;p&gt;Now that we know which packages are more active, we can look at the people behind them.
This way we can prioritize working with maintainers that are known to be active&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maintainers &amp;lt;- packages_db$Maintainer[match(all_unique_deps, packages_db$Package)]
maintainers &amp;lt;- trimws(gsub(&amp;quot;&amp;lt;.+&amp;gt;&amp;quot;, &amp;quot;&amp;quot;, maintainers))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, the Bioconductor repository doesn’t provide a file to gather this kind of data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;downloads&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Downloads&lt;/h3&gt;
&lt;p&gt;Another variable we can use is the number of downloads of said packages.
Packages that are downloaded more are probably used more, so a breaking change in them will have an impact on more people than one in other packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;cranlogs&amp;quot;)
acd &amp;lt;- cran_downloads(intersect(all_unique_deps, packages_db$Package), 
                          when = &amp;quot;last-month&amp;quot;)
cran_pkg &amp;lt;- summarise(acd, downloads = sum(count), .by = package)
loc &amp;lt;- Sys.setlocale(locale = &amp;quot;C&amp;quot;)
bioc_d &amp;lt;- vapply(setdiff(all_unique_deps, packages_db$Package), function(x){
  pkg &amp;lt;- pkgDownloadStats(x)
  tail(pkg$Nb_of_downloads, 1)
  }, numeric(1L))
bioc_pkg &amp;lt;- data.frame(package = names(bioc_d), downloads = bioc_d)
downloads &amp;lt;- rbind(bioc_pkg, cran_pkg)
rownames(downloads) &amp;lt;- downloads$package
dwn &amp;lt;- downloads[all_unique_deps, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The logs are provided by the global mirror of CRAN (sponsored by RStudio).&lt;br /&gt;
The Bioconductor infrastructure provides the total number of downloads and the number of downloads from distinct IPs &lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;We collected the data that might be relevant.
Now we can start looking at all the data gathered:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;repo &amp;lt;- vector(&amp;quot;character&amp;quot;, length(all_unique_deps))
ap_deps &amp;lt;- ap[all_unique_deps, ]
repo[startsWith(ap_deps[, &amp;quot;Repository&amp;quot;], &amp;quot;https://bioc&amp;quot;)] &amp;lt;- &amp;quot;Bioconductor&amp;quot;
repo[!startsWith(ap_deps[, &amp;quot;Repository&amp;quot;], &amp;quot;https://bioc&amp;quot;)] &amp;lt;- &amp;quot;CRAN&amp;quot;
deps &amp;lt;- data.frame(package = all_unique_deps,
                   direct_dep_XML = all_unique_deps %in% all_deps$XML,
                   direct_dep_RCurl = all_unique_deps %in% all_deps$RCurl,
                   first_deps_n = lengths(first_deps),
                   deps_all_n = lengths(deps_all),
                   first_rdeps_n = lengths(first_rdeps),
                   first_deps_strong_n = lengths(first_deps_strong), 
                   deps_strong_n = lengths(deps_strong),
                   direct_strong = strong, 
                   releases = releases,
                   strong = strong, 
                   first_release = first_release,
                   last_release = last_update,
                   maintainer = maintainers,
                   downloads = dwn$downloads,
                   repository = repo) |&amp;gt; 
  mutate(type = case_when(direct_dep_XML &amp;amp; direct_dep_RCurl ~ &amp;quot;both&amp;quot;,
                          direct_dep_XML ~ &amp;quot;XML&amp;quot;,
                          direct_dep_RCurl ~ &amp;quot;RCurl&amp;quot;))
rownames(deps) &amp;lt;- NULL
head(deps)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;package&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_XML&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_RCurl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_deps_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps_all_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_rdeps_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;first_deps_strong_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps_strong_n&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_strong&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;releases&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;strong&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;first_release&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;last_release&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;maintainer&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;downloads&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;repository&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;AnnotationForge&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2012-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;AnnotationHubData&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;136&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2015-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6619&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;autonomics&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2499&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;104&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;91&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BaseSpaceR&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2013-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;218&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BayesSpace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2459&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;161&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;221&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;BgeeDB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2457&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016-02-01&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;238&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bioconductor&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I added some numbers and logical values that might help explore this data.&lt;/p&gt;
&lt;p&gt;We will look at the &lt;a href=&#34;#distribution-dependencies&#34;&gt;distribution of dependencies between RCurl and XML&lt;/a&gt; and some plots to get a &lt;a href=&#34;#overview&#34;&gt;quick overview&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;distribution-dependencies&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Distribution dependencies&lt;/h3&gt;
&lt;p&gt;Let’s see how many packages depend on each of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps |&amp;gt; 
  summarise(Packages = n(), deps = sum(first_deps_n),
            q25 = quantile(deps_all_n, probs = 0.25),
            mean_all = mean(deps_all_n),
            q75 = quantile(deps_all_n, probs = 0.75),
            .by = c(direct_dep_XML, direct_dep_RCurl)) |&amp;gt; 
  arrange(-Packages)&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;22%&#34; /&gt;
&lt;col width=&#34;25%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_XML&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;direct_dep_RCurl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Packages&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;deps&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;q25&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean_all&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;q75&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;235&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3584&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2365.596&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2458.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;FALSE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3187&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2320.855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2460.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TRUE&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;67&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1216&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2423.119&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2457.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There are ~40 more packages depending on XML than on RCurl, and just 67 depend on both of them.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;overview&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;We can plot some variables to get a quick overview of the packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
library(&amp;quot;ggrepel&amp;quot;)
deps_wo &amp;lt;- filter(deps, !package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;))
deps_wo |&amp;gt; 
  ggplot() +
  geom_point(aes(first_deps_n, downloads, shape = type)) +
  geom_text_repel(aes(first_deps_n, downloads, label = package),
                  data = filter(deps_wo, first_deps_n &amp;gt; 40 | downloads &amp;gt; 10^5)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Packages and downloads&amp;quot;, 
       x = &amp;quot;Direct dependencies&amp;quot;, y = &amp;quot;Downloads&amp;quot;, size = &amp;quot;Packages&amp;quot;)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot1-1.png&#34; alt=&#34;Direct dependencies vs downloads. Many packages have up to 50 direct dependencies and most have below 1000 downloads in a month.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Direct dependencies vs downloads. Many packages have up to 50 direct dependencies and most have below 1000 downloads in a month.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There is an outlier in Figure &lt;a href=&#34;#fig:plot1&#34;&gt;1&lt;/a&gt;: the mlr package has more than 10k downloads and close to 120 direct dependencies, but fewer than 15 strong dependencies!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps_wo |&amp;gt; 
  ggplot() +
  geom_point(aes(first_deps_n, first_rdeps_n, shape = type)) +
  geom_text_repel(aes(first_deps_n, first_rdeps_n, label = package),
                  data = filter(deps_wo, first_deps_n &amp;gt; 60 | first_rdeps_n &amp;gt; 50)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Few dependencies but lots of dependents&amp;quot;,
    x = &amp;quot;Direct dependencies&amp;quot;, y = &amp;quot;Depend on them&amp;quot;, size = &amp;quot;Packages&amp;quot;)
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot2&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot2-1.png&#34; alt=&#34;Dependencies vs packages that depend on them. &#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Dependencies vs packages that depend on them.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In general though, the packages that more packages depend on have fewer direct dependencies themselves.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
library(&amp;quot;ggrepel&amp;quot;)
deps_wo &amp;lt;- filter(deps, !package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;))
deps_wo |&amp;gt; 
  ggplot() +
  geom_vline(xintercept = 20, linetype = 2) +
  geom_point(aes(first_deps_strong_n, downloads, shape = repository)) +
  geom_text_repel(aes(first_deps_strong_n, downloads, label = package),
                  data = filter(deps_wo, first_deps_strong_n &amp;gt; 20 | downloads &amp;gt; 10^5)) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  labs(title = &amp;quot;Packages and downloads&amp;quot;, 
       x = &amp;quot;Direct strong dependencies&amp;quot;, y = &amp;quot;Downloads&amp;quot;, shape = &amp;quot;Repository&amp;quot;)
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot3&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot3-1.png&#34; alt=&#34;Direct strong dependencies vs downloads. Many packages have more than 20 direct imports.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Direct strong dependencies vs downloads. Many packages have more than 20 direct imports.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;One observable effect is that many packages do not comply with the current CRAN limit of 20 strong dependencies (as &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#index-_005fR_005fCHECK_005fEXCESSIVE_005fIMPORTS_005f&#34;&gt;described in R Internals&lt;/a&gt;).
This suggests that these CRAN packages are old or that this limit is not checked on package updates.&lt;/p&gt;
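&lt;p&gt;Flagging the offenders is straightforward once the strong dependency counts are computed, as done earlier with &lt;code&gt;package_dependencies()&lt;/code&gt;. A sketch with hypothetical counts and package names:&lt;/p&gt;

```r
# Strong (Depends/Imports/LinkingTo) dependency counts per package, with
# the same shape as lengths(package_dependencies(..., which = "strong")).
strong_counts = c(pkgA = 35, pkgB = 12, pkgC = 21)

# _R_CHECK_EXCESSIVE_IMPORTS_ can make R CMD check complain above a
# threshold; 20 is the limit mentioned in R Internals.
names(strong_counts)[strong_counts > 20]  # "pkgA" "pkgC"
```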
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_maintainers &amp;lt;- deps_wo |&amp;gt; 
  filter(!is.na(maintainer)) |&amp;gt; 
  summarize(n = n(), downloads = sum(downloads), .by = maintainer)
data_maintainers |&amp;gt; 
  ggplot() +
  geom_point(aes(n, downloads)) +
  geom_text_repel(aes(n, downloads, label = maintainer),
                  data = filter(data_maintainers, n &amp;gt; 2 | downloads &amp;gt; 10^4)) +
  scale_y_log10(labels = scales::label_log()) +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  theme_minimal() +
  labs(title = &amp;quot;CRAN maintainers that depend on XML and RCurl&amp;quot;,
       x = &amp;quot;Packages&amp;quot;, y = &amp;quot;Downloads&amp;quot;)
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot-maintainers&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/plot-maintainers-1.png&#34; alt=&#34;Looking at maintainers and the number of downloads they have.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Looking at maintainers and the number of downloads they have.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Most maintainers have few packages, some of them highly used, but some maintain many relatively highly used packages.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-important-packages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Finding important packages&lt;/h3&gt;
&lt;p&gt;We can use a PCA to find which packages are more important.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cols_pca &amp;lt;-  c(4:7, 15)
pca_all &amp;lt;- prcomp(deps_wo[, cols_pca], scale. = TRUE, center = TRUE)
summary(pca_all)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.386 1.2478 0.9458 0.65380 0.44846
## Proportion of Variance 0.384 0.3114 0.1789 0.08549 0.04022
## Cumulative Proportion  0.384 0.6954 0.8743 0.95978 1.00000
pca_data &amp;lt;- cbind(pca_all$x, deps_wo)
ggplot(pca_data) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = repository, shape = repository)) +
  geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;PCA of the numeric variables&amp;quot;)
## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-all&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-all-1.png&#34; alt=&#34;PCA of all packages.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 5: PCA of all packages.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In the first principal component we can see the packages that depend on many packages; the second one captures packages with many downloads and/or many packages depending on them, as shown by the &lt;code&gt;rotation&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_all$rotation[, 1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_deps_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6521642&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.1528947&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;deps_all_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.3304698&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0549046&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_rdeps_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1235972&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6948659&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;first_deps_strong_n&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6606765&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0750116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;downloads&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1170554&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.6965223&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More important are the packages named in Figure &lt;a href=&#34;#fig:pca-all&#34;&gt;5&lt;/a&gt;: RUnit, markdown and rgeos have a high number of downloads and many packages depend on them one way or another.&lt;/p&gt;
&lt;p&gt;However we can focus on packages that without RCurl or XML wouldn’t work:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_strong &amp;lt;- prcomp(deps_wo[deps_wo$strong, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_strong)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4198 1.3005 0.9373 0.49421 0.41258
## Proportion of Variance 0.4032 0.3382 0.1757 0.04885 0.03404
## Cumulative Proportion  0.4032 0.7414 0.9171 0.96596 1.00000
pca_data_strong &amp;lt;- cbind(pca_strong$x, deps_wo[deps_wo$strong, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = repository, shape = repository)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Important packages depending on XML and RCurl&amp;quot;, 
       subtitle = &amp;quot;PCA of numeric variables of strong dependencies&amp;quot;,
       col = &amp;quot;Repository&amp;quot;, shape = &amp;quot;Repository&amp;quot;)
## Warning: ggrepel: 42 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-strong&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-strong-1.png&#34; alt=&#34;PCA of packages with strong dependency to XML or RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 6: PCA of packages with strong dependency to XML or RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The main packages that depend on XML and RCurl are from Bioconductor, followed by mlr and rlist.
rlist depends on XML but only uses 3 functions from it.
mlr uses 5 different functions from XML.&lt;/p&gt;
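&lt;p&gt;One quick proxy to check how much a package relies on XML is to look at what it imports from it. A minimal sketch (this requires the package to be installed, and it only counts declared imports, not actual calls in the code):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# List the functions that rlist imports from the XML namespace.
imp &amp;lt;- getNamespaceImports(&amp;quot;rlist&amp;quot;)
unlist(imp[names(imp) == &amp;quot;XML&amp;quot;])&lt;/code&gt;&lt;/pre&gt;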
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_weak &amp;lt;- prcomp(deps_wo[!deps_wo$strong, cols_pca], 
                   scale. = TRUE, center = TRUE)
summary(pca_weak)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4500 1.1578 0.9901 0.63980 0.40895
## Proportion of Variance 0.4205 0.2681 0.1960 0.08187 0.03345
## Cumulative Proportion  0.4205 0.6886 0.8847 0.96655 1.00000
pca_data_weak &amp;lt;- cbind(pca_weak$x, deps_wo[!deps_wo$strong, ])
ggplot(pca_data_weak) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
  geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_weak, abs(PC1)&amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;PCA of packages in CRAN&amp;quot;, col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-weak&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-weak-1.png&#34; alt=&#34;Packages with weak dependency to XML or RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 7: Packages with weak dependency to XML or RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;keep &amp;lt;- deps_wo$repository == &amp;quot;CRAN&amp;quot; &amp;amp; deps_wo$strong
pca_cran &amp;lt;- prcomp(deps_wo[keep, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_cran)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4174 1.3060 0.9244 0.51813 0.40278
## Proportion of Variance 0.4018 0.3412 0.1709 0.05369 0.03245
## Cumulative Proportion  0.4018 0.7430 0.9139 0.96755 1.00000
pca_data_strong &amp;lt;- cbind(pca_cran$x, deps_wo[keep, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Packages in CRAN&amp;quot;, 
       col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-cran&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-cran-1.png&#34; alt=&#34;PCA of packages on CRAN.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 8: PCA of packages on CRAN.
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;keep &amp;lt;- deps_wo$repository == &amp;quot;Bioconductor&amp;quot;  &amp;amp; deps_wo$strong
pca_bioc &amp;lt;- prcomp(deps_wo[keep, cols_pca], 
                     scale. = TRUE, center = TRUE)
summary(pca_bioc)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4913 1.3703 0.8495 0.33584 0.25281
## Proportion of Variance 0.4448 0.3755 0.1443 0.02256 0.01278
## Cumulative Proportion  0.4448 0.8203 0.9647 0.98722 1.00000
pca_data_strong &amp;lt;- cbind(pca_bioc$x, deps_wo[keep, ])
ggplot(pca_data_strong) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(aes(PC1, PC2, col = type, shape = type)) +
    geom_text_repel(aes(PC1, PC2, label = package), 
                  data = filter(pca_data_strong, abs(PC1) &amp;gt; 2 | abs(PC2) &amp;gt; 2)) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  labs(title = &amp;quot;Packages in Bioconductor&amp;quot;, 
       subtitle = &amp;quot;PCA of numeric variables of strong dependencies&amp;quot;,
       col = &amp;quot;Type&amp;quot;, shape = &amp;quot;Type&amp;quot;)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:pca-bioc&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/pca-bioc-1.png&#34; alt=&#34;PCA of packages on Bioconductor.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 9: PCA of packages on Bioconductor.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;GenomeInfoDb seems to be the most important package, and it only uses the &lt;code&gt;RCurl::getURL&lt;/code&gt; function.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;outro&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Outro&lt;/h2&gt;
&lt;p&gt;I wanted to explore a bit how these packages got into this position &lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;deps |&amp;gt; 
  filter(strong) |&amp;gt; 
  ggplot() +
  geom_vline(xintercept = as.Date(&amp;quot;2013-06-15&amp;quot;), linetype = 2) +
  geom_point(aes(first_release, downloads, col = type, shape = type, 
                 size = first_deps_strong_n)) +
  geom_label(aes(first_release, downloads, label = package),
             data = filter(deps, package %in% c(&amp;quot;XML&amp;quot;, &amp;quot;RCurl&amp;quot;)), show.legend = FALSE) +
  theme_minimal() +
  scale_y_log10(labels = scales::label_log()) +
  annotate(&amp;quot;text&amp;quot;, x = as.Date(&amp;quot;2014-6-15&amp;quot;), y = 5*10^5, 
           label = &amp;quot;CRAN maintained&amp;quot;, hjust = 0) +
  labs(x = &amp;quot;Release date&amp;quot;, y = &amp;quot;Downloads&amp;quot;, 
       title = &amp;quot;More packages added after CRAN maintenance than before&amp;quot;,
       subtitle = &amp;quot;Release date and downloads&amp;quot;,
       col = &amp;quot;Depends on&amp;quot;, shape = &amp;quot;Depends on&amp;quot;, size = &amp;quot;Direct strong dependencies&amp;quot;) 
## Warning: Removed 34 rows containing missing values (`geom_point()`).&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:deps-time&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://llrs.dev/post/2023/05/03/cran-maintained-packages/index.en_files/figure-html/deps-time-1.png&#34; alt=&#34;First release of packages in relation to the maintenance by CRAN of XML and RCurl.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 10: First release of packages in relation to the maintenance by CRAN of XML and RCurl.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;By now the CRAN team has been maintaining these packages for almost as long as the previous maintainer(s?).&lt;/p&gt;
&lt;p&gt;Next, we look at the dependencies added after CRAN started maintaining them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summarize(deps_wo,
          before = sum(first_release &amp;lt;= as.Date(&amp;quot;2013-06-15&amp;quot;), na.rm = TRUE), 
          later = sum(first_release &amp;gt; as.Date(&amp;quot;2013-06-15&amp;quot;), na.rm = TRUE),
          .by = type)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;before&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;later&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;both&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;RCurl&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;XML&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;63&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;156&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More packages that depend on XML and RCurl were released after CRAN took over their maintenance than before.
Maybe package authors trusted the CRAN team with their dependencies, or there was no alternative for the functionality.
This might also be explained by the growth of CRAN (and Bioconductor), with more packages being added each day.
However, this places further pressure on the CRAN team to keep maintaining those packages.
Removing this burden might free up time for them to dedicate to CRAN itself.&lt;/p&gt;
&lt;p&gt;A replacement for XML could be &lt;a href=&#34;https://cran.r-project.org/package=xml2&#34;&gt;xml2&lt;/a&gt;, first released in 2015 (which uses the same system dependency libxml2).&lt;br /&gt;
A replacement for RCurl could be &lt;a href=&#34;https://cran.r-project.org/package=curl&#34;&gt;curl&lt;/a&gt;, first released at the end of 2014 (which uses the same system dependency libcurl).&lt;/p&gt;
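&lt;p&gt;For illustration, here is a rough correspondence between the old and the new packages for basic usage (the APIs are not drop-in compatible, so most migrations need more work than this):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# XML -&amp;gt; xml2: parse an XML document.
doc_old &amp;lt;- XML::xmlParse(&amp;quot;file.xml&amp;quot;)
doc_new &amp;lt;- xml2::read_xml(&amp;quot;file.xml&amp;quot;)

# RCurl -&amp;gt; curl: fetch the content of a URL as text.
txt_old &amp;lt;- RCurl::getURL(&amp;quot;https://cran.r-project.org&amp;quot;)
txt_new &amp;lt;- rawToChar(curl::curl_fetch_memory(&amp;quot;https://cran.r-project.org&amp;quot;)$content)&lt;/code&gt;&lt;/pre&gt;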
&lt;p&gt;Until their release there was no replacement for these packages (if there were other alternatives, please let me know).
It is not clear to me whether those packages, at their first release, could already replace XML and RCurl.&lt;/p&gt;
&lt;p&gt;This highlights the importance of proper replacements for packages in the community.
A recent example is the effort of the &lt;a href=&#34;https://r-spatial.org/&#34;&gt;spatial community&lt;/a&gt;, led by Roger Bivand and Edzer Pebesma,
where packages have been carefully designed and planned to replace older packages that will be retired soon.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;recomendations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Recommendations&lt;/h1&gt;
&lt;p&gt;As final recommendations, I suggest to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Disentangle the XML and RCurl circular dependency.&lt;/li&gt;
&lt;li&gt;Evaluate whether the xml2 and curl packages provide enough functionality to replace XML and RCurl, respectively.
If not, see what should be added to them, or how to develop alternative packages to fill the gap.&lt;br /&gt;
A migration guide from XML and RCurl to their alternatives could be written to ease the transition and to check whether the functionality is covered.&lt;/li&gt;
&lt;li&gt;Contact the package maintainers seen in figure &lt;a href=&#34;#fig:plot-maintainers&#34;&gt;4&lt;/a&gt;, and the maintainers of the packages seen in figures &lt;a href=&#34;#fig:pca-all&#34;&gt;5&lt;/a&gt;, &lt;a href=&#34;#fig:pca-strong&#34;&gt;6&lt;/a&gt;, &lt;a href=&#34;#fig:pca-cran&#34;&gt;8&lt;/a&gt;, and &lt;a href=&#34;#fig:pca-bioc&#34;&gt;9&lt;/a&gt;, so that they replace the functionality they currently take from XML and RCurl.&lt;/li&gt;
&lt;li&gt;Set deprecation warnings on the XML and RCurl packages.&lt;/li&gt;
&lt;li&gt;Archive XML and RCurl packages in CRAN.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This might take years of moving packages around, but I am confident that once the word is out, package developers will avoid XML and RCurl, and maintainers of packages that depend on them will replace them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;On 2024/01/22 the &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-package-devel/2024q1/010359.html&#34;&gt;CRAN team asked for a maintainer of XML&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  C
##  ctype    C
##  tz       Europe/Madrid
##  date     2024-01-22
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package       * version     date (UTC) lib source
##  Biobase         2.62.0      2023-10-24 [1] Bioconductor
##  BiocFileCache   2.10.1      2023-10-26 [1] Bioconductor
##  BiocGenerics    0.48.1      2023-11-01 [1] Bioconductor
##  BiocManager     1.30.22     2023-08-08 [1] CRAN (R 4.3.1)
##  BiocPkgTools  * 1.20.0      2023-10-24 [1] Bioconductor
##  biocViews       1.70.0      2023-10-24 [1] Bioconductor
##  bit             4.0.5       2022-11-15 [1] CRAN (R 4.3.1)
##  bit64           4.0.5       2020-08-30 [1] CRAN (R 4.3.1)
##  bitops          1.0-7       2021-04-24 [1] CRAN (R 4.3.1)
##  blob            1.2.4       2023-03-17 [1] CRAN (R 4.3.1)
##  blogdown        1.18        2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown        0.37        2023-12-01 [1] CRAN (R 4.3.1)
##  bslib           0.6.1       2023-11-28 [1] CRAN (R 4.3.1)
##  cachem          1.0.8       2023-05-01 [1] CRAN (R 4.3.1)
##  cli             3.6.2       2023-12-11 [1] CRAN (R 4.3.1)
##  codetools       0.2-19      2023-02-01 [2] CRAN (R 4.3.1)
##  colorspace      2.1-0       2023-01-23 [1] CRAN (R 4.3.1)
##  cranlogs      * 2.1.1       2019-04-29 [1] CRAN (R 4.3.1)
##  crul            1.4.0       2023-05-17 [1] CRAN (R 4.3.1)
##  curl            5.2.0       2023-12-08 [1] CRAN (R 4.3.1)
##  DBI             1.2.1       2024-01-12 [1] CRAN (R 4.3.1)
##  dbplyr          2.4.0       2023-10-26 [1] CRAN (R 4.3.2)
##  digest          0.6.34      2024-01-11 [1] CRAN (R 4.3.1)
##  dplyr         * 1.1.4       2023-11-17 [1] CRAN (R 4.3.1)
##  DT              0.31        2023-12-09 [1] CRAN (R 4.3.1)
##  evaluate        0.23        2023-11-01 [1] CRAN (R 4.3.2)
##  fansi           1.0.6       2023-12-08 [1] CRAN (R 4.3.1)
##  farver          2.1.1       2022-07-06 [1] CRAN (R 4.3.1)
##  fastmap         1.1.1       2023-02-24 [1] CRAN (R 4.3.1)
##  fauxpas         0.5.2       2023-05-03 [1] CRAN (R 4.3.1)
##  filelock        1.0.3       2023-12-11 [1] CRAN (R 4.3.1)
##  generics        0.1.3       2022-07-05 [1] CRAN (R 4.3.1)
##  ggplot2       * 3.4.4       2023-10-12 [1] CRAN (R 4.3.1)
##  ggrepel       * 0.9.5       2024-01-10 [1] CRAN (R 4.3.1)
##  gh              1.4.0       2023-02-22 [1] CRAN (R 4.3.1)
##  glue            1.7.0       2024-01-09 [1] CRAN (R 4.3.1)
##  graph           1.80.0      2023-10-24 [1] Bioconductor
##  gtable          0.3.4       2023-08-21 [1] CRAN (R 4.3.1)
##  highr           0.10        2022-12-22 [1] CRAN (R 4.3.1)
##  hms             1.1.3       2023-03-21 [1] CRAN (R 4.3.1)
##  htmltools       0.5.7       2023-11-03 [1] CRAN (R 4.3.2)
##  htmlwidgets   * 1.6.4       2023-12-06 [1] CRAN (R 4.3.1)
##  httpcode        0.3.0       2020-04-10 [1] CRAN (R 4.3.1)
##  httr            1.4.7       2023-08-15 [1] CRAN (R 4.3.1)
##  igraph          1.6.0       2023-12-11 [1] CRAN (R 4.3.1)
##  jquerylib       0.1.4       2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite        1.8.8       2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         * 1.45        2023-10-30 [1] CRAN (R 4.3.2)
##  labeling        0.4.3       2023-08-29 [1] CRAN (R 4.3.2)
##  lifecycle       1.0.4       2023-11-07 [1] CRAN (R 4.3.2)
##  magrittr        2.0.3       2022-03-30 [1] CRAN (R 4.3.1)
##  memoise         2.0.1       2021-11-26 [1] CRAN (R 4.3.1)
##  munsell         0.5.0       2018-06-12 [1] CRAN (R 4.3.1)
##  pillar          1.9.0       2023-03-22 [1] CRAN (R 4.3.1)
##  pkgconfig       2.0.3       2019-09-22 [1] CRAN (R 4.3.1)
##  purrr           1.0.2       2023-08-10 [1] CRAN (R 4.3.1)
##  R6              2.5.1       2021-08-19 [1] CRAN (R 4.3.1)
##  RBGL            1.78.0      2023-10-24 [1] Bioconductor
##  Rcpp            1.0.12      2024-01-09 [1] CRAN (R 4.3.1)
##  RCurl           1.98-1.14   2024-01-09 [1] CRAN (R 4.3.1)
##  readr           2.1.5       2024-01-10 [1] CRAN (R 4.3.1)
##  rlang           1.1.3       2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown       2.25        2023-09-18 [1] CRAN (R 4.3.1)
##  rorcid          0.7.0       2021-01-20 [1] CRAN (R 4.3.1)
##  RSQLite         2.3.5       2024-01-21 [1] CRAN (R 4.3.1)
##  rstudioapi      0.15.0      2023-07-07 [1] CRAN (R 4.3.1)
##  RUnit           0.4.32      2018-05-18 [1] CRAN (R 4.3.1)
##  rvest           1.0.3       2022-08-19 [1] CRAN (R 4.3.1)
##  sass            0.4.8       2023-12-06 [1] CRAN (R 4.3.1)
##  scales          1.3.0       2023-11-28 [1] CRAN (R 4.3.1)
##  sessioninfo     1.2.2       2021-12-06 [1] CRAN (R 4.3.1)
##  stringi         1.8.3       2023-12-11 [1] CRAN (R 4.3.1)
##  stringr         1.5.1       2023-11-14 [1] CRAN (R 4.3.1)
##  tibble          3.2.1       2023-03-20 [1] CRAN (R 4.3.1)
##  tidyselect      1.2.0       2022-10-10 [1] CRAN (R 4.3.1)
##  tzdb            0.4.0       2023-05-12 [1] CRAN (R 4.3.1)
##  utf8            1.2.4       2023-10-22 [1] CRAN (R 4.3.2)
##  vctrs           0.6.5       2023-12-01 [1] CRAN (R 4.3.1)
##  whisker         0.4.1       2022-12-05 [1] CRAN (R 4.3.1)
##  withr           3.0.0       2024-01-16 [1] CRAN (R 4.3.1)
##  xfun            0.41        2023-11-01 [1] CRAN (R 4.3.2)
##  XML             3.99-0.16.1 2024-01-22 [1] CRAN (R 4.3.1)
##  xml2            1.3.6       2023-12-04 [1] CRAN (R 4.3.1)
##  yaml            2.3.8       2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes footnotes-end-of-document&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;the &lt;code&gt;maintainer&lt;/code&gt; function only works for installed packages, and I don’t have all these packages installed.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Both logs only count those of their repository and not from other mirrors or approaches (RSPM, bspm, r2u, ….).&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;I recently learned this word as the opposite of introduction/intro.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Reasons why packages are archived on CRAN</title>
      <link>https://llrs.dev/post/2021/12/07/reasons-cran-archivals/</link>
      <pubDate>Tue, 07 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2021/12/07/reasons-cran-archivals/</guid>
      <description>


&lt;p&gt;On the Repositories working group of the R Consortium, Rich FitzJohn posted &lt;a href=&#34;https://github.com/RConsortium/r-repositories-wg/issues/8#issuecomment-979486806&#34;&gt;a comment&lt;/a&gt; pointing to &lt;a href=&#34;https://cran.r-project.org/src/contrib/PACKAGES.in&#34;&gt;a file&lt;/a&gt; where the CRAN team seems to store the package history and use it for checks.&lt;/p&gt;
&lt;p&gt;The structure of the file is not defined anywhere I could find (I haven’t looked much, to be honest).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Package: &amp;lt;package name&amp;gt;
X-CRAN-Comment: Archived on YYYY-MM-DD as &amp;lt;reason&amp;gt;.
X-CRAN-History: Archived on YYYY-MM-DD as &amp;lt;reason&amp;gt;.
  Unarchived on YYYY-MM-DD.
  .
  &amp;lt;Optional clarification of archival reason&amp;gt;
&amp;lt;Optional fields like License_restricts_use, Replaced_by, Maintainer: ORPHANED, OS_type: unix&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think the X-CRAN-Comment field is what appears on the website of an archived package, like the &lt;a href=&#34;https://cran.r-project.org/package=radix&#34;&gt;radix package&lt;/a&gt;. However, some comments on the website do not appear in the file.&lt;/p&gt;
&lt;p&gt;In addition, the file is missing the archiving and unarchiving records of some packages, although it covers events from 2013 or earlier up to now. We can still use it to understand the &lt;em&gt;reasons&lt;/em&gt; why packages are archived, which seems to be the main purpose of the file.&lt;/p&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The data&lt;/h1&gt;
&lt;p&gt;The first step is to read the file.
As it has a &lt;code&gt;key: value&lt;/code&gt; structure similar to the DESCRIPTION file of packages, it is probably in DCF (Debian Control File) format, which is easy to read with the &lt;code&gt;read.dcf&lt;/code&gt; function.&lt;/p&gt;
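&lt;p&gt;A minimal sketch of reading it, assuming we want the fields seen in the structure above (the field selection is my choice, not necessarily what the original analysis used):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read the DCF-formatted file directly from CRAN, keeping only some fields.
pkgs &amp;lt;- read.dcf(url(&amp;quot;https://cran.r-project.org/src/contrib/PACKAGES.in&amp;quot;),
                 fields = c(&amp;quot;Package&amp;quot;, &amp;quot;X-CRAN-Comment&amp;quot;, &amp;quot;X-CRAN-History&amp;quot;,
                            &amp;quot;Replaced_by&amp;quot;, &amp;quot;Maintainer&amp;quot;))
head(pkgs)&lt;/code&gt;&lt;/pre&gt;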
&lt;div id=&#34;exploring&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring&lt;/h2&gt;
&lt;p&gt;A brief exploration of the data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
comment
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
history
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
packages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3612
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2345
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
434
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
70
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Many packages have either a comment or a history, but relatively few have both.
I’m not sure when each of them is used, as I would expect all packages with a history to also have a comment.&lt;/p&gt;
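&lt;p&gt;A sketch of how such a cross-tabulation could be computed, assuming the file has been read with &lt;code&gt;read.dcf&lt;/code&gt; into a matrix (the name &lt;code&gt;pkgs&lt;/code&gt; is hypothetical):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pkgs &amp;lt;- read.dcf(url(&amp;quot;https://cran.r-project.org/src/contrib/PACKAGES.in&amp;quot;),
                 fields = c(&amp;quot;X-CRAN-Comment&amp;quot;, &amp;quot;X-CRAN-History&amp;quot;))
# Cross-tabulate which packages have a comment, a history, both or neither.
table(comment = !is.na(pkgs[, &amp;quot;X-CRAN-Comment&amp;quot;]),
      history = !is.na(pkgs[, &amp;quot;X-CRAN-History&amp;quot;]))&lt;/code&gt;&lt;/pre&gt;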
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Replaced_by
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
packages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6360
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
101
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Some packages (around a hundred) are simply replaced by another package.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Maintainer
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
packages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6366
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
95
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most of the packages that have a Maintainer field are orphaned/archived.
Does it mean that all the others are not orphaned?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-reasons&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting reasons&lt;/h2&gt;
&lt;p&gt;Now that the data is in an R structure, we can extract the relevant information: the date, the type of action and the reason for each archival event.
I use &lt;code&gt;strcapture&lt;/code&gt; for this task, with a regex that extracts the action, the date and the explanation it might have.&lt;/p&gt;
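&lt;p&gt;A sketch of the idea (this regex is simplified and only handles the common “Action on date as reason” pattern, not every variation in the file):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;events &amp;lt;- c(&amp;quot;Archived on 2021-03-01 as check problems were not corrected.&amp;quot;,
            &amp;quot;Unarchived on 2021-04-15.&amp;quot;)
# Capture the action, the date and the optional explanation after &amp;quot;as&amp;quot;.
strcapture(&amp;quot;^([A-Za-z]+) on ([0-9]{4}-[0-9]{2}-[0-9]{2})(?: as (.*?))?\\.?$&amp;quot;,
           events,
           proto = data.frame(action = character(), date = character(),
                              reason = character()),
           perl = TRUE)&lt;/code&gt;&lt;/pre&gt;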
&lt;p&gt;I don’t know how the file is written; probably it is a mix of automated tools and manual editing, so there isn’t a simple way to collect all the information in a structured way.
The structure and the level of detail stored have changed over the years, and some events are missing.
However, the extracted information should be enough for our purposes.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Action
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Events
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
archived
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7096
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
orphaned
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
341
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
removed
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
113
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
renamed
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
replaced
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
unarchived
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2973
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As expected, the most common recorded event is archival, but there are also some orphaned and even some removed packages.
Also note that the number of orphaned packages is greater than the number of packages with a Maintainer field, supporting my theory that the format has changed and that this shouldn’t be taken as an exhaustive and complete analysis of archivals.&lt;/p&gt;
&lt;p&gt;How are they along time?&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/12/07/reasons-cran-archivals/index.en_files/figure-html/plots_df-1.png&#34; width=&#34;864&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Even if there are some events recorded from 2009, it seems that this file has been used more in recent years (the last commit related to &lt;a href=&#34;https://github.com/wch/r-source/blame/trunk/src/library/tools/R/QC.R#L7778&#34;&gt;this was in 2015&lt;/a&gt;).
I know that some old events are not recorded in the file: there are packages currently on CRAN that were archived at some point but have no unarchived action, and the reverse could also happen.
So this doesn’t necessarily mean that more packages are archived from CRAN nowadays, but it is a clear indication that there is now a more accurate record of archived packages in this file.&lt;/p&gt;
&lt;p&gt;Another source of records of archived packages might be &lt;a href=&#34;http://dirk.eddelbuettel.com/cranberries/cran/removed/&#34;&gt;cranberries&lt;/a&gt;. It would be nice to compare this file with the records on the database there.&lt;/p&gt;
&lt;p&gt;Now that most of the package events are collected and we have the reason for each action, we can explore and classify the reasons.
Using some simple regular expressions, I search for key words or sentences.&lt;/p&gt;
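&lt;p&gt;The classification could be sketched like this (the key words and reasons are illustrative, not the exact patterns used):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reasons &amp;lt;- c(&amp;quot;check problems were not corrected despite reminders&amp;quot;,
             &amp;quot;requires archived package XYZ&amp;quot;,
             &amp;quot;email to the maintainer is undeliverable&amp;quot;)
# Flag each reason by searching for key words.
data.frame(not_corrected = grepl(&amp;quot;not corrected&amp;quot;, reasons),
           dependencies  = grepl(&amp;quot;requires|depend&amp;quot;, reasons),
           email         = grepl(&amp;quot;email&amp;quot;, reasons))&lt;/code&gt;&lt;/pre&gt;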
&lt;p&gt;We can look at the most frequent reasons for archiving packages, the patterns I found with more than 100 cases:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/12/07/reasons-cran-archivals/index.en_files/figure-html/reasons_top-1.png&#34; width=&#34;864&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The most frequent reason is that errors or check problems are not corrected, even when there are reminders.&lt;br /&gt;
Next are the packages archived because they depend on other packages no longer on CRAN.&lt;br /&gt;
Some packages are replaced by others, and some maintainers might not want to continue supporting the package when they receive a message from CRAN about fixing an error.&lt;/p&gt;
&lt;p&gt;Policy violations make it into the top 5, but with fewer than 500 events.
Dependency problems are the sixth cause, followed by email errors (bouncing, incorrect address…), and then come sporadic problems about licenses, not fixing the package for new releases of R, authorship disputes or requests from the authors themselves.&lt;/p&gt;
&lt;p&gt;Several of these reasons can occur in the same event; grouping them together we get a similar table:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
package_not_corrected
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
request_maintainer
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
dependencies
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
other
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
events
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4366
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1530
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
767
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
374
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
15
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
13
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Surprisingly, the second most frequent group of archiving actions is due to many different reasons.
This is probably the &lt;a href=&#34;https://en.wikipedia.org/wiki/Pareto_principle&#34;&gt;Pareto principle&lt;/a&gt; in action: they account for around 15% of the archiving events, but their causes are very diverse.&lt;/p&gt;
&lt;p&gt;However, if we look at the packages that were archived (not at the request of their maintainers), most are archived just once:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Events
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
packages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5304
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
594
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
115
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
31
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This suggests that once a package is archived, maintainers do not make the effort to put it back on CRAN, except in a few cases where there are multiple attempts.
To check, we can look at the currently available packages and see how many of the archived ones are still present on CRAN:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
CRAN
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Packages
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Proportion
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
no
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3869
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
64%
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
yes
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2183
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
36%
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Many packages are back on CRAN despite their past archival, but close to 64% are currently not on CRAN.&lt;/p&gt;
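&lt;p&gt;A minimal sketch of how such a check could be done, assuming a hypothetical character vector &lt;code&gt;archived_pkgs&lt;/code&gt; holding the names of the archived packages from the earlier analysis:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# archived_pkgs is assumed to hold the archived package names
on_cran &amp;lt;- rownames(available.packages(repos = &amp;quot;https://cloud.r-project.org&amp;quot;))
table(still_on_CRAN = archived_pkgs %in% on_cran)&lt;/code&gt;&lt;/pre&gt;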
&lt;p&gt;Almost all of those back on CRAN now have no &lt;code&gt;X-CRAN-Comment&lt;/code&gt;, except for a few:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
Package
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
X-CRAN-Comment
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
geiger
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;p&gt;Orphaned and corrected on 2022-05-09.&lt;/p&gt;
Repeated notifications about USE_FC_LEN_T were ignored.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
alphahull
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Versions up to 2.3 have been removed for misrepresentation of authorship.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
udunits2
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Orphaned on 2022-01-06 as installation problems were not corrected.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
bibtex
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Orphaned and corrected on 2020-09-19 as check problems were not corrected in time.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
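&lt;p&gt;As a side note, these comments can be inspected programmatically; a sketch using only the base &lt;code&gt;tools&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The CRAN package database includes the X-CRAN-Comment field
db &amp;lt;- tools::CRAN_package_db()
db[!is.na(db$&amp;quot;X-CRAN-Comment&amp;quot;), c(&amp;quot;Package&amp;quot;, &amp;quot;X-CRAN-Comment&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;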
&lt;p&gt;The CRAN team might have missed these few packages and didn’t move the comments to &lt;code&gt;X-CRAN-History&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There are also some non-archived packages without an &lt;code&gt;X-CRAN-History&lt;/code&gt; field, but those usually have other fields changed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;discussion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Discussion&lt;/h1&gt;
&lt;p&gt;Most packages archived on CRAN are archived because the maintainers did not correct errors found by the CRAN checks.
It is clear that CRAN’s checks help packages maintain a high quality, but they have a high cost for the maintainers and especially for the CRAN team.
Maintainers don’t seem to have enough time to fix the issues in time, while the CRAN team sends personalized reminders to maintainers and sometimes even patches for the packages.&lt;/p&gt;
&lt;p&gt;Although having packages corrected and issue-free is the common goal, in light of these results there are a few options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Be more restrictive&lt;/p&gt;
&lt;p&gt;Prevent a package from being accepted if it breaks dependencies, or archive packages as soon as they fail checks.
This would make it harder to keep packages on CRAN but would lift some pressure from the CRAN team.
It would also go against the trend in other languages’ repositories, which often don’t check the packages/modules at all and have even fewer restrictions on dependencies (so it might be an unpopular decision).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Be more permissive:&lt;/p&gt;
&lt;p&gt;One option would be to allow maintainers more time to fix issues. I haven’t found any report of how long it takes a package to go from an error to a fix on CRAN, but it is often quite long.
I have seen packages with a warning for months, if not years, without being archived from CRAN.&lt;/p&gt;
&lt;p&gt;Another possibility would be to warn users at installation time when a package or one of its dependencies does not pass all CRAN checks cleanly (without errors or warnings).
This might help make users more conscious of their dependencies, but it could also add pressure on maintainers who already don’t have enough time to fix their packages’ problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provide more help or tools to maintainers&lt;/p&gt;
&lt;p&gt;Another option is to provide a mechanism for maintainers to receive help fixing their packages.
Currently CRAN requires maintainers whose updates break dependencies to give other maintainers enough advance notice to fix their packages.
On the &lt;a href=&#34;https://stat.ethz.ch/mailman/listinfo/r-package-devel&#34;&gt;R-pkg-devel mailing list&lt;/a&gt; there are often requests for help with submissions and with errors detected by CRAN checks, which frequently result in other maintainers sharing their solutions to the same problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The high percentage of packages that, once archived, do not come back to CRAN might be a good place to start helping maintainers, and an opportunity for users to step in and help the maintainers of packages they have been using.
Is something else needed? How would that work?&lt;/p&gt;
&lt;p&gt;At the same time, it is admirable that after so many years there are so few errors in the data.
Still, the archival process might be a good candidate for automation: providing the reason on the webpage, adding it to X-CRAN-Comment and moving the comments to X-CRAN-History once the package is unarchived.
Knowing more about how the CRAN team performs these actions, and how the community could help with the process, would benefit everyone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This post was updated on 2022/01/02 to improve the parsing of actions and dates of packages. As a result, the first plot now includes unarchived packages, which slightly modified the second plot of reasons why packages are archived. Overall this only affected the numbers in the plots, not the conclusions or discussion.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.0 (2022-04-22)
##  os       Ubuntu 20.04.4 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2022-05-09
##  pandoc   2.17.1.1 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
##  blogdown       1.9     2022-03-28 [1] CRAN (R 4.2.0)
##  bookdown       0.26    2022-04-15 [1] CRAN (R 4.2.0)
##  bslib          0.3.1   2021-10-06 [1] CRAN (R 4.2.0)
##  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
##  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
##  ComplexUpset * 1.3.3   2021-12-11 [1] CRAN (R 4.2.0)
##  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
##  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
##  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
##  dplyr        * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
##  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.0)
##  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
##  farver         2.1.0   2021-02-28 [1] CRAN (R 4.2.0)
##  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
##  generics       0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
##  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
##  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
##  highr          0.9     2021-04-16 [1] CRAN (R 4.2.0)
##  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
##  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.2.0)
##  jsonlite       1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
##  knitr          1.39    2022-04-26 [1] CRAN (R 4.2.0)
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
##  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
##  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
##  patchwork      1.1.1   2020-12-17 [1] CRAN (R 4.2.0)
##  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
##  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
##  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
##  rmarkdown      2.14    2022-04-25 [1] CRAN (R 4.2.0)
##  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.2.0)
##  sass           0.4.1   2022-03-23 [1] CRAN (R 4.2.0)
##  scales         1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
##  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
##  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
##  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
##  tibble         3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
##  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
##  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
##  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
##  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
##  xfun           0.30    2022-03-02 [1] CRAN (R 4.2.0)
##  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
## 
##  [1] /home/lluis/bin/R/4.2.0/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Packages submission and reviews; how does it work?</title>
      <link>https://llrs.dev/talk/user-2021/</link>
      <pubDate>Sat, 16 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/talk/user-2021/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/talk/user-2021/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;The abstract I submitted for acceptance was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We benefit from others’ work on R and also by shared packages and for our programming tasks. Occasionally we might generate some piece of software that we want to share with the community. Usually sharing our work with the R community means submitting a package to an archive (CRAN, Bioconductor or others). While each individual archive has some rules they share some common principles.&lt;/p&gt;
&lt;p&gt;If your package follows the rules of the submission process and meets their quality standards, it will be included. All submissions share some common stages: first, an initial screening; second, a deeper manual review of the code. Then, if the suggestions are applied or correctly answered, the package is included in the archive.&lt;/p&gt;
&lt;p&gt;At each step, rules and criteria are used to decide whether the package moves forward. Understanding what these rules say, along with common problems and comments from reviewers, helps avoid submitting a package only to get it rejected, reducing the friction of sharing our work, providing useful packages to the community and minimizing reviewers’ time and effort.&lt;/p&gt;
&lt;p&gt;Looking at the review process of three archives of R packages, CRAN, Bioconductor and rOpenSci, I’ll explain common rules, patterns, timelines and checks required to get the package included, as well as personal anecdotes with them. The talk is based on the post analyzing reviews available here: &lt;a href=&#34;https://llrs.dev/tags/reviews/&#34; class=&#34;uri&#34;&gt;https://llrs.dev/tags/reviews/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I received excellent feedback from the reviewers and was given a full talk (I had asked for a poster because I was nervous about presenting to a big audience).&lt;/p&gt;
&lt;p&gt;This talk also received one of the Accessibility Awards.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>CRAN review</title>
      <link>https://llrs.dev/post/2021/01/31/cran-review/</link>
      <pubDate>Sun, 31 Jan 2021 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2021/01/31/cran-review/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’ve been doing some &lt;a href=&#34;https://llrs.dev/tags/reviews/&#34;&gt;analysis of the review submissions&lt;/a&gt; of several R projects.
However, until recently I couldn’t analyze CRAN submissions.
The &lt;a href=&#34;https://github.com/lockedata/cransays&#34;&gt;cransays&lt;/a&gt; package tracks package submissions, and its online documentation provides a &lt;a href=&#34;https://lockedata.github.io/cransays/articles/dashboard.html&#34;&gt;dashboard&lt;/a&gt; updated every hour.
Since 2020/09/12 the status of the submission queues and folders has been saved on a branch.
Using this information, and building on a &lt;a href=&#34;https://github.com/tjtnew/newbies&#34;&gt;script provided by Tim Taylor&lt;/a&gt;, I’ll check how CRAN submissions are handled.&lt;/p&gt;
&lt;p&gt;I’ll look at the &lt;a href=&#34;#cran-load&#34;&gt;CRAN queue&lt;/a&gt;, explore some &lt;a href=&#34;#time-patterns&#34;&gt;time patterns&lt;/a&gt; and check the meaning of the &lt;a href=&#34;#subfolder&#34;&gt;subfolders&lt;/a&gt;.
Then I’ll move on to more &lt;a href=&#34;#information-for-submitters&#34;&gt;practical information&lt;/a&gt; for people submitting a package.
Lastly, we’ll see how hard the CRAN team’s job is by looking at the reliability of the &lt;a href=&#34;#GHAR&#34;&gt;GitHub action used&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before all this, some preliminary work is needed to download and clean the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Downloading the cransays repository branch history
download.file(&amp;quot;https://github.com/lockedata/cransays/archive/history.zip&amp;quot;, 
              destfile = &amp;quot;static/cransays-history.zip&amp;quot;)
path_zip &amp;lt;- here::here(&amp;quot;static&amp;quot;, &amp;quot;cransays-history.zip&amp;quot;) 
# We unzip the files to read them
dat &amp;lt;- unzip(path_zip, exdir = &amp;quot;static&amp;quot;)
csv &amp;lt;- dat[grepl(&amp;quot;*.csv$&amp;quot;, x = dat)]
f &amp;lt;- lapply(csv, read.csv)
m &amp;lt;- function(x, y) {
  merge(x, y, sort = FALSE, all = TRUE)
}
updates &amp;lt;- Reduce(m, f) # Merge all files (Because the file format changed)
write.csv(updates, file = &amp;quot;static/cran_till_now.csv&amp;quot;,  row.names = FALSE)
# Clean up
unlink(&amp;quot;static/cransays-history/&amp;quot;, recursive = TRUE)
unlink(&amp;quot;static/cransays-history.zip&amp;quot;, recursive = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have the data we can load it, and we load the libraries used for the analysis:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tidyverse&amp;quot;)
library(&amp;quot;lubridate&amp;quot;)
library(&amp;quot;hms&amp;quot;)
path_file &amp;lt;- here::here(&amp;quot;static&amp;quot;, &amp;quot;cran_till_now.csv&amp;quot;)
cran_submissions &amp;lt;- read.csv(path_file)
theme_set(theme_minimal()) # For plotting
col_names &amp;lt;- c(&amp;quot;package&amp;quot;, &amp;quot;version&amp;quot;, &amp;quot;snapshot_time&amp;quot;, &amp;quot;folder&amp;quot;, &amp;quot;subfolder&amp;quot;)
cran_submissions &amp;lt;- cran_submissions[, col_names]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The period we are going to analyze runs from the beginning of the records until 2021-01-30.
It includes some well-earned holiday time for the CRAN team, during which submissions were not possible.&lt;/p&gt;
&lt;p&gt;I’ve read some comments about inconsistencies in where the CRAN team’s holidays are reported, and I couldn’t find them for previous years.&lt;/p&gt;
&lt;p&gt;For the 4 months we are analyzing, which include only one holiday period, I used a &lt;a href=&#34;https://twitter.com/krlmlr/status/1346005787668336640&#34;&gt;screenshot&lt;/a&gt; found on Twitter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;holidays &amp;lt;- data.frame(
  start = as.POSIXct(&amp;quot;18/12/2020&amp;quot;, format = &amp;quot;%d/%m/%Y&amp;quot;, tz = &amp;quot;UTC&amp;quot;), 
  end = as.POSIXct(&amp;quot;04/01/2021&amp;quot;, format = &amp;quot;%d/%m/%Y&amp;quot;, tz = &amp;quot;UTC&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the holidays in a single data.frame, it’s time to explore and clean the collected data:&lt;/p&gt;
&lt;div id=&#34;cleaning-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Cleaning the data&lt;/h2&gt;
&lt;p&gt;After merging everything into one big file we can load and work with it.
First step: check the data and prepare it for what we want:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use appropriate class
cran_submissions$snapshot_time &amp;lt;- as.POSIXct(cran_submissions$snapshot_time,
                                             tz = &amp;quot;UTC&amp;quot;)
# Fix subfolders structure
cran_submissions$subfolder[cran_submissions$subfolder %in% c(&amp;quot;&amp;quot;, &amp;quot;/&amp;quot;)] &amp;lt;- NA
# Remove files or submissions without version number
cran_submissions &amp;lt;- cran_submissions[!is.na(cran_submissions$version), ]
cran_submissions &amp;lt;- distinct(cran_submissions, 
                             snapshot_time, folder, package, version, subfolder,
                             .keep_all = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After loading, we do a preliminary cleanup: set the date format, homogenize the folder format, remove submissions that are not packages (yes, there are PDFs and other files in the queue) and remove duplicates. Then we can start.&lt;/p&gt;
&lt;p&gt;As always, we start with some checks of the data.
Note: I should follow this advice more often myself, as this is the last section I wrote for this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;packages_multiple_versions &amp;lt;- cran_submissions %&amp;gt;% 
  group_by(package, snapshot_time) %&amp;gt;% 
  summarize(n = n_distinct(version)) %&amp;gt;% 
  filter(n != 1) %&amp;gt;% 
  distinct(package) %&amp;gt;% 
  pull(package)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 92 packages with multiple versions in the CRAN queue at the same time.&lt;/p&gt;
&lt;p&gt;Perhaps this is because packages are left in several folders (2 or even 3) at the same time:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_multiple &amp;lt;- cran_submissions %&amp;gt;% 
  group_by(snapshot_time, package) %&amp;gt;% 
  count() %&amp;gt;% 
  group_by(snapshot_time) %&amp;gt;% 
  count(n) %&amp;gt;% 
  filter(n != 1) %&amp;gt;% 
  summarise(n = sum(nn)) %&amp;gt;% 
  ungroup()
ggplot(package_multiple) +
  geom_point(aes(snapshot_time, n), size = 1) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 6),
            alpha = 0.25, fill = &amp;quot;red&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 3.5, label = &amp;quot;CRAN holidays&amp;quot;) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  labs(title = &amp;quot;Packages in multiple folders and subfolders&amp;quot;, 
       x = element_blank(), y = element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/package-multiple-folders-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This happens in 1915 of the 3260 snapshots, probably due to the manual labor of the CRAN reviews.
I don’t really know the cause: it could be an error in the script recording the data, or data being copied around the server.
But perhaps it indicates that further improvements and automation of the process are possible.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_submissions &amp;lt;- cran_submissions %&amp;gt;% 
  arrange(package, snapshot_time, version, folder) %&amp;gt;% 
  group_by(package, snapshot_time) %&amp;gt;% 
  mutate(n = 1:n()) %&amp;gt;% 
  filter(n == n()) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  select(-n)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have now removed ~3500 records of packages that had two versions in the queue.
Next we check packages in multiple folders but with the same version, and remove duplicates until we are left with a single record (assuming the review process has no parallel steps):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_submissions &amp;lt;- cran_submissions %&amp;gt;% 
  arrange(package, snapshot_time, folder) %&amp;gt;% 
  group_by(package, snapshot_time) %&amp;gt;% 
  mutate(n = 1:n()) %&amp;gt;% 
  filter(n == n()) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  select(-n)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Last, we add the number of submissions in this period for each package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;diff0 &amp;lt;- structure(0, class = &amp;quot;difftime&amp;quot;, units = &amp;quot;hours&amp;quot;)
cran_submissions &amp;lt;- cran_submissions %&amp;gt;% 
  arrange(package, version, snapshot_time) %&amp;gt;% 
  group_by(package) %&amp;gt;% 
  # Packages last seen in queue less than 24 ago are considered same submission
  mutate(diff_time = difftime(snapshot_time,  lag(snapshot_time), units = &amp;quot;hour&amp;quot;),
         diff_time = if_else(is.na(diff_time), diff0, diff_time), # Fill NAs
         diff_v = version != lag(version),
         diff_v = ifelse(is.na(diff_v), TRUE, diff_v), # Fill NAs
         new_version = !near(diff_time, 1, tol = 24) &amp;amp; diff_v, 
         new_version = if_else(new_version == FALSE &amp;amp; diff_time == 0, 
                               TRUE, new_version),
         submission_n = cumsum(as.numeric(new_version))) %&amp;gt;%
  ungroup() %&amp;gt;% 
  select(-diff_time, -diff_v, -new_version)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because a release is sometimes followed quickly by an update fixing bugs in the newly introduced features, a package not seen in the queue for 24 hours is considered a new submission.
A version change also counts as a new submission, but not when the version changes while addressing feedback from the reviewers.&lt;/p&gt;
&lt;p&gt;Now we have the data ready for further analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cran-load&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;CRAN load&lt;/h2&gt;
&lt;p&gt;We all know that CRAN is busy with updates that fix bugs or improve packages, and with requests to include new packages in the repository.&lt;/p&gt;
&lt;p&gt;A first plot shows the number of distinct packages in the queue at each moment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_queue &amp;lt;- cran_submissions %&amp;gt;% 
  group_by(snapshot_time) %&amp;gt;% 
  summarize(n = n_distinct(package))
ggplot(cran_queue) +
  geom_rect(aes(xmin = start, xmax = end, ymin = 0, ymax = 230),
            alpha = 0.5, fill = &amp;quot;red&amp;quot;, data = holidays) +
  annotate(&amp;quot;text&amp;quot;, x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 150, label = &amp;quot;CRAN holidays&amp;quot;) +
  geom_path(aes(snapshot_time, n)) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  labs(x = element_blank(), y = element_blank(), 
       title = &amp;quot;Packages on CRAN review process&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-queues-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see some ups and downs, with the queue ranging between 50 and 200 packages.&lt;/p&gt;
&lt;p&gt;There are some instances where the number of packages in the queue drops suddenly and then recovers to previous levels.
As far as I know, this is a visual artifact.&lt;/p&gt;
&lt;p&gt;We can also see that people do not tend to rush to push their packages before the holidays.
But clearly there is some build-up of submissions afterwards, as the highest number of packages in the queue is reached after the holidays.&lt;/p&gt;
&lt;p&gt;Classifying packages into folders seems to be part of the CRAN review process:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;man_colors &amp;lt;- RColorBrewer::brewer.pal(8, &amp;quot;Dark2&amp;quot;)
names(man_colors) &amp;lt;- unique(cran_submissions$folder)
cran_submissions %&amp;gt;% 
  group_by(folder, snapshot_time) %&amp;gt;% 
  summarize(packages = n_distinct(package)) %&amp;gt;% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 200),
            alpha = 0.25, fill = &amp;quot;red&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 105, label = &amp;quot;CRAN holidays&amp;quot;) +
  geom_path(aes(snapshot_time, packages, col = folder)) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  scale_color_manual(values = man_colors) +
  labs(x = element_blank(), y = element_blank(),
       title = &amp;quot;Packages by folder&amp;quot;, col = &amp;quot;Folder&amp;quot;) +
  theme(legend.position = c(0.6, 0.7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-submissions-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The queue trend is mostly driven by the newbies folder (which ranges between 25 and 150) and, after the holidays, by the pretest folder.&lt;/p&gt;
&lt;p&gt;Surprisingly, when the queue is split by folder we don’t see those sudden drops.
This might indicate that there is a clean-up of some of the folders&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.
What we clearly see is a clean-up of all the folders during the holidays, when almost everything was cleared.&lt;/p&gt;
&lt;p&gt;Also, the pretest peak comes before the rise of the newbies folder, so it seems these tests are run only on new packages.&lt;/p&gt;
The other folders do not have such an increase
&lt;details&gt;
&lt;summary&gt;
after the holidays.
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_submissions %&amp;gt;% 
  group_by(folder, snapshot_time) %&amp;gt;% 
  summarize(packages = n_distinct(package)) %&amp;gt;% 
  filter(snapshot_time &amp;gt;= holidays$start) %&amp;gt;% 
  ggplot() +
  geom_path(aes(snapshot_time, packages, col = folder)) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 200),
            alpha = 0.25, fill = &amp;quot;red&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 105, label = &amp;quot;CRAN holidays&amp;quot;) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;1 day&amp;quot;, 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion(), limits = c(0, NA)) +
  scale_color_manual(values = man_colors) +
  labs(x = element_blank(), y = element_blank(),
       title = &amp;quot;Holidays&amp;quot;, col = &amp;quot;Folder&amp;quot;) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = c(0.8, 0.7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-holidays-zoom-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It seems that on the 31st there was a clean-up of some packages on the waiting list.
We can also see the increase in submissions during the first week of January, as described previously.&lt;/p&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;div id=&#34;time-patterns&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Time patterns&lt;/h2&gt;
&lt;p&gt;Some people have expressed that they try to submit to CRAN when there are few packages in the queue.
Thus, looking at when these low moments happen could be relevant.
We can look for patterns in the queue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#day-month&#34;&gt;Day of the month&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#day-week&#34;&gt;Day of the week&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note: I have little to no experience with time series, so the following plots just use the defaults of `geom_smooth`, omitting the holidays.&lt;/p&gt;
&lt;div id=&#34;day-month&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;By day of the month&lt;/h3&gt;
&lt;p&gt;Looking at each folder:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_times &amp;lt;- cran_submissions %&amp;gt;% 
  mutate(seconds = seconds(snapshot_time),
         month = month(snapshot_time),
         mday = mday(snapshot_time),
         wday = wday(snapshot_time, locale = &amp;quot;en_GB.UTF-8&amp;quot;),
         week = week(snapshot_time),
         date = as_date(snapshot_time))
cran_times %&amp;gt;% 
  arrange(folder, date, mday) %&amp;gt;% 
  filter(snapshot_time &amp;lt; holidays$start | snapshot_time  &amp;gt; holidays$end) %&amp;gt;% 
  group_by(folder, date, mday) %&amp;gt;% 
  summarize(packages = n_distinct(package),
            week = unique(week)) %&amp;gt;% 
  group_by(folder, mday) %&amp;gt;% 
  ggplot() +
  geom_smooth(aes(mday, packages, col = folder)) +
  labs(x = &amp;quot;Day of the month&amp;quot;, y = &amp;quot;Packages&amp;quot;, col = &amp;quot;Folder&amp;quot;,
       title = &amp;quot;Evolution by month day&amp;quot;) +
  scale_color_manual(values = man_colors) +
  coord_cartesian(ylim = c(0, NA), xlim = c(1, NA)) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion()) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-monthly-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;At the beginning and end of the month there is more variation in several folders (this could also be because there is no information for the end of December and the beginning of January).
There seems to be an increase in &lt;strong&gt;new package submissions towards the beginning of the month&lt;/strong&gt; and later an increase in the newbies folder by the middle of the month.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;day-week&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;By day of the week&lt;/h3&gt;
&lt;p&gt;I first thought about this because I was curious whether there are more submissions on weekends (when aficionados and open-source developers might have more time) than during the rest of the week.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_times %&amp;gt;% 
  filter(snapshot_time &amp;lt; holidays$start | snapshot_time  &amp;gt; holidays$end) %&amp;gt;% 
  group_by(folder, date, wday) %&amp;gt;% 
  summarize(packages = n_distinct(package),
            week = unique(week)) %&amp;gt;% 
  group_by(folder, wday) %&amp;gt;% 
  ggplot() +
  geom_smooth(aes(wday, packages, col = folder)) +
  labs(x = &amp;quot;Day of the week&amp;quot;, y = &amp;quot;Packages&amp;quot;, col = &amp;quot;Folder&amp;quot;,
       title = &amp;quot;Evolution by week day&amp;quot;) +
  scale_color_manual(values = man_colors) +
  scale_x_continuous(breaks = 1:7, expand = expansion()) +
  scale_y_continuous(expand = expansion(), limits = c(0, NA))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-wday-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see a &lt;strong&gt;rise towards the middle of the week&lt;/strong&gt; of packages in the pretest folder, indicating new package submissions.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-folders&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other folders&lt;/h3&gt;
&lt;p&gt;There are some folders that seem to belong to &lt;a href=&#34;https://www.r-project.org/contributors.html&#34;&gt;R Contributors&lt;/a&gt;.
We see that some packages pass through these folders:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_members &amp;lt;- c(&amp;quot;LH&amp;quot;, &amp;quot;GS&amp;quot;, &amp;quot;JH&amp;quot;)
cran_times %&amp;gt;% 
  filter(subfolder %in% cran_members) %&amp;gt;% 
  group_by(subfolder, snapshot_time) %&amp;gt;% 
  summarize(packages = n_distinct(package)) %&amp;gt;% 
  ggplot() +
  geom_smooth(aes(snapshot_time, packages, col = subfolder)) +
    labs(x = element_blank(), y = element_blank(), col = &amp;quot;Folder&amp;quot;,
       title = &amp;quot;Packages on folders&amp;quot;) +
  scale_y_continuous(expand = expansion(), breaks = 0:10) +
  coord_cartesian(y = c(0, NA))  +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
               expand = expansion(add = 2)) +
  theme(legend.position = c(0.1, 0.8))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/subfolder-pattern-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There doesn’t seem to be any rule about using those folders, or the work was so quick that the hourly-updated data didn’t record it.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Looking for any temporal pattern on those folders isn’t worth it.
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_times %&amp;gt;% 
  filter(subfolder %in% cran_members) %&amp;gt;% 
  group_by(subfolder, mday) %&amp;gt;% 
  summarize(packages = n_distinct(package)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot() +
  geom_smooth(aes(mday, packages, col = subfolder)) +
  labs(x = &amp;quot;Day of the month&amp;quot;, y = &amp;quot;Packages&amp;quot;, col = &amp;quot;Subfolder&amp;quot;,
       title = &amp;quot;Packages on subfolders by day of the month&amp;quot;) +
  scale_y_continuous(expand = expansion()) +
  scale_x_continuous(expand = expansion(), breaks = c(1,7,14,21,29)) +
  coord_cartesian(ylim = c(0, NA))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/subfolder-mday-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Low numbers of packages and great variability (except for those that have just one package in the folder) by day of the month.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_times %&amp;gt;% 
  filter(subfolder %in% cran_members) %&amp;gt;% 
  group_by(subfolder, wday) %&amp;gt;% 
  summarize(packages = n_distinct(package)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot() +
  geom_smooth(aes(wday, packages, col = subfolder)) +
  labs(x = &amp;quot;Day of the week&amp;quot;, y = &amp;quot;Packages&amp;quot;, col = &amp;quot;Subfolder&amp;quot;,
       title = &amp;quot;Evolution by week day&amp;quot;) +
  scale_y_continuous(expand = expansion()) +
  scale_x_continuous(breaks = 1:7, expand = expansion()) +
  coord_cartesian(ylim =  c(0, NA))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/subfolder-wday-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There seem to be only two people usually working with their own folders.
I suppose there isn’t a common set of rules the reviewers follow.&lt;/p&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;information-for-submitters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Information for submitters&lt;/h2&gt;
&lt;p&gt;I’ve read lots of comments recently around CRAN submissions.
However, with the little data available compared to the open reviews of &lt;a href=&#34;https://llrs.dev/2020/07/bioconductor-submissions-reviews/&#34; title=&#34;Analysis of Bioconductor reviews&#34;&gt;Bioconductor&lt;/a&gt; and &lt;a href=&#34;https://llrs.dev/2020/09/ropensci-submissions/&#34; title=&#34;Analysis of rOpenSci reviews&#34;&gt;rOpenSci&lt;/a&gt; it is hard to answer them (see those related posts).
On Bioconductor and rOpenSci it is possible to see the people involved, the messages from the reviewers and other interested parties, the steps taken to be accepted…&lt;/p&gt;
&lt;p&gt;One of the big questions we can provide information about with the data available is how long a package stays in the queue:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;subm &amp;lt;- cran_times %&amp;gt;%
  arrange(snapshot_time) %&amp;gt;% 
  select(package, version, submission_n, snapshot_time) %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  filter(row_number() %in% c(1, last(row_number()))) %&amp;gt;% 
  arrange(package, submission_n)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 429 packages that are only seen once.
These might be abandoned, delayed or rejected submissions; others might be acceptances in less than an hour&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If we look at the package submissions by date we can see the quick increase of packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rsubm &amp;lt;- subm %&amp;gt;% 
  filter(n_distinct(snapshot_time) %% 2 == 0) %&amp;gt;%
  select(-version) %&amp;gt;% 
  mutate(time = c(&amp;quot;start&amp;quot;, &amp;quot;end&amp;quot;)) %&amp;gt;% 
  pivot_wider(values_from = snapshot_time, names_from = time) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  mutate(r = row_number(), 
         time  =  round(difftime(end, start, units = &amp;quot;hour&amp;quot;), 0)) %&amp;gt;% 
  ungroup()
lv &amp;lt;- levels(fct_reorder(rsubm$package, rsubm$start, .fun = min, .desc = FALSE))
ggplot(rsubm) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = first(lv), ymax = last(lv), alpha = 0.5, fill = &amp;quot;red&amp;quot;) +
  geom_linerange(aes(y = fct_reorder(package, start, .fun = min, .desc = FALSE),
                      x = start, xmin = start, xmax = end, 
                     col = as.factor(submission_n))) + 
  labs(x = element_blank(), y = element_blank(), title = 
         &amp;quot;Packages on the queue&amp;quot;, col = &amp;quot;Submissions&amp;quot;) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
                   expand = expansion(add = 2)) +
  scale_colour_viridis_d() +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = c(0.15, 0.7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/resubm-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Some packages were submitted more than 5 times in this period. Recall the definition of submission used: a package with a different version number after 24 hours, or one that wasn’t seen in the queue for the last 24 hours (even with the same version number).&lt;/p&gt;
&lt;p&gt;Some authors change the version number when CRAN reviewers require changes before accepting the package, while others do not and only change the version number according to their release cycle.&lt;/p&gt;
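&lt;p&gt;As a rough illustration, the submission heuristic just described could be sketched like this (a hypothetical sketch of mine, not the exact code of the analysis; it assumes a data frame with `package`, `version` and `snapshot_time` columns):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;, warn.conflicts = FALSE)
# Flag a new submission when the version changes or when the package
# was absent from the queue for more than 24 hours.
flag_submissions &amp;lt;- function(history) {
  history %&amp;gt;%
    arrange(package, snapshot_time) %&amp;gt;%
    group_by(package) %&amp;gt;%
    mutate(gap = difftime(snapshot_time, lag(snapshot_time), units = &amp;quot;hours&amp;quot;),
           new_submission = is.na(gap) | version != lag(version) | gap &amp;gt; 24,
           submission_n = cumsum(new_submission)) %&amp;gt;%
    ungroup()
}&lt;/code&gt;&lt;/pre&gt;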
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rsubm %&amp;gt;% 
  arrange(start) %&amp;gt;% 
  filter(start &amp;lt; holidays$start, # Look only before the holidays
    submission_n == 1,# Use only the first submission
    start &amp;gt; min(start)) %&amp;gt;%   # Just new submissions
  mutate(r = row_number(),
         start1 = as.numeric(seconds(start))) %&amp;gt;% 
  lm(start1 ~ r, data = .) %&amp;gt;% 
  broom::tidy() %&amp;gt;%  
  mutate(estimate = estimate/(60*60)) # Hours
## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1 (Intercept) 444325.     6321.     253047.       0
## 2 r                1.03      4.76      779.       0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More or less, there is a &lt;strong&gt;new package submission every hour&lt;/strong&gt; on CRAN.
Despite this submission rate, we can see that most submissions stay in the queue only a short time:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;patchwork&amp;quot;)
p1 &amp;lt;- rsubm %&amp;gt;% 
  group_by(package) %&amp;gt;% 
  summarize(time = sum(time)) %&amp;gt;% 
  ggplot() +
  geom_histogram(aes(time), bins = 100) +
  labs(title = &amp;quot;Packages total time on queue&amp;quot;, x = &amp;quot;Hours&amp;quot;, 
       y = element_blank()) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion())
p2 &amp;lt;- rsubm %&amp;gt;% 
  group_by(package) %&amp;gt;% 
  summarize(time = sum(time)) %&amp;gt;% 
  ggplot() +
  geom_histogram(aes(time), binwidth = 24) +
  coord_cartesian(xlim = c(0, 24*7)) +
  labs(subtitle = &amp;quot;Zoom&amp;quot;, x = &amp;quot;Hours&amp;quot;, y = element_blank()) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 24*7, by = 24)) +
  scale_y_continuous(expand = expansion()) +
  theme(panel.background = element_rect(colour = &amp;quot;white&amp;quot;))
p1 + inset_element(p2, 0.2, 0.2, 1, 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/package-time-queue-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The accuracy of this data is not great:
I found some packages that remained in the submission queue, and were thus picked up by cransays, even after acceptance, so these times might be a bit overestimated.
Also, packages with a speedy submission that lasted less than an hour weren’t included.&lt;/p&gt;
&lt;p&gt;Looking at the recorded submissions might be more accurate:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1 &amp;lt;- rsubm %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  summarize(time = sum(time)) %&amp;gt;% 
  ggplot() +
  geom_histogram(aes(time), bins = 100) +
  labs(title = &amp;quot;Submission time on queue&amp;quot;, x = &amp;quot;Hours&amp;quot;, 
       y = element_blank()) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion())
p2 &amp;lt;- rsubm %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  summarize(time = sum(time)) %&amp;gt;%  
  ggplot() +
  geom_histogram(aes(time), binwidth = 24) +
  coord_cartesian(xlim = c(0, 24*7)) +
  labs(subtitle = &amp;quot;Zoom&amp;quot;, x = &amp;quot;Hours&amp;quot;, y = element_blank()) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 24*7, by = 24)) +
  scale_y_continuous(expand = expansion()) +
  theme(panel.background = element_rect(colour = &amp;quot;white&amp;quot;))
p1 + inset_element(p2, 0.2, 0.2, 1, 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/submission-queue-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Many submissions are short-lived.
Perhaps this hints that more testing should be done beforehand, that what to expect from the review should be clearer to authors, that they are approved very fast, or…&lt;/p&gt;
&lt;p&gt;If we look at the folders of each submission we’ll see a different picture:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;subm2 &amp;lt;- cran_times %&amp;gt;%
  group_by(package, submission_n, folder) %&amp;gt;% 
  arrange(snapshot_time) %&amp;gt;% 
  select(package, version, submission_n, snapshot_time, folder) %&amp;gt;% 
  filter(row_number() %in% c(1, last(row_number()))) %&amp;gt;% 
  arrange(submission_n)
rsubm2 &amp;lt;- subm2 %&amp;gt;% 
  filter(n_distinct(snapshot_time) %% 2 == 0) %&amp;gt;%
  mutate(time = c(&amp;quot;start&amp;quot;, &amp;quot;end&amp;quot;)) %&amp;gt;% 
  pivot_wider(values_from = snapshot_time, names_from = time) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  mutate(r = row_number(), 
         time  =  round(difftime(end, start, units = &amp;quot;hour&amp;quot;), 0)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  filter(!is.na(start), !is.na(end))
lv &amp;lt;- levels(fct_reorder(rsubm2$package, rsubm2$start, .fun = min, .desc = FALSE))
ggplot(rsubm2) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = first(lv), ymax = last(lv), alpha = 0.5, fill = &amp;quot;red&amp;quot;) +
  geom_linerange(aes(y = fct_reorder(package, start, .fun = min, .desc = FALSE),
                      x = start, xmin = start, xmax = end, col = folder)) + 
  labs(x = element_blank(), y = element_blank(), title = 
         &amp;quot;Packages on the queue&amp;quot;) +
  scale_color_manual(values = man_colors) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
               expand = expansion(add = 2)) +
  labs(col = &amp;quot;Folder&amp;quot;) +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = c(0.2, 0.7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/resubm2-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It looks like some packages take a long time to change folder; perhaps the maintainers have trouble fixing the issues pointed out by the reviewers, or don’t have time to deal with them.
Some packages are recorded in just one folder and others go through multiple folders:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rsubm2 %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  summarize(n_folder = n_distinct(folder)) %&amp;gt;% 
  ggplot() + 
  geom_histogram(aes(n_folder), bins = 5) +
  labs(title = &amp;quot;Folders by submission&amp;quot;, x = element_blank(), 
       y = element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/submissions-n-folders-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Most submissions end up in one folder, but some pass through up to 5 folders.&lt;/p&gt;
&lt;p&gt;Let’s see the 5 most common folder sequences of submissions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;compact_folders &amp;lt;- function(x) {
  y &amp;lt;- x != lag(x)
  y[1] &amp;lt;- TRUE
  x[y]
}
cran_times %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  summarize (folder = list(compact_folders(folder))) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  count(folder, sort = TRUE) %&amp;gt;% 
  top_n(5) %&amp;gt;% 
  rename(Folders = folder, Frequency = n) %&amp;gt;% 
  as.data.frame()
##            Folders Frequency
## 1          pretest      1433
## 2 pretest, inspect       422
## 3          inspect       301
## 4 pretest, newbies       279
## 5          newbies       245&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, pretest and newbies are among the most frequent folders.&lt;/p&gt;
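&lt;p&gt;For illustration, `compact_folders` (defined above) drops consecutive repeats but keeps returns to a previous folder; this toy input is made up, not real queue data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;, warn.conflicts = FALSE) # for lag()
compact_folders(c(&amp;quot;pretest&amp;quot;, &amp;quot;pretest&amp;quot;, &amp;quot;inspect&amp;quot;, &amp;quot;inspect&amp;quot;, &amp;quot;pretest&amp;quot;))
## [1] &amp;quot;pretest&amp;quot; &amp;quot;inspect&amp;quot; &amp;quot;pretest&amp;quot;&lt;/code&gt;&lt;/pre&gt;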
&lt;p&gt;Another way of seeing whether it is the right moment to submit your package, aside from how many packages are in the queue, is looking at how much activity there is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;subm3 &amp;lt;- cran_times %&amp;gt;%
  arrange(snapshot_time) %&amp;gt;% 
  group_by(package) %&amp;gt;% 
  mutate(author_change = submission_n != lag(submission_n),
         cran_change = folder != lag(folder)) %&amp;gt;% 
  mutate(author_change = ifelse(is.na(author_change), TRUE, author_change),
         cran_change = ifelse(is.na(cran_change), FALSE, cran_change)) %&amp;gt;% 
  mutate(cran_change = case_when(subfolder != lag(subfolder) ~ TRUE,
                                 TRUE ~ cran_change)) %&amp;gt;% 
  ungroup()
subm3 %&amp;gt;% 
  group_by(snapshot_time) %&amp;gt;% 
  summarize(author_change = sum(author_change), cran_change = sum(cran_change)) %&amp;gt;% 
  filter(row_number() != 1) %&amp;gt;% 
  filter(author_change != 0 | cran_change != 0) %&amp;gt;% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = -26, ymax = 26, alpha = 0.5, fill = &amp;quot;grey&amp;quot;) +
  geom_point(aes(snapshot_time, author_change), fill = &amp;quot;blue&amp;quot;, size = 0) +
  geom_area(aes(snapshot_time, author_change), fill = &amp;quot;blue&amp;quot;) +
  geom_point(aes(snapshot_time, -cran_change), fill = &amp;quot;red&amp;quot;, size = 0) +
  geom_area(aes(snapshot_time, -cran_change), fill = &amp;quot;red&amp;quot;) +
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;, 
                   expand = expansion(add = 2)) +
  scale_y_continuous(expand = expansion(add = c(0, 0))) + 
  coord_cartesian(ylim = c(-26, 26)) +
  annotate(&amp;quot;text&amp;quot;, label = &amp;quot;CRAN&amp;#39;s&amp;quot;, y = 20, x = as_datetime(&amp;quot;2020/11/02&amp;quot;)) +
  annotate(&amp;quot;text&amp;quot;, label = &amp;quot;Maintainers&amp;#39;&amp;quot;, y = -20, x = as_datetime(&amp;quot;2020/11/02&amp;quot;)) +
  labs(y = &amp;quot;Changes&amp;quot;, x = element_blank(), title = &amp;quot;Activity on CRAN:&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/cran-pressure-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this plot we can see that folder changes and submissions are not simultaneous,
but they are quite frequent.&lt;/p&gt;
&lt;div id=&#34;review-process&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Review process&lt;/h3&gt;
&lt;p&gt;There is a &lt;a href=&#34;https://lockedata.github.io/cransays/articles/dashboard.html#cran-review-workflow&#34;&gt;scheme&lt;/a&gt; of how the review process works.
However, it has been pointed out that it needs an update.&lt;/p&gt;
&lt;p&gt;We’ve seen which folders come before which ones, but we haven’t looked at the last folder in which a package appears:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cran_times %&amp;gt;% 
  ungroup() %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  arrange(snapshot_time) %&amp;gt;% 
  filter(1:n() == last(n())) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  count(folder, sort = TRUE) %&amp;gt;% 
  knitr::kable(col.names = c(&amp;quot;Last folder&amp;quot;, &amp;quot;Submissions&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Last folder&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Submissions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;pretest&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1653&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;newbies&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;981&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;inspect&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;890&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;recheck&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;publish&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;469&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;waiting&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;441&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;human&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;332&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;pending&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can see that many submissions were left at the pretest folder, and just a minority in the human or publish folders.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;time-it-takes-to-disappear-from-the-system&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Time it takes to disappear from the system&lt;/h3&gt;
&lt;p&gt;One of the motivations for this post was a &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-package-devel/2020q4/006174.html&#34;&gt;question on R-pkg-devel&lt;/a&gt; about how long it usually takes for a package to be accepted on CRAN.
We can look at how long each submission takes until it is removed from the queue:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_submissions &amp;lt;- cran_times %&amp;gt;% 
  group_by(package, submission_n) %&amp;gt;% 
  summarise(submission_period = difftime(max(snapshot_time), 
                                         min(snapshot_time), 
                                         units = &amp;quot;hour&amp;quot;),
            submission_time = min(snapshot_time)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  filter(submission_period != 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a good approximation of how long it takes a package to be accepted or rejected, but some packages remain in the queue after they are accepted and appear on CRAN.
Joining this data with data from &lt;a href=&#34;https://r-pkg.org/&#34;&gt;metacran&lt;/a&gt; we could find out how often this happens,
but I leave that for the reader or some other post.
Let’s go back to the time spent in the queue:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_submissions %&amp;gt;% 
  # filter(submission_time &amp;lt; holidays$start) %&amp;gt;% 
  ggplot() +
  geom_point(aes(submission_time, submission_period, col = submission_n)) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end),
            ymin = 0, ymax = 3500, alpha = 0.5, fill = &amp;quot;red&amp;quot;) + 
  scale_x_datetime(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;,
                   expand = expansion(add = 10)) +
  scale_y_continuous(expand = expansion(add = 10)) +
  labs(title = &amp;quot;Time on the queue according to the submission&amp;quot;,
       x = &amp;quot;Submission&amp;quot;, y = &amp;quot;Time (hours)&amp;quot;, col = &amp;quot;Submission&amp;quot;) +
  theme(legend.position = c(0.5, 0.8))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/time-time-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The diagonals suggest that the work is done in batches within a day or an afternoon.
The prominent diagonal after the holidays corresponds to packages still in the queue.&lt;/p&gt;
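&lt;p&gt;These accepted-but-still-queued packages are what the metacran cross-check mentioned earlier could identify. A hypothetical starting point (my own sketch; note that `tools::CRAN_package_db()` only lists the currently published version of each package, so older accepted versions wouldn’t match):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;, warn.conflicts = FALSE)
# Queued package versions that match the version currently on CRAN,
# i.e. likely accepted while still appearing in the queue.
cran_db &amp;lt;- tools::CRAN_package_db()[, c(&amp;quot;Package&amp;quot;, &amp;quot;Version&amp;quot;)]
still_queued &amp;lt;- cran_times %&amp;gt;%
  distinct(package, version) %&amp;gt;%
  inner_join(cran_db, by = c(&amp;quot;package&amp;quot; = &amp;quot;Package&amp;quot;, &amp;quot;version&amp;quot; = &amp;quot;Version&amp;quot;))&lt;/code&gt;&lt;/pre&gt;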
&lt;p&gt;If we summarize by day and take the median over all first package submissions, we can see how long a package stays in the queue:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_submissions %&amp;gt;% 
  filter(submission_n == 1) %&amp;gt;% 
  ungroup() %&amp;gt;%
  mutate(d = as.Date(submission_time)) %&amp;gt;%
  group_by(d) %&amp;gt;% 
  summarize(m = median(submission_period)) %&amp;gt;% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = as.Date(start), xmax = as.Date(end)),
            ymin = 0, ymax = 80, alpha = 0.5, fill = &amp;quot;red&amp;quot;) + 
  geom_smooth(aes(d, m)) +
  coord_cartesian(ylim = c(0, NA)) +
  scale_x_date(date_labels = &amp;quot;%Y/%m/%d&amp;quot;, date_breaks = &amp;quot;2 weeks&amp;quot;,
                   expand = expansion(add = 1)) +
  labs(x = element_blank(), y = &amp;quot;Daily median time in queue (hours)&amp;quot;, 
       title = &amp;quot;Submission time&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/daily-submission-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that a new package submitted to CRAN usually takes more than a day to disappear from the queue.&lt;/p&gt;
&lt;p&gt;There is a lot of variation among submissions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_submissions %&amp;gt;% 
  group_by(submission_n) %&amp;gt;% 
  mutate(submission_n = as.character(submission_n)) %&amp;gt;% 
  ggplot() +
  geom_jitter(aes(submission_n, submission_period), height = 0) +
  scale_y_continuous(limits = c(1, NA), expand = expansion(add = c(1, 10)),
                     breaks = seq(0,  4550, by = 24*7)) +
  labs(title = &amp;quot;Submission time in queue&amp;quot;, y = &amp;quot;Hours&amp;quot;, x = element_blank())
## Warning: Removed 142 rows containing missing values (geom_point).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/submission-progression-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Surprisingly, sometimes a submission goes missing from the folders for some days (I checked with one package I submitted: it doesn’t appear for 7 days, although it was in the queue).
This might affect the analysis, as such gaps are counted as new submissions when some of them aren’t.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;package_submissions %&amp;gt;% 
  filter(submission_period != 0) %&amp;gt;% 
  group_by(submission_n) %&amp;gt;% 
  mutate(submission_n = as.character(submission_n)) %&amp;gt;% 
  filter(n() &amp;gt; 5) %&amp;gt;% 
  summarize(median = round(median(submission_period), 2)) %&amp;gt;% 
  knitr::kable(col.names = c(&amp;quot;Submission&amp;quot;, &amp;quot;Median time (h)&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Submission&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Median time (h)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;36.13 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;18.27 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;16.47 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;11.27 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;13.37 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;38.08 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So it usually takes more than a day for new packages, while later submissions usually take around 11 to 18 hours.&lt;/p&gt;
&lt;p&gt;To put into context the work done by the CRAN checking system, which helps keep the quality of packages high, let’s explore another checking system: GitHub Actions.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;GHAR&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;GitHub action reliability&lt;/h2&gt;
&lt;p&gt;The data for this post was collected by cransays using GitHub Actions.
We’ll use this data to test how reliable GitHub Actions is.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gha &amp;lt;- cbind(cran_times[, c(&amp;quot;month&amp;quot;, &amp;quot;mday&amp;quot;, &amp;quot;wday&amp;quot;, &amp;quot;week&amp;quot;)], 
      minute = minute(cran_times$snapshot_time), 
      hour = hour(cran_times$snapshot_time),
      type = &amp;quot;cransays&amp;quot;) %&amp;gt;% 
  distinct()
gha %&amp;gt;% 
  ggplot() +
  geom_violin(aes(as.factor(hour), minute)) +
  scale_y_continuous(expand = expansion(add = 0.5), 
                     breaks = c(0, 15, 30, 45, 60), limits = c(0, 60)) +
  scale_x_discrete(expand = expansion())  +
  labs(x = &amp;quot;Hour&amp;quot;, y = &amp;quot;Minute&amp;quot;, title = &amp;quot;Daily variation&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2021/01/31/cran-review/index_files/figure-html/gha2-1.png&#34; width=&#34;120%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There seems to be a lower limit around 10 minutes, except for some builds that I think were manually triggered.
Aside from this, there is usually little variation: the process ends around 15 minutes in, but it can end much later.
This is just for one simple script scraping a site.
Compared to building and checking thousands of packages, it is much simpler.&lt;/p&gt;
&lt;p&gt;And last, how reliable is it?&lt;/p&gt;
&lt;p&gt;We can compare how many hours passed between the first and the last report with how many reports we have recorded.
If we have fewer reports than hours, this indicates errors on GitHub Actions.&lt;/p&gt;
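&lt;p&gt;A minimal sketch of that comparison, assuming the &lt;code&gt;cran_times&lt;/code&gt; data frame and its &lt;code&gt;snapshot_time&lt;/code&gt; column from the chunks above (the variable names are those used earlier in this post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;snapshots &amp;lt;- unique(cran_times$snapshot_time)
# Hours elapsed between the first and the last recorded snapshot:
hours_span &amp;lt;- as.numeric(difftime(max(snapshots), min(snapshots), units = &amp;quot;hours&amp;quot;))
# One snapshot is expected per hour, so the ratio estimates the reliability:
100 * length(snapshots) / hours_span&lt;/code&gt;&lt;/pre&gt;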
&lt;p&gt;So the script and GitHub Actions worked on ~97% of the occasions.&lt;/p&gt;
&lt;p&gt;These numbers are great, but on CRAN and Bioconductor all packages are consistently checked daily on several operating systems.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Some of the most important points from this post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some packages appear in several folders, and sometimes multiple versions of a package are in the queue at once.&lt;/li&gt;
&lt;li&gt;Most submissions happen in the first days of the week and towards the beginning of the month.&lt;/li&gt;
&lt;li&gt;Most submissions disappear from the CRAN queue in less than a day, but new submissions take around 36 hours.&lt;/li&gt;
&lt;li&gt;There’s a new package submission to CRAN every hour.&lt;/li&gt;
&lt;li&gt;In later submissions, time in the queue is considerably shorter.&lt;/li&gt;
&lt;li&gt;It was impossible to know when there was a reply from CRAN, as no information is provided.&lt;/li&gt;
&lt;li&gt;It is not possible to know when a package has all OK before it hits CRAN, as some packages remain in the queue even after acceptance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Last, I compare CRAN’s review system with other R software review systems, such as Bioconductor’s and rOpenSci’s.&lt;/p&gt;
&lt;p&gt;One big difference between CRAN and Bioconductor or rOpenSci is that, even if your package is already included, each time you want to fix something it gets reviewed by someone.
This ensures a high quality of the packages, but also increases the work for the reviewers.&lt;/p&gt;
&lt;p&gt;Also, as far as I know, the list of reviewers is just 5 people, who are also part of the team maintaining and developing R.
In Bioconductor it is similar (except the reviewers do not take care of R itself), but in rOpenSci it works differently.&lt;/p&gt;
&lt;p&gt;The next big difference is the lack of transparency of the review process itself.
Perhaps this is because CRAN started earlier (1997), while Bioconductor started in 2002 and rOpenSci much later.
With the information available about CRAN, we don’t know the steps to be accepted beyond the pretest and the manual review.
We don’t know when a package is accepted or rejected, or what the content of the feedback to the maintainer is (or when there is feedback and how long the maintainer gets to address those changes).
It is not clear how the process works.
Additionally, the reviewers’ work seems highly manual, as we found some duplicated packages in the queue.&lt;/p&gt;
&lt;p&gt;Further automation and transparency of the process could help reduce the load on the reviewers, as could increasing the number of reviewers.
A public review could help reduce the burden on CRAN reviewers, as outsiders could help solve errors (although this is somewhat already fulfilled by the &lt;a href=&#34;https://www.r-project.org/mail.html&#34;&gt;mailing list&lt;/a&gt; R-package-devel), and it would help notice, and find a compromise on, inconsistencies between reviews.
As anecdotal evidence, I submitted two packages, one shortly after the other; for the second package I was asked to change some URLs that I was not required to change in the first.&lt;/p&gt;
&lt;p&gt;Another difference between these three repositories is the manuals.
The CRAN repository seems to be equated with R itself, so the &lt;a href=&#34;https://cran.r-project.org/doc/manuals/r-release/R-exts.html&#34;&gt;manual for writing R extensions&lt;/a&gt; is under &lt;code&gt;cran.r-project.org&lt;/code&gt;, even though it is about extending R, which can and does happen outside CRAN.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://cran.r-project.org/web/packages/policies.html&#34;&gt;CRAN policies&lt;/a&gt; change without notice to the existing developers.
Sending an email to the maintainers or to the R-announce mailing list would help developers notice policy changes.
Developers had to create a &lt;a href=&#34;http://dirk.eddelbuettel.com/blog/2013/10/23/&#34;&gt;policy watch&lt;/a&gt; and other resources to &lt;a href=&#34;https://blog.r-hub.io/2019/05/29/keep-up-with-cran/&#34;&gt;keep up with CRAN&lt;/a&gt; changes, which affect developers not only when submitting a package but also for packages already on CRAN.&lt;/p&gt;
&lt;p&gt;The CRAN reviewers are involved in multiple demanding tasks: their regular jobs, their commitments outside work (family, friends, other interests), and then R development and maintenance, CRAN reviews and maintenance, and the R Journal&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.
One possible solution to reduce their burden is to increase the number of reviewers.
Perhaps a mentorship program for reviewing packages, or a guideline on what to check, would help train new reviewers and reduce the pressure on the current volunteers.&lt;/p&gt;
&lt;p&gt;The pace and amount of work of the maintainers, as seen in this analysis, is huge, and there is much more that cannot be seen with this data.
Many thanks to all the volunteers that maintain it, to those who donate to the R Foundation, and to the employers of those volunteers, who make CRAN and R possible.&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2021-08-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package      * version     date       lib source                              
##  assertthat     0.2.1       2019-03-21 [1] CRAN (R 4.0.1)                      
##  backports      1.2.1       2020-12-09 [1] CRAN (R 4.0.1)                      
##  blogdown       1.3         2021-04-14 [1] CRAN (R 4.0.1)                      
##  bookdown       0.22        2021-04-22 [1] CRAN (R 4.0.1)                      
##  broom          0.7.6       2021-04-05 [1] CRAN (R 4.0.1)                      
##  bslib          0.2.5       2021-05-12 [1] CRAN (R 4.0.1)                      
##  cellranger     1.1.0       2016-07-27 [1] CRAN (R 4.0.1)                      
##  cli            2.5.0       2021-04-26 [1] CRAN (R 4.0.1)                      
##  colorspace     2.0-1       2021-05-04 [1] CRAN (R 4.0.1)                      
##  crayon         1.4.1       2021-02-08 [1] CRAN (R 4.0.1)                      
##  DBI            1.1.1       2021-01-15 [1] CRAN (R 4.0.1)                      
##  dbplyr         2.1.1       2021-04-06 [1] CRAN (R 4.0.1)                      
##  digest         0.6.27      2020-10-24 [1] CRAN (R 4.0.1)                      
##  dplyr        * 1.0.6       2021-05-05 [1] CRAN (R 4.0.1)                      
##  ellipsis       0.3.2       2021-04-29 [1] CRAN (R 4.0.1)                      
##  evaluate       0.14        2019-05-28 [1] CRAN (R 4.0.1)                      
##  fansi          0.5.0       2021-05-25 [1] CRAN (R 4.0.1)                      
##  farver         2.1.0       2021-02-28 [1] CRAN (R 4.0.1)                      
##  forcats      * 0.5.1       2021-01-27 [1] CRAN (R 4.0.1)                      
##  fs             1.5.0       2020-07-31 [1] CRAN (R 4.0.1)                      
##  generics       0.1.0       2020-10-31 [1] CRAN (R 4.0.1)                      
##  ggplot2      * 3.3.5       2021-06-25 [1] CRAN (R 4.0.1)                      
##  glue           1.4.2       2020-08-27 [1] CRAN (R 4.0.1)                      
##  gtable         0.3.0       2019-03-25 [1] CRAN (R 4.0.1)                      
##  haven          2.4.1       2021-04-23 [1] CRAN (R 4.0.1)                      
##  here           1.0.1       2020-12-13 [1] CRAN (R 4.0.1)                      
##  highr          0.9         2021-04-16 [1] CRAN (R 4.0.1)                      
##  hms          * 1.0.0       2021-01-13 [1] CRAN (R 4.0.1)                      
##  htmltools      0.5.1.1     2021-01-22 [1] CRAN (R 4.0.1)                      
##  httr           1.4.2       2020-07-20 [1] CRAN (R 4.0.1)                      
##  jquerylib      0.1.4       2021-04-26 [1] CRAN (R 4.0.1)                      
##  jsonlite       1.7.2       2020-12-09 [1] CRAN (R 4.0.1)                      
##  knitr          1.33        2021-04-24 [1] CRAN (R 4.0.1)                      
##  labeling       0.4.2       2020-10-20 [1] CRAN (R 4.0.1)                      
##  lattice        0.20-41     2020-04-02 [1] CRAN (R 4.0.1)                      
##  lifecycle      1.0.0       2021-02-15 [1] CRAN (R 4.0.1)                      
##  lubridate    * 1.7.10.9000 2021-06-12 [1] Github (tidyverse/lubridate@1e0d66f)
##  magrittr       2.0.1       2020-11-17 [1] CRAN (R 4.0.1)                      
##  Matrix         1.3-2       2021-01-06 [1] CRAN (R 4.0.1)                      
##  mgcv           1.8-35      2021-04-18 [1] CRAN (R 4.0.1)                      
##  modelr         0.1.8       2020-05-19 [1] CRAN (R 4.0.1)                      
##  munsell        0.5.0       2018-06-12 [1] CRAN (R 4.0.1)                      
##  nlme           3.1-152     2021-02-04 [1] CRAN (R 4.0.1)                      
##  patchwork    * 1.1.1       2020-12-17 [1] CRAN (R 4.0.1)                      
##  pillar         1.6.1       2021-05-16 [1] CRAN (R 4.0.1)                      
##  pkgconfig      2.0.3       2019-09-22 [1] CRAN (R 4.0.1)                      
##  purrr        * 0.3.4       2020-04-17 [1] CRAN (R 4.0.1)                      
##  R6             2.5.0       2020-10-28 [1] CRAN (R 4.0.1)                      
##  RColorBrewer   1.1-2       2014-12-07 [1] CRAN (R 4.0.1)                      
##  Rcpp           1.0.6       2021-01-15 [1] CRAN (R 4.0.1)                      
##  readr        * 1.4.0       2020-10-05 [1] CRAN (R 4.0.1)                      
##  readxl         1.3.1       2019-03-13 [1] CRAN (R 4.0.1)                      
##  reprex         2.0.0       2021-04-02 [1] CRAN (R 4.0.1)                      
##  rlang          0.4.11      2021-04-30 [1] CRAN (R 4.0.1)                      
##  rmarkdown      2.9         2021-06-15 [1] CRAN (R 4.0.1)                      
##  rprojroot      2.0.2       2020-11-15 [1] CRAN (R 4.0.1)                      
##  rstudioapi     0.13        2020-11-12 [1] CRAN (R 4.0.1)                      
##  rvest          1.0.0       2021-03-09 [1] CRAN (R 4.0.1)                      
##  sass           0.4.0       2021-05-12 [1] CRAN (R 4.0.1)                      
##  scales         1.1.1       2020-05-11 [1] CRAN (R 4.0.1)                      
##  sessioninfo    1.1.1       2018-11-05 [1] CRAN (R 4.0.1)                      
##  stringi        1.6.2       2021-05-17 [1] CRAN (R 4.0.1)                      
##  stringr      * 1.4.0       2019-02-10 [1] CRAN (R 4.0.1)                      
##  tibble       * 3.1.2       2021-05-16 [1] CRAN (R 4.0.1)                      
##  tidyr        * 1.1.3       2021-03-03 [1] CRAN (R 4.0.1)                      
##  tidyselect     1.1.1       2021-04-30 [1] CRAN (R 4.0.1)                      
##  tidyverse    * 1.3.1       2021-04-15 [1] CRAN (R 4.0.1)                      
##  utf8           1.2.1       2021-03-12 [1] CRAN (R 4.0.1)                      
##  vctrs          0.3.8       2021-04-29 [1] CRAN (R 4.0.1)                      
##  viridisLite    0.4.0       2021-04-13 [1] CRAN (R 4.0.1)                      
##  withr          2.4.2       2021-04-18 [1] CRAN (R 4.0.1)                      
##  xfun           0.24        2021-06-15 [1] CRAN (R 4.0.1)                      
##  xml2           1.3.2       2020-04-23 [1] CRAN (R 4.0.1)                      
##  yaml           2.2.1       2020-02-01 [1] CRAN (R 4.0.1)                      
## 
## [1] /home/lluis/bin/R/4.0.1/lib/R/library&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Or a problem with ggplot2 representing a sudden value that is much different from those around it.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Which now I cannot find the evidence to link to.
If anyone finds the tweet I would appreciate it.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;I’m not aware of anyone whose full job is just R reviewing.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>CRAN dependencies</title>
      <link>https://llrs.dev/post/2020/01/25/cran-dependencies/</link>
      <pubDate>Sat, 25 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2020/01/25/cran-dependencies/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/post/2020/01/25/cran-dependencies/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;new-policy-in-cran&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;New policy in CRAN&lt;/h1&gt;
&lt;p&gt;CRAN has a new policy of a maximum of 20 packages in Imports.
Let’s see how many dependencies each package on CRAN and Bioconductor has:&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cran&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;CRAN&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ap &amp;lt;- available.packages()
dp &amp;lt;- tools::package_dependencies(rownames(ap), db = ap, which = &amp;quot;Imports&amp;quot;, 
                                  recursive = FALSE)
dp_n &amp;lt;- lengths(dp)
tb_dp &amp;lt;- sort(table(dp_n), decreasing = TRUE)
barplot(tb_dp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2020/01/25/cran-dependencies/index_files/figure-html/cran-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Most of the packages have 0 dependencies, and just 212 have 20 dependencies or more:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(dp_n)[dp_n &amp;gt;= 20]
##   [1] &amp;quot;AdhereRViz&amp;quot;         &amp;quot;AFM&amp;quot;                &amp;quot;AirSensor&amp;quot;         
##   [4] &amp;quot;alookr&amp;quot;             &amp;quot;amt&amp;quot;                &amp;quot;animaltracker&amp;quot;     
##   [7] &amp;quot;antaresViz&amp;quot;         &amp;quot;BAMBI&amp;quot;              &amp;quot;BasketballAnalyzeR&amp;quot;
##  [10] &amp;quot;BAwiR&amp;quot;              &amp;quot;bea.R&amp;quot;              &amp;quot;BETS&amp;quot;              
##  [13] &amp;quot;bibliometrix&amp;quot;       &amp;quot;biomod2&amp;quot;            &amp;quot;bioRad&amp;quot;            
##  [16] &amp;quot;BIRDS&amp;quot;              &amp;quot;bootnet&amp;quot;            &amp;quot;bpcs&amp;quot;              
##  [19] &amp;quot;breathtestcore&amp;quot;     &amp;quot;brms&amp;quot;               &amp;quot;card&amp;quot;              
##  [22] &amp;quot;chemmodlab&amp;quot;         &amp;quot;chillR&amp;quot;             &amp;quot;Clustering&amp;quot;        
##  [25] &amp;quot;CNVScope&amp;quot;           &amp;quot;codebook&amp;quot;           &amp;quot;cSEM&amp;quot;              
##  [28] &amp;quot;ctmm&amp;quot;               &amp;quot;ctsem&amp;quot;              &amp;quot;DAMisc&amp;quot;            
##  [31] &amp;quot;dartR&amp;quot;              &amp;quot;datacleanr&amp;quot;         &amp;quot;dccvalidator&amp;quot;      
##  [34] &amp;quot;devtools&amp;quot;           &amp;quot;dextergui&amp;quot;          &amp;quot;diceR&amp;quot;             
##  [37] &amp;quot;dipsaus&amp;quot;            &amp;quot;DIscBIO&amp;quot;            &amp;quot;distill&amp;quot;           
##  [40] &amp;quot;dlookr&amp;quot;             &amp;quot;dragon&amp;quot;             &amp;quot;drhur&amp;quot;             
##  [43] &amp;quot;dyngen&amp;quot;             &amp;quot;dynwrap&amp;quot;            &amp;quot;ebirdst&amp;quot;           
##  [46] &amp;quot;ecd&amp;quot;                &amp;quot;ecochange&amp;quot;          &amp;quot;EcoGenetics&amp;quot;       
##  [49] &amp;quot;ecospat&amp;quot;            &amp;quot;EFAtools&amp;quot;           &amp;quot;eiCompare&amp;quot;         
##  [52] &amp;quot;elementR&amp;quot;           &amp;quot;emdi&amp;quot;               &amp;quot;emuR&amp;quot;              
##  [55] &amp;quot;eph&amp;quot;                &amp;quot;EpiNow2&amp;quot;            &amp;quot;epitweetr&amp;quot;         
##  [58] &amp;quot;fdm2id&amp;quot;             &amp;quot;FedData&amp;quot;            &amp;quot;finalfit&amp;quot;          
##  [61] &amp;quot;forestmangr&amp;quot;        &amp;quot;genBaRcode&amp;quot;         &amp;quot;geoviz&amp;quot;            
##  [64] &amp;quot;ggquickeda&amp;quot;         &amp;quot;GJRM&amp;quot;               &amp;quot;GmAMisc&amp;quot;           
##  [67] &amp;quot;golem&amp;quot;              &amp;quot;graph4lg&amp;quot;           &amp;quot;GWSDAT&amp;quot;            
##  [70] &amp;quot;hdpGLM&amp;quot;             &amp;quot;highcharter&amp;quot;        &amp;quot;hmi&amp;quot;               
##  [73] &amp;quot;htsr&amp;quot;               &amp;quot;hybridEnsemble&amp;quot;     &amp;quot;iCellR&amp;quot;            
##  [76] &amp;quot;immunarch&amp;quot;          &amp;quot;inlmisc&amp;quot;            &amp;quot;IntClust&amp;quot;          
##  [79] &amp;quot;iNZightTools&amp;quot;       &amp;quot;IOHanalyzer&amp;quot;        &amp;quot;isoreader&amp;quot;         
##  [82] &amp;quot;ITNr&amp;quot;               &amp;quot;jmv&amp;quot;                &amp;quot;jsmodule&amp;quot;          
##  [85] &amp;quot;JWileymisc&amp;quot;         &amp;quot;KarsTS&amp;quot;             &amp;quot;lilikoi&amp;quot;           
##  [88] &amp;quot;mdapack&amp;quot;            &amp;quot;memapp&amp;quot;             &amp;quot;metacoder&amp;quot;         
##  [91] &amp;quot;MetaDBparse&amp;quot;        &amp;quot;MetaIntegrator&amp;quot;     &amp;quot;microbial&amp;quot;         
##  [94] &amp;quot;missCompare&amp;quot;        &amp;quot;mlflow&amp;quot;             &amp;quot;modchart&amp;quot;          
##  [97] &amp;quot;modeltime&amp;quot;          &amp;quot;modeltime.ensemble&amp;quot; &amp;quot;modeltime.resample&amp;quot;
## [100] &amp;quot;momentuHMM&amp;quot;        
##  [ reached getOption(&amp;quot;max.print&amp;quot;) -- omitted 112 entries ]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bioconductor&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bioconductor&lt;/h1&gt;
&lt;p&gt;The interesting part is discovering how to add the repository.
The trick is to make use of &lt;code&gt;BiocManager::repositories()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;BioC_software &amp;lt;- BiocManager::repositories()[&amp;quot;BioCsoft&amp;quot;]
bp &amp;lt;- available.packages(contriburl = contrib.url(BioC_software))
dp_BioC &amp;lt;- tools::package_dependencies(rownames(bp), db = bp, which = &amp;quot;Imports&amp;quot;, 
                                  recursive = FALSE)
dp_BioC_n &amp;lt;- lengths(dp_BioC)
tb_dp_BioC &amp;lt;- sort(table(dp_BioC_n), decreasing = TRUE)
barplot(tb_dp_BioC)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2020/01/25/cran-dependencies/index_files/figure-html/BioC-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It seems that software packages on Bioconductor tend to have more dependencies than those on CRAN.
If this policy were implemented on Bioconductor, it would affect 219 packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(dp_BioC_n)[dp_BioC_n &amp;gt;= 20]
##   [1] &amp;quot;abseqR&amp;quot;              &amp;quot;adductomicsR&amp;quot;        &amp;quot;ALPS&amp;quot;               
##   [4] &amp;quot;AlpsNMR&amp;quot;             &amp;quot;AMARETTO&amp;quot;            &amp;quot;amplican&amp;quot;           
##   [7] &amp;quot;AneuFinder&amp;quot;          &amp;quot;animalcules&amp;quot;         &amp;quot;appreci8R&amp;quot;          
##  [10] &amp;quot;ArrayExpressHTS&amp;quot;     &amp;quot;arrayQualityMetrics&amp;quot; &amp;quot;artMS&amp;quot;              
##  [13] &amp;quot;ASpediaFI&amp;quot;           &amp;quot;ATACseqQC&amp;quot;           &amp;quot;BASiCS&amp;quot;             
##  [16] &amp;quot;BatchQC&amp;quot;             &amp;quot;bigPint&amp;quot;             &amp;quot;bioCancer&amp;quot;          
##  [19] &amp;quot;BiocOncoTK&amp;quot;          &amp;quot;BiocPkgTools&amp;quot;        &amp;quot;biovizBase&amp;quot;         
##  [22] &amp;quot;biscuiteer&amp;quot;          &amp;quot;BPRMeth&amp;quot;             &amp;quot;bsseq&amp;quot;              
##  [25] &amp;quot;BUSpaRse&amp;quot;            &amp;quot;CAGEr&amp;quot;               &amp;quot;CATALYST&amp;quot;           
##  [28] &amp;quot;celda&amp;quot;               &amp;quot;CEMiTool&amp;quot;            &amp;quot;CeTF&amp;quot;               
##  [31] &amp;quot;ChAMP&amp;quot;               &amp;quot;chimeraviz&amp;quot;          &amp;quot;chipenrich&amp;quot;         
##  [34] &amp;quot;ChIPpeakAnno&amp;quot;        &amp;quot;ChIPQC&amp;quot;              &amp;quot;ChIPseeker&amp;quot;         
##  [37] &amp;quot;ChromSCape&amp;quot;          &amp;quot;chromVAR&amp;quot;            &amp;quot;cicero&amp;quot;             
##  [40] &amp;quot;circRNAprofiler&amp;quot;     &amp;quot;CiteFuse&amp;quot;            &amp;quot;clusterExperiment&amp;quot;  
##  [43] &amp;quot;clustifyr&amp;quot;           &amp;quot;CNEr&amp;quot;                &amp;quot;CNVPanelizer&amp;quot;       
##  [46] &amp;quot;CNVRanger&amp;quot;           &amp;quot;cola&amp;quot;                &amp;quot;COMPASS&amp;quot;            
##  [49] &amp;quot;compcodeR&amp;quot;           &amp;quot;CONFESS&amp;quot;             &amp;quot;consensusDE&amp;quot;        
##  [52] &amp;quot;contiBAIT&amp;quot;           &amp;quot;crlmm&amp;quot;               &amp;quot;crossmeta&amp;quot;          
##  [55] &amp;quot;cTRAP&amp;quot;               &amp;quot;CytoML&amp;quot;              &amp;quot;CytoTree&amp;quot;           
##  [58] &amp;quot;DAMEfinder&amp;quot;          &amp;quot;DaMiRseq&amp;quot;            &amp;quot;debrowser&amp;quot;          
##  [61] &amp;quot;deco&amp;quot;                &amp;quot;decompTumor2Sig&amp;quot;     &amp;quot;DEGreport&amp;quot;          
##  [64] &amp;quot;DEP&amp;quot;                 &amp;quot;DepecheR&amp;quot;            &amp;quot;destiny&amp;quot;            
##  [67] &amp;quot;DEsubs&amp;quot;              &amp;quot;DiffBind&amp;quot;            &amp;quot;diffcyt&amp;quot;            
##  [70] &amp;quot;diffHic&amp;quot;             &amp;quot;diffloop&amp;quot;            &amp;quot;DiscoRhythm&amp;quot;        
##  [73] &amp;quot;dmrseq&amp;quot;              &amp;quot;Doscheda&amp;quot;            &amp;quot;EGSEA&amp;quot;              
##  [76] &amp;quot;ELMER&amp;quot;               &amp;quot;ENmix&amp;quot;               &amp;quot;enrichTF&amp;quot;           
##  [79] &amp;quot;esATAC&amp;quot;              &amp;quot;EventPointer&amp;quot;        &amp;quot;exomePeak2&amp;quot;         
##  [82] &amp;quot;fcoex&amp;quot;               &amp;quot;FindMyFriends&amp;quot;       &amp;quot;flowSpy&amp;quot;            
##  [85] &amp;quot;flowWorkspace&amp;quot;       &amp;quot;FRASER&amp;quot;              &amp;quot;GAPGOM&amp;quot;             
##  [88] &amp;quot;GENESIS&amp;quot;             &amp;quot;GeneTonic&amp;quot;           &amp;quot;genomation&amp;quot;         
##  [91] &amp;quot;GenomicInteractions&amp;quot; &amp;quot;GenVisR&amp;quot;             &amp;quot;ggbio&amp;quot;              
##  [94] &amp;quot;GGtools&amp;quot;             &amp;quot;GladiaTOX&amp;quot;           &amp;quot;GmicR&amp;quot;              
##  [97] &amp;quot;gQTLstats&amp;quot;           &amp;quot;Gviz&amp;quot;                &amp;quot;GWENA&amp;quot;              
## [100] &amp;quot;HiCBricks&amp;quot;          
##  [ reached getOption(&amp;quot;max.print&amp;quot;) -- omitted 119 entries ]&lt;/code&gt;&lt;/pre&gt;
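&lt;p&gt;The claim that Bioconductor packages tend to carry more dependencies can be checked numerically. A minimal sketch, assuming the &lt;code&gt;dp_n&lt;/code&gt; and &lt;code&gt;dp_BioC_n&lt;/code&gt; vectors computed in the chunks above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Compare the distributions of Imports counts per package:
summary(dp_n)
summary(dp_BioC_n)
# Proportion of packages at or above the 20-Imports limit:
mean(dp_n &amp;gt;= 20)
mean(dp_BioC_n &amp;gt;= 20)&lt;/code&gt;&lt;/pre&gt;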
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       Ubuntu 20.04.1 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2021-01-08                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.1)                   
##  BiocManager   1.30.10 2019-11-16 [1] CRAN (R 4.0.1)                   
##  blogdown      0.21.84 2021-01-07 [1] Github (rstudio/blogdown@c4fbb58)
##  bookdown      0.21    2020-10-13 [1] CRAN (R 4.0.1)                   
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.1)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.1)                   
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.1)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)                   
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.1)                   
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.1)                   
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)                   
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.1)                   
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.1)                   
##  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.1)                   
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.1)                   
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.1)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.1)                   
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.1)                   
##  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.1)                   
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.1)                   
## 
## [1] /home/lluis/bin/R/4.0.1/lib/R/library&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
