<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>twitter | B101nfo</title>
    <link>https://llrs.dev/tags/twitter/</link>
      <atom:link href="https://llrs.dev/tags/twitter/index.xml" rel="self" type="application/rss+xml" />
    <description>twitter</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>If it is code, you can copy and reuse it (MIT); if it is text, please cite and reuse it (CC-BY). 2024.</copyright><lastBuildDate>Sat, 25 Apr 2020 00:00:00 +0000</lastBuildDate>
    
    <item>
      <title>R quarantine house</title>
      <link>https://llrs.dev/post/2020/04/25/r-quarantine-house/</link>
      <pubDate>Sat, 25 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2020/04/25/r-quarantine-house/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/post/2020/04/25/r-quarantine-house/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;So I found this funny tweet:&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-partner=&#34;tweetdeck&#34;&gt;
&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;
What&#39;s your R quarantine house? I&#39;m definitely 5 &lt;a href=&#34;https://t.co/h7aiijOqK0&#34;&gt;pic.twitter.com/h7aiijOqK0&lt;/a&gt;
&lt;/p&gt;
— Jacqueline Nolis (&lt;span class=&#34;citation&#34;&gt;@skyetetra&lt;/span&gt;) &lt;a href=&#34;https://twitter.com/skyetetra/status/1253774850356768768?ref_src=twsrc%5Etfw&#34;&gt;April 24, 2020&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;p&gt;And &lt;a href=&#34;https://twitter.com/tylermorganwall/status/1253778147423727621&#34;&gt;Tyler Morgan-Wall made the “joke”&lt;/a&gt; of checking the dependencies. So, let’s check them:&lt;/p&gt;
&lt;div id=&#34;list-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;List libraries&lt;/h2&gt;
&lt;p&gt;First we set up the original choices:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;env1 &amp;lt;- c(&amp;quot;ggplot2&amp;quot;, &amp;quot;dplyr&amp;quot;, &amp;quot;data.table&amp;quot;, &amp;quot;purrr&amp;quot;)
env2 &amp;lt;- c(&amp;quot;forecats&amp;quot;, &amp;quot;glue&amp;quot;, &amp;quot;jsonlite&amp;quot;, &amp;quot;rmarkdown&amp;quot;)
env3 &amp;lt;- c(&amp;quot;shiny&amp;quot;, &amp;quot;rayshader&amp;quot;, &amp;quot;stringr&amp;quot;, &amp;quot;tidytext&amp;quot;)
env4 &amp;lt;- c(&amp;quot;devtools&amp;quot;, &amp;quot;xml2&amp;quot;, &amp;quot;tidyr&amp;quot;, &amp;quot;tibble&amp;quot;)
env5 &amp;lt;- c(&amp;quot;reticulate&amp;quot;, &amp;quot;keras&amp;quot;, &amp;quot;plumber&amp;quot;, &amp;quot;usethis&amp;quot;)
env6 &amp;lt;- c(&amp;quot;blogdown&amp;quot;, &amp;quot;brickr&amp;quot;, &amp;quot;lubridate&amp;quot;, &amp;quot;igraph&amp;quot;)
quarantines &amp;lt;- list(env1 = env1, env2 = env2, 
                    env3 = env3, env4 = env4, 
                    env5 = env5, env6 = env6)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;dependencies&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Dependencies&lt;/h2&gt;
&lt;p&gt;All of them are on CRAN (and I don’t have them installed on my computer) so let’s retrieve the available packages from CRAN. Then we can check how many unique packages are needed for each one:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tools&amp;quot;)
ap &amp;lt;- available.packages()
unique_dep &amp;lt;- function(sets, db) {
  pd &amp;lt;- package_dependencies(packages = sets, recursive = TRUE, db = db)
  unique(unlist(pd))
}

uniq_p &amp;lt;- lapply(quarantines, unique_dep, db = ap)
sort(lengths(uniq_p))
## env2 env1 env5 env3 env4 env6 
##   22   59   63   89   91   96&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So environment 6 is the one with the most dependencies, and environment 2 is the one with the fewest.&lt;/p&gt;
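&lt;p&gt;The counting step boils down to flattening each environment’s per-package dependency lists and de-duplicating them. A minimal sketch with made-up dependency lists (the dependencies here are illustrative, not real CRAN metadata):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy dependency lists, one element per package of an environment
toy_deps &amp;lt;- list(ggplot2 = c(&amp;quot;rlang&amp;quot;, &amp;quot;cli&amp;quot;, &amp;quot;scales&amp;quot;),
                 dplyr   = c(&amp;quot;rlang&amp;quot;, &amp;quot;vctrs&amp;quot;, &amp;quot;cli&amp;quot;))
# flatten and de-duplicate, as unique_dep() does with the real CRAN db
unique(unlist(toy_deps, use.names = FALSE))
## [1] &amp;quot;rlang&amp;quot;  &amp;quot;cli&amp;quot;    &amp;quot;scales&amp;quot; &amp;quot;vctrs&amp;quot;&lt;/code&gt;&lt;/pre&gt;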
&lt;/div&gt;
&lt;div id=&#34;similarity-of-the-environments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Similarity of the environments&lt;/h2&gt;
&lt;p&gt;We’ve seen that the number of packages is quite different. But how many of them are shared?
A while ago I wrote a package aimed at exactly this: &lt;a href=&#34;https://bioconductor.org/packages/BioCor&#34;&gt;{&lt;code&gt;BioCor&lt;/code&gt;}&lt;/a&gt;, which you can install from Bioconductor. I’ll use it now:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;BioCor&amp;quot;)
similarity &amp;lt;- mpathSim(names(uniq_p), inverseList(uniq_p), method = NULL)
similarity
##           env1      env2      env3      env4      env5      env6
## env1 1.0000000 0.2716049 0.5675676 0.5733333 0.5081967 0.7612903
## env2 0.2716049 1.0000000 0.3783784 0.3716814 0.3294118 0.3728814
## env3 0.5675676 0.3783784 1.0000000 0.5666667 0.4868421 0.7783784
## env4 0.5733333 0.3716814 0.5666667 1.0000000 0.6623377 0.6737968
## env5 0.5081967 0.3294118 0.4868421 0.6623377 1.0000000 0.5031447
## env6 0.7612903 0.3728814 0.7783784 0.6737968 0.5031447 1.0000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The closer a value is to 1, the more dependencies the two environments share, so the most different pair is environment 1 and environment 2.
The most similar pair is environment 3 and environment 6, and
environment 6 is the one with the highest similarity to the other sets.&lt;/p&gt;
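&lt;p&gt;These scores behave like a set-overlap measure: the more dependencies two environments share, the closer the value is to 1. As a rough illustration of the idea (a Dice coefficient on toy sets, not necessarily {&lt;code&gt;BioCor&lt;/code&gt;}’s exact formula):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Dice coefficient: 2 * |intersection| / (|A| + |B|)
dice &amp;lt;- function(a, b) 2 * length(intersect(a, b)) / (length(a) + length(b))
a &amp;lt;- c(&amp;quot;rlang&amp;quot;, &amp;quot;cli&amp;quot;, &amp;quot;vctrs&amp;quot;)
b &amp;lt;- c(&amp;quot;rlang&amp;quot;, &amp;quot;cli&amp;quot;, &amp;quot;glue&amp;quot;, &amp;quot;magrittr&amp;quot;)
dice(a, b)
## [1] 0.5714286&lt;/code&gt;&lt;/pre&gt;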
&lt;/div&gt;
&lt;div id=&#34;which-quarantine-environment-has-some-of-the-others&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which quarantine environment has some of the others?&lt;/h2&gt;
&lt;p&gt;Some of these environments pull in packages from the other environments as dependencies.
Let’s count how many:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inside_calls &amp;lt;- lapply(uniq_p, function(x, y) {
  # Look how many packages of each set is on the dependencies of this set
  vapply(y, function(z, x) { 
    sum(z %in% x)
  }, x = x, numeric(1L))
}, y = quarantines)
# Simplify and name for easier understanding
inside &amp;lt;- simplify2array(inside_calls)
names(dimnames(inside)) &amp;lt;- list(&amp;quot;Package of&amp;quot;, &amp;quot;Inside of&amp;quot;)
inside
##           Inside of
## Package of env1 env2 env3 env4 env5 env6
##       env1    1    0    2    2    1    3
##       env2    2    2    2    2    2    3
##       env3    0    1    2    1    0    2
##       env4    1    0    1    2    0    2
##       env5    0    0    0    1    1    0
##       env6    0    0    0    0    0    0
colSums(inside)-diag(inside) # To avoid counting self-dependencies
## env1 env2 env3 env4 env5 env6 
##    3    1    5    6    3   10&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that environment 6 contains the most packages from the other environments among its dependencies.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;chances-of-survival&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Chances of survival&lt;/h2&gt;
&lt;p&gt;Someone mentioned that the &lt;code&gt;{survival}&lt;/code&gt; package wasn’t in any environment.
But it might be among the dependencies:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vapply(uniq_p, function(x){&amp;quot;survival&amp;quot; %in% x},  logical(1L))
##  env1  env2  env3  env4  env5  env6 
## FALSE FALSE FALSE FALSE FALSE FALSE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No, it seems like we won’t survive well with these environments :)&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Environment 6 is the one with the most packages from the other environments, but if you want to install as few dependencies as possible, pick the second one. What you can do with these packages during a quarantine is harder to say :D&lt;/p&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       Ubuntu 20.04.1 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2021-01-08                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version  date       lib source                           
##  annotate        1.68.0   2020-10-27 [1] Bioconductor                     
##  AnnotationDbi   1.52.0   2020-10-27 [1] Bioconductor                     
##  assertthat      0.2.1    2019-03-21 [1] CRAN (R 4.0.1)                   
##  Biobase         2.50.0   2020-10-27 [1] Bioconductor                     
##  BiocGenerics    0.36.0   2020-10-27 [1] Bioconductor                     
##  BioCor        * 1.14.0   2020-10-27 [1] Bioconductor                     
##  BiocParallel    1.24.1   2020-11-06 [1] Bioconductor                     
##  bit             4.0.4    2020-08-04 [1] CRAN (R 4.0.1)                   
##  bit64           4.0.5    2020-08-30 [1] CRAN (R 4.0.1)                   
##  blob            1.2.1    2020-01-20 [1] CRAN (R 4.0.1)                   
##  blogdown        0.21.84  2021-01-07 [1] Github (rstudio/blogdown@c4fbb58)
##  bookdown        0.21     2020-10-13 [1] CRAN (R 4.0.1)                   
##  cli             2.2.0    2020-11-20 [1] CRAN (R 4.0.1)                   
##  crayon          1.3.4    2017-09-16 [1] CRAN (R 4.0.1)                   
##  DBI             1.1.0    2019-12-15 [1] CRAN (R 4.0.1)                   
##  digest          0.6.27   2020-10-24 [1] CRAN (R 4.0.1)                   
##  evaluate        0.14     2019-05-28 [1] CRAN (R 4.0.1)                   
##  fansi           0.4.1    2020-01-08 [1] CRAN (R 4.0.1)                   
##  glue            1.4.2    2020-08-27 [1] CRAN (R 4.0.1)                   
##  graph           1.68.0   2020-10-27 [1] Bioconductor                     
##  GSEABase        1.52.1   2020-12-11 [1] Bioconductor                     
##  htmltools       0.5.0    2020-06-16 [1] CRAN (R 4.0.1)                   
##  httr            1.4.2    2020-07-20 [1] CRAN (R 4.0.1)                   
##  IRanges         2.24.1   2020-12-12 [1] Bioconductor                     
##  knitr           1.30     2020-09-22 [1] CRAN (R 4.0.1)                   
##  lattice         0.20-41  2020-04-02 [1] CRAN (R 4.0.1)                   
##  magrittr        2.0.1    2020-11-17 [1] CRAN (R 4.0.1)                   
##  Matrix          1.3-2    2021-01-06 [1] CRAN (R 4.0.1)                   
##  memoise         1.1.0    2017-04-21 [1] CRAN (R 4.0.1)                   
##  R6              2.5.0    2020-10-28 [1] CRAN (R 4.0.1)                   
##  Rcpp            1.0.5    2020-07-06 [1] CRAN (R 4.0.1)                   
##  rlang           0.4.10   2020-12-30 [1] CRAN (R 4.0.1)                   
##  rmarkdown       2.6      2020-12-14 [1] CRAN (R 4.0.1)                   
##  RSQLite         2.2.1    2020-09-30 [1] CRAN (R 4.0.1)                   
##  S4Vectors       0.28.1   2020-12-09 [1] Bioconductor                     
##  sessioninfo     1.1.1    2018-11-05 [1] CRAN (R 4.0.1)                   
##  stringi         1.5.3    2020-09-09 [1] CRAN (R 4.0.1)                   
##  stringr         1.4.0    2019-02-10 [1] CRAN (R 4.0.1)                   
##  vctrs           0.3.6    2020-12-17 [1] CRAN (R 4.0.1)                   
##  withr           2.3.0    2020-09-22 [1] CRAN (R 4.0.1)                   
##  xfun            0.20     2021-01-06 [1] CRAN (R 4.0.1)                   
##  XML             3.99-0.5 2020-07-23 [1] CRAN (R 4.0.1)                   
##  xtable          1.8-4    2019-04-21 [1] CRAN (R 4.0.1)                   
##  yaml            2.2.1    2020-02-01 [1] CRAN (R 4.0.1)                   
## 
## [1] /home/lluis/bin/R/4.0.1/lib/R/library&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>R weekly new editor</title>
      <link>https://llrs.dev/post/2020/02/13/r-weekly-new-editor/</link>
      <pubDate>Thu, 13 Feb 2020 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2020/02/13/r-weekly-new-editor/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/post/2020/02/13/r-weekly-new-editor/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;a href=&#34;https://rweekly.org&#34;&gt;Rweekly&lt;/a&gt; is looking for &lt;a href=&#34;https://docs.google.com/forms/d/e/1FAIpQLSet2Tq_mWWOVsKWxGOSoUg8DzCPlW2-nxIFOSkkRvlUFxQFLw/viewform&#34;&gt;new editors&lt;/a&gt;. But they need to have submitted “at least 6 PRs on R Weekly”. If you submitted something through &lt;a href=&#34;https://rweekly.org/submit&#34;&gt;the webpage&lt;/a&gt; you can also apply. But I’ll look at how many people have submitted pull requests (PRs) through GitHub on the rweekly/rweekly.org repository.&lt;/p&gt;
&lt;div id=&#34;gh&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;GH&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;{gh}&lt;/code&gt; package is good for this, but we need to know the GitHub API. After a quick search I found the endpoint:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;gh&amp;quot;)
PR &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;quot;) # Copied from https://developer.github.com/v3/pulls/
PR$total_count
## [1] 706&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We know that there have been 706 merged PRs, so we’ll need 8 calls to the API, because it returns 100 values on each call.&lt;/p&gt;
&lt;p&gt;This time we’ll use copy and paste for a quick solution:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;PR2 &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;amp;page=2&amp;quot;)
PR3 &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;amp;page=3&amp;quot;)
PR4 &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;amp;page=4&amp;quot;)
PR5 &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;amp;page=5&amp;quot;)
PR6 &amp;lt;- gh(&amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;per_page=100&amp;amp;page=6&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
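&lt;p&gt;The copy-and-paste above works, but the paging can also be sketched as a small loop. This builds the paged queries from the total count; the actual &lt;code&gt;gh()&lt;/code&gt; call is left commented out so the sketch doesn’t hit the API:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;base &amp;lt;- &amp;quot;GET /search/issues?q=repo:rweekly/rweekly.org+is:pr+is:merged&amp;amp;amp;per_page=100&amp;amp;amp;page=%d&amp;quot;
n_pages &amp;lt;- ceiling(706 / 100)  # total_count from the first call
queries &amp;lt;- sprintf(base, seq_len(n_pages))
length(queries)
## [1] 8
# data &amp;lt;- lapply(queries, gh::gh)&lt;/code&gt;&lt;/pre&gt;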
&lt;p&gt;Now that we have the data, we need to retrieve the user names:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- list(PR, PR2, PR3, PR4, PR5, PR6)

users &amp;lt;- lapply(data, function(x) {
  vapply(x$items, function(y) {y$user$login}, character(1L))
})
users &amp;lt;- sort(unlist(users))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We now know that 171 people have contributed through PRs.
How many PRs did each of them submit?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ts &amp;lt;- sort(table(users), decreasing = TRUE)
par(mar = c(8,3,3,0))
barplot(ts, las = 2, border = &amp;quot;gray&amp;quot;, main = &amp;quot;Contributors to Rweekly.org&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2020/02/13/r-weekly-new-editor/index_files/figure-html/barplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So we have 34 contributors who are eligible, fewer if we remove the current editors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(ts)[ts &amp;gt;= 6]
##  [1] &amp;quot;Ryo-N7&amp;quot;          &amp;quot;HenrikBengtsson&amp;quot; &amp;quot;martinctc&amp;quot;       &amp;quot;maelle&amp;quot;         
##  [5] &amp;quot;amrrs&amp;quot;           &amp;quot;jwijffels&amp;quot;       &amp;quot;lgellis&amp;quot;         &amp;quot;mcdussault&amp;quot;     
##  [9] &amp;quot;malcolmbarrett&amp;quot;  &amp;quot;moldach&amp;quot;         &amp;quot;dA505819&amp;quot;        &amp;quot;echasnovski&amp;quot;    
## [13] &amp;quot;jonmcalder&amp;quot;      &amp;quot;jonocarroll&amp;quot;     &amp;quot;mailund&amp;quot;         &amp;quot;suzanbaert&amp;quot;     
## [17] &amp;quot;seabbs&amp;quot;          &amp;quot;feddelegrand7&amp;quot;   &amp;quot;hfshr&amp;quot;           &amp;quot;lorenzwalthert&amp;quot; 
## [21] &amp;quot;MilesMcBain&amp;quot;     &amp;quot;RaoOfPhysics&amp;quot;    &amp;quot;tomroh&amp;quot;          &amp;quot;EmilHvitfeldt&amp;quot;  
## [25] &amp;quot;katiejolly&amp;quot;      &amp;quot;privefl&amp;quot;         &amp;quot;rCarto&amp;quot;          &amp;quot;deanmarchiori&amp;quot;  
## [29] &amp;quot;DougVegas&amp;quot;       &amp;quot;eokodie&amp;quot;         &amp;quot;jdblischak&amp;quot;      &amp;quot;mkmiecik14&amp;quot;     
## [33] &amp;quot;noamross&amp;quot;        &amp;quot;rstub&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;details&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       Ubuntu 20.04.1 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2021-01-08                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.1)                   
##  blogdown      0.21.84 2021-01-07 [1] Github (rstudio/blogdown@c4fbb58)
##  bookdown      0.21    2020-10-13 [1] CRAN (R 4.0.1)                   
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.1)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.1)                   
##  curl          4.3     2019-12-02 [1] CRAN (R 4.0.1)                   
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.1)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.1)                   
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.1)                   
##  gh          * 1.2.0   2020-11-27 [1] CRAN (R 4.0.1)                   
##  gitcreds      0.1.1   2020-12-04 [1] CRAN (R 4.0.1)                   
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.1)                   
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)                   
##  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.1)                   
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.1)                   
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.1)                   
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.1)                   
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.1)                   
##  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.1)                   
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.1)                   
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.1)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.1)                   
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.1)                   
##  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.1)                   
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.1)                   
## 
## [1] /home/lluis/bin/R/4.0.1/lib/R/library&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Twitter bot</title>
      <link>https://llrs.dev/post/2019/08/13/twitter-bot/</link>
      <pubDate>Tue, 13 Aug 2019 00:00:00 +0000</pubDate>
      <guid>https://llrs.dev/post/2019/08/13/twitter-bot/</guid>
      <description>
&lt;script src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I was talking with a friend about social networks when he mentioned that it
wasn’t worth his time to invest in podcasts.
He said I should look at his Twitter account instead, as that is more useful for him.
This reminded me that I hadn’t used the wonderful tools for Twitter analysis, nor had I had the motivation to analyze time series data.&lt;/p&gt;
&lt;p&gt;This blog post is my attempt to find out whether this user employs some kind of automated mechanism to publish.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;rtweet&amp;quot;)
user &amp;lt;- &amp;quot;josemariasiota&amp;quot;
# note: the Twitter API returns at most ~3200 of a user&amp;#39;s most recent tweets
user_tweets &amp;lt;- get_timeline(user, n = 180000, type = &amp;quot;mixed&amp;quot;, 
                            include_rts = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the tweets, we can check whether the account is a bot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tweetbotornot&amp;quot;) # from mkearney/tweetbotornot
# you might need to install this specific version of textfeatures:
# devtools::install_version(&amp;#39;textfeatures&amp;#39;, version=&amp;#39;0.2.0&amp;#39;)
botornot(user_tweets)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [32m↪[39m [38;5;244mCounting features in text...[39m
## [32m↪[39m [38;5;244mSentiment analysis...[39m
## [32m↪[39m [38;5;244mParts of speech...[39m
## [32m↪[39m [38;5;244mWord dimensions started[39m
## [32m✔[39m Job&amp;#39;s done!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 3
##   screen_name    user_id   prob_bot
##   &amp;lt;chr&amp;gt;          &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt;
## 1 josemariasiota 288661791    0.386&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It gives a moderate probability of being a bot (below 0.5).&lt;/p&gt;
&lt;p&gt;We can visualize them with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;ggplot2&amp;quot;)
ts_plot(user_tweets, &amp;quot;weeks&amp;quot;) +
  theme_bw() +
  labs(title = &amp;quot;Tweets by @josemariasiota&amp;quot;,
       subtitle = &amp;quot;Grouped by week&amp;quot;, x = NULL, y = &amp;quot;tweets&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can group the tweets by their source, i.e. whether they were posted interactively or through some other service:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;dplyr&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Attaching package: &amp;#39;dplyr&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:stats&amp;#39;:
## 
##     filter, lag&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:base&amp;#39;:
## 
##     intersect, setdiff, setequal, union&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;count(user_tweets, source, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 9 x 2
##   source                               n
##   &amp;lt;chr&amp;gt;                            &amp;lt;int&amp;gt;
## 1 dlvr.it                           1738
## 2 twitterfeed                        676
## 3 Twitter Web Client                 553
## 4 Twitter Web App                    149
## 5 Twitter for iPhone                  78
## 6 Twitter for Advertisers (legacy)    21
## 7 Hootsuite                           13
## 8 Twitter for iPad                     2
## 9 Twitter for Websites                 2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user &amp;lt;- user_tweets %&amp;gt;% 
  mutate(source = case_when(
    grepl(&amp;quot; for | on | Web &amp;quot;, source) ~ &amp;quot;direct&amp;quot;,
    TRUE ~ source
  ))

user %&amp;gt;% 
  count(source, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 2
##   source          n
##   &amp;lt;chr&amp;gt;       &amp;lt;int&amp;gt;
## 1 dlvr.it      1738
## 2 direct        805
## 3 twitterfeed   676
## 4 Hootsuite      13&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user &amp;lt;- user %&amp;gt;% 
  mutate(reply = case_when(
    is.na(reply_to_status_id) ~  &amp;quot;content?&amp;quot;,
    TRUE ~ &amp;quot;reply&amp;quot;))
user %&amp;gt;% 
  count(reply, source, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 3
##   reply    source          n
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;       &amp;lt;int&amp;gt;
## 1 content? dlvr.it      1738
## 2 content? direct        731
## 3 content? twitterfeed   676
## 4 reply    direct         74
## 5 content? Hootsuite      13&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;stringr&amp;quot;)
user &amp;lt;- user %&amp;gt;% 
  mutate(link = str_extract(text, &amp;quot;https?://.+\\b&amp;quot;),
         n_link = str_count(text, &amp;quot;https?://&amp;quot;),
         n_users = str_count(text, &amp;quot;@[:alnum:]+\\b&amp;quot;),
         n_hashtags = str_count(text, &amp;quot;#[:alnum:]+\\b&amp;quot;),
         via = str_count(text, &amp;quot;\\bvia\\b&amp;quot;))
user %&amp;gt;% count(n_link, reply, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 7 x 3
##   n_link reply        n
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;int&amp;gt;
## 1      1 content?  2508
## 2      2 content?   629
## 3      0 reply       57
## 4      0 content?    14
## 5      1 reply       14
## 6      3 content?     7
## 7      2 reply        3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user %&amp;gt;% 
  group_by(lang, source) %&amp;gt;% 
  summarise(n = n(), n_link = sum(n_link), n_users = sum(n_users), n_hashtags = sum(n_hashtags)) %&amp;gt;% 
  arrange(-n) %&amp;gt;% 
  ggplot() +
  geom_point(aes(lang, source, size = n)) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## `summarise()` regrouping output by &amp;#39;lang&amp;#39; (override with `.groups` argument)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/figure-html/count-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that, depending on the service, some languages are not used.&lt;/p&gt;
&lt;p&gt;We can visualize the tweets as they happen with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user %&amp;gt;% 
  mutate(hms = hms::as_hms(created_at),
         d = as.Date(created_at)) %&amp;gt;% 
  ggplot(aes(d, hms, col = source, shape = reply)) +
  geom_point() +
  theme_bw() +
  labs(y = &amp;quot;Hour&amp;quot;, x = &amp;quot;Date&amp;quot;, title = &amp;quot;Tweets&amp;quot;) +
  scale_x_date(date_breaks = &amp;quot;1 year&amp;quot;, date_labels = &amp;quot;%Y&amp;quot;, 
               expand = c(0.01, 0)) +
  scale_y_time(labels = function(x) strftime(x, &amp;quot;%H&amp;quot;),
               breaks = hms::hms(seq(0, 24, 1)*60*60), expand = c(0.01, 0))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can clearly see a change at the end of 2016, so I will focus on the data from that point forward.&lt;/p&gt;
&lt;p&gt;A package that caught my attention on Twitter was &lt;a href=&#34;https://cran.r-project.org/package=anomalize&#34;&gt;&lt;code&gt;anomalize&lt;/code&gt;&lt;/a&gt;, which searches for anomalies in time series data. I hope this algorithm will find when the posting is not automated.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;anomalize&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## &amp;lt;/&amp;gt; Learn more at: https://university.business-science.io/p/learning-labs-pro &amp;lt;/&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The excellent guide on their &lt;a href=&#34;https://business-science.github.io/anomalize/&#34;&gt;website&lt;/a&gt; is easy to understand and follow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user &amp;lt;- user %&amp;gt;% 
  filter(created_at &amp;gt; as.Date(&amp;quot;2016-11-01&amp;quot;)) %&amp;gt;% 
  arrange(created_at) %&amp;gt;% 
  time_decompose(created_at, method = &amp;quot;stl&amp;quot;, merge = TRUE, message = TRUE) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in mask$eval_all_filter(dots, env_filter): Incompatible methods
## (&amp;quot;Ops.POSIXt&amp;quot;, &amp;quot;Ops.Date&amp;quot;) for &amp;quot;&amp;gt;&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Converting from tbl_df to tbl_time.
## Auto-index message: index = created_at&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## frequency = 2 hours&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## trend = 42.5 hours&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Registered S3 method overwritten by &amp;#39;quantmod&amp;#39;:
##   method            from
##   as.zoo.data.frame zoo&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user %&amp;gt;% 
  filter(created_at &amp;gt; as.Date(&amp;quot;2016-11-01&amp;quot;)) %&amp;gt;% 
  anomalize(remainder, method = &amp;quot;iqr&amp;quot;) %&amp;gt;%
  time_recompose() %&amp;gt;%
  # Anomaly Visualization
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
  labs(title = &amp;quot;User anomalies&amp;quot;, 
       subtitle = &amp;quot;STL + IQR Methods&amp;quot;, 
       x = &amp;quot;Time&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user %&amp;gt;% 
  filter(created_at &amp;gt; as.Date(&amp;quot;2016-11-01&amp;quot;)) %&amp;gt;% 
  anomalize(remainder, method = &amp;quot;iqr&amp;quot;) %&amp;gt;%
  plot_anomaly_decomposition() +
  labs(title = &amp;quot;Decomposition of Anomalized Lubridate Downloads&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://llrs.dev/post/2019/08/13/twitter-bot/index_files/figure-html/unnamed-chunk-7-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can clearly see some regular patterns in the tweeting since then, which suggests it is automated. We can check it further with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;user %&amp;gt;% 
  filter(created_at &amp;gt; as.Date(&amp;quot;2016-11-01&amp;quot;)) %&amp;gt;% 
  botornot()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [32m↪[39m [38;5;244mCounting features in text...[39m
## [32m↪[39m [38;5;244mSentiment analysis...[39m
## [32m↪[39m [38;5;244mParts of speech...[39m
## [32m↪[39m [38;5;244mWord dimensions started[39m
## [32m✔[39m Job&amp;#39;s done!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 3
##   screen_name    user_id   prob_bot
##   &amp;lt;chr&amp;gt;          &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt;
## 1 josemariasiota 288661791    0.469&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
  </channel>
</rss>
