bioinformatics | B101nfo

experDesign: follow up

Sun, 09 Apr 2023 00:00:00 +0000

I am happy to announce a new release of experDesign. Install it from CRAN with:

install.packages("experDesign")
library("experDesign")

This new release focuses on two trickier aspects of designing an experiment:

Checking the samples of your experiment.
How to continue stratifying your conditions after an initial batch.

Use these functions once you have collected your samples but before carrying out any experimental work. They help you make an informed decision about what might happen with your experiment.

Checking your samples

The new function check_data() will warn you if it finds some known issues with your data.

library("experDesign")
library("MASS")

If we take the survey dataset from the MASS package we can see that it has some issues:

data(survey, package = "MASS")
check_data(survey)
## Warning: Two categorical variables don't have all combinations.
## Warning: Some values are missing.
## Warning: There is a combination of categories with no replicates; i.e. just one
## sample.
## [1] FALSE

If we fabricate our own dataset, we might also realize we have a problem:

rdata <- expand.grid(sex = c("M", "F"), class = c("lower", "median", "high"))
stopifnot("Same samples/rows as combinations of classes" = nrow(rdata) == 2*3)
check_data(rdata)
## Warning: There is a combination of categories with no replicates; i.e. just one
## sample.
## [1] FALSE
# We create some new samples with the same conditions
rdata2 <- rbind(rdata, rdata)
check_data(rdata2)
## [1] TRUE

One might decide to go ahead with what is available, use only some of those samples, or wait to collect more samples for the experiment.
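To see which combinations are behind the "no replicates" warning, a plain cross-tabulation in base R can help. This is a minimal sketch that does not use experDesign and is not how check_data() works internally. The names counts, counts2 and no_replicates are only illustrative.

```r
# Rebuild the fabricated dataset: one row per sex/class combination
rdata <- expand.grid(sex = c("M", "F"), class = c("lower", "median", "high"))

# Count how many samples fall in each combination of categories
counts <- table(rdata$sex, rdata$class)

# A combination with exactly one sample has no replicates
no_replicates <- sum(counts == 1)

# Duplicating the rows gives every combination a replicate
rdata2 <- rbind(rdata, rdata)
counts2 <- table(rdata2$sex, rdata2$class)
```

Printing counts or counts2 shows the number of samples per combination, which makes it easier to decide whether to collect more samples or drop some.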

Follow up

Imagine you have 100 samples that you distribute in 4 batches of 25 samples each. Later, you collect 80 more samples to analyze. You want these new samples to be analyzed together with those previous 100 samples. Will it be possible? How should you distribute your new samples in groups of 25?
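Before looking at how to stratify, simple arithmetic already answers part of the question. The sketch below only computes how many batches the 80 new samples need and one way to spread them evenly. It is plain base R, not what follow_up() does, and the variable names are illustrative.

```r
new_samples <- 80
batch_size <- 25

# Minimum number of batches so that none exceeds the batch size
n_batches <- ceiling(new_samples / batch_size)

# Spread the samples as evenly as possible across those batches
sizes <- rep(new_samples %/% n_batches, n_batches)
extra <- new_samples %% n_batches
sizes[seq_len(extra)] <- sizes[seq_len(extra)] + 1
```

Here the 80 samples need 4 batches, and an even split puts 20 samples in each instead of three full batches and one with only 5.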

Using the same dataset from MASS, imagine that we first collected 118 observations and later 119 more:

survey1 <- survey[1:118, ]
survey2 <- survey[119:nrow(survey), ]
# Using a low number of iterations to speed up the process;
# in practice, use a number even higher than the default
fu <- follow_up(survey1, survey2, size_subset = 50, iterations = 10)
## Warning: There are some problems with the data.
## Warning: There are some problems with the new samples and the batches.
## Warning: There are some problems with the new data.
## Warning: There are some problems with the old data.

Like check_data(), follow_up() reports whether there are problems with the observations. You can check each collection with check_data() to learn more about the problems found.

If you have already performed the experiment on your observations you can also check the distribution:

# Create the first batch
variables <- c("Sex", "Smoke", "Age")
survey1 <- survey1[, variables]
index1 <- design(survey1, size_subset = 50, iterations = 10)
## Warning: There might be some problems with the data use check_data().
r_survey <- inspect(index1, survey1)
# Create the second batch with "new" students
survey2 <- survey2[, variables]
survey2$batch <- NA
# Prepare the follow up
all_classroom <- rbind(r_survey, survey2)
fu2 <- follow_up2(all_classroom, size_subset = 50, iterations = 10)
## Warning: There are some problems with the data.
## Warning: There are some problems with the new samples and the batches.
## Warning: There are some problems with the new data.
## Warning: There are some problems with the old data.
tail(fu2)
## [1] "NewSubset2" "NewSubset2" "NewSubset2" "NewSubset2" "NewSubset2"
## [6] "NewSubset3"

This function helps you decide which new observations go to which new batches.

Closing remarks

The famous quote from Fisher goes:

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

This emphasizes the importance of involving a statistician early on in the experimental design process.
Unfortunately, in some cases it may be too late to involve a statistician in the experimental design process, or unforeseen circumstances may have upset the design of your carefully planned experiment.

My aim with this package is to provide practical tools for statisticians, bioinformaticians, and anyone who works with data. These tools are designed to be easy to use and can be used to analyze data in a variety of contexts. Let me know if it is helpful in your case.

Reproducibility

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.2 (2022-10-31)
##  os       Ubuntu 22.04.2 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2023-04-09
##  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version  date (UTC) lib source
##  blogdown      1.16     2022-12-13 [1] CRAN (R 4.2.2)
##  bookdown      0.33     2023-03-06 [1] CRAN (R 4.2.2)
##  bslib         0.4.2    2022-12-16 [1] CRAN (R 4.2.2)
##  cachem        1.0.7    2023-02-24 [1] CRAN (R 4.2.2)
##  cli           3.6.1    2023-03-23 [1] CRAN (R 4.2.2)
##  digest        0.6.31   2022-12-11 [1] CRAN (R 4.2.2)
##  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.2)
##  experDesign * 0.2.0    2023-04-05 [1] CRAN (R 4.2.2)
##  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.2.2)
##  htmltools     0.5.4    2022-12-07 [1] CRAN (R 4.2.2)
##  jquerylib     0.1.4    2021-04-26 [1] CRAN (R 4.2.2)
##  jsonlite      1.8.4    2022-12-06 [1] CRAN (R 4.2.2)
##  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.2)
##  MASS        * 7.3-58.1 2022-08-03 [2] CRAN (R 4.2.2)
##  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.2)
##  rlang         1.1.0    2023-03-14 [1] CRAN (R 4.2.2)
##  rmarkdown     2.20     2023-01-19 [1] CRAN (R 4.2.2)
##  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.2)
##  sass          0.4.5    2023-01-24 [1] CRAN (R 4.2.2)
##  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.2.2)
##  xfun          0.37     2023-01-31 [1] CRAN (R 4.2.2)
##  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.2)
## 
##  [1] /home/lluis/bin/R/4.2.2
##  [2] /opt/R/4.2.2/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Starting as a pet bioinformatician

Thu, 26 Jan 2023 00:00:00 +0000

On my other blog, I posted some views as a pet bioinformatician. I am now starting a new position, this time with more experience, and I think these notes will help me and other people in the same position.

Background

In my last job and my current work, I am at a research center close to hospitals (this one is so close that we come in by the maternity emergency door). These research institutes are not tied to a university, so there isn’t any teaching associated with the positions, although a few people teach at one university or another. The research is translational, meaning that most of the wet lab work is done with primary cultures (with samples from patients).

In Spain, health data is specially protected and mostly electronic, spread across different systems. Despite the high level of protection, and for multiple reasons that I don’t want to write about now, some hackers have managed to block hospitals or networks related to healthcare institutions. This led to increased security concerns among the IT teams.

At the same time, tweets like this one show that accessing data between hospitals and healthcare systems is still very difficult:

An example of how difficult it is to use the data: a researcher was using the hospital’s health care system to follow a patient’s progress through the hospital, to know when to go and collect a sample. At some point, they realized the patient was behind schedule, so they waited to see what happened. Some time later, close to an hour, we received a call from the doctor asking us to close the windows where we were checking the patient data so that they could access it and start the procedure.

I think that doctors spend more time filling in and navigating the information system than they should. At the same time, I recognize that moving to a newer, better(?) system is not easy. The IT team is caught up in the daily struggle to keep old Windows NT systems working for specific machines.

Organization

You might be wondering why I start by describing the situation, and the answer is very simple. Some companies might be using deep learning, AI, or ML, and your university might have taught you how to use, write and analyze them, but you will be dealing with many mundane tasks. It is usually said that 90% of analysis time is spent cleaning data. Similarly, 90% of your job as a bioinformatician will be making sense of the chaos around the data you need.

To convert this chaos into something valuable you must know what your team expects from you. It can range from "I need a data guy or a biostatistician for my projects" to something more specific, like being the system operator of their server or cloud services. As with any role, your responsibilities might be flexible: you could be told one thing at the interview and later be asked to do something different.

When you start in a place, learn what is available. There might be a bioinformatics, genomics or transcriptomics service, a data management office, and for sure an IT department that you’ll need to work with to get what you need. They might also have agreements with other research institutes or facilities.

Ask around and you will find out.

Figure 1: Adaptation of a meme. Original by Roger Skaer.

When you ask around you will get to know some important names. Some of them might be the IT staff or a boss whose approval you need to get what you want. Other key people will be those who are a reference for others at your institution for this kind of thing, or those with more experience and prestige. Meet them and ask what the best way to cover your needs is. You might need to compromise, and you will probably be far from the ideal position. But start changing things for the better and with time it will come.

Computer

One of the first fights you’ll have is getting a computer. If you are lucky you might inherit a good computer from someone who previously held your position or a similar one. If you receive a standard office computer you are in trouble: you will need to pitch which software, and probably which hardware, would cover your needs, which might be a problem if you are the first in this position.

You can use cloud services but then you’ll need to be aware of the costs and setup of said services. Probably there will be some kind of cheaper option in your institution or on campus but it might not fit your needs or it might be hard to reach. If you go the local route you’ll have to decide which workstation or server you need to buy. I wish I had learned how to decide which computer is better.

If you want to buy a computer you’ll need to know if there is any kind of deal with a specific company: some institutions have a deal with a provider such as Dell, Lenovo, or HP. If your requirements are important enough you might be able to buy from a different provider.

If you require some specific hardware not available from your provider, you’ll need to find a shop to buy it from. Find a company that delivers to your country at reasonable prices. You could probably get a cheaper machine if you built the computer yourself, but that depends on how familiar you are with building computers and how sure you are that all the components are compatible.

Future

By the time you leave that position, you might need to hand over the maintenance of the hardware, the cloud structure you set up, or the code you wrote. Document everything and be ready for a bumpy road when you pass on the reins of your kingdom.

Think of yourself as the chief technology officer of a small start-up. It is up to you to decide how you organize the computer requirements, which programs you’ll need, how you will deliver results… In the end, this will be a great experience as you’ll learn multiple things and grow to fill multiple positions.