Load packages.
library("tidymodels")
tidymodels::tidymodels_prefer()
Each chapter in this part delves into an area of biostatistics. The part includes chapters on diversity measures and hypothesis testing, and several chapters on models such as Ordinary Linear Regression. The chapters share a similar structure. First, a commonly occurring problem is introduced, followed by an explanation of the method and how it solves the problem. Then the math behind the method is explained, and lastly a visual representation is given. Deviations from this structure may occur, e.g. when a text-based introduction is insufficient.
In the first part of the book, the iris data set was used to exemplify the usage of the tidymodels package. Some of the following chapters reuse the iris data set, whereas others use the equally common mtcars data set. For the chapters pertaining to diversity measures, a count matrix from a metagenomics study is used instead. The raw data can be obtained from (Cox et al. 2021). Each row is a sample from a person, each column is a genus, and each value is the abundance of that genus in that sample. Explaining how the count matrix is derived from raw sequences is out of scope for this book. The scripts used for the preprocessing are available in the src/ sub-folder of the GitHub repository from which this book is created.
A snippet of the data set can be seen below:
count_matrix <- readr::read_rds("https://github.com/WilliamH-R/BioStatistics/raw/main/data/count_matrix/count_matrix.rds")
count_matrix
# A tibble: 456 × 124
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia Alistipes
   <chr>             <int>         <int>        <int>       <int>     <int>
 1 SRR14214860           0            11          190         486       272
 2 SRR14214861           0            12            0           5      1158
 3 SRR14214862           0             0            0           0         0
 4 SRR14214863           4             0          505          94       361
 5 SRR14214864           0            45          744           3       794
 6 SRR14214865           0            29          924         933        45
 7 SRR14214866           0            41         1187         308       145
 8 SRR14214867           0            45          648         323       193
 9 SRR14214868           0             0         6973           0       287
10 SRR14214869           0             0          362         270        30
# ℹ 446 more rows
# ℹ 118 more variables: Anaerofustis <int>, Anaerostipes <int>,
# Anaerotruncus <int>, Angelakisella <int>, Bacteroides <int>,
# Barnesiella <int>, Bifidobacterium <int>, Bilophila <int>, Blautia <int>,
# Butyricicoccus <int>, Butyricimonas <int>, `CAG-352` <int>, `CAG-56` <int>,
# `Candidatus Soleaferrea` <int>, `Candidatus Stoquefichus` <int>,
# Catenibacillus <int>, Catenibacterium <int>, …
To speed up data processing and to keep the results easy to read, a small subset of the count matrix (the first 10 samples and the first four genera) is sometimes used as test data in the examples.
count_matrix_test <- count_matrix |>
  slice_head(n = 10) |>
  select(Sample, Actinomyces,
         Adlercreutzia, Agathobacter, Akkermansia)
count_matrix_test
# A tibble: 10 × 5
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia
   <chr>             <int>         <int>        <int>       <int>
 1 SRR14214860           0            11          190         486
 2 SRR14214861           0            12            0           5
 3 SRR14214862           0             0            0           0
 4 SRR14214863           4             0          505          94
 5 SRR14214864           0            45          744           3
 6 SRR14214865           0            29          924         933
 7 SRR14214866           0            41         1187         308
 8 SRR14214867           0            45          648         323
 9 SRR14214868           0             0         6973           0
10 SRR14214869           0             0          362         270
Microbiome data is inherently compositional: the total count in a sample is fixed by the sequencing rather than by the true microbial load, so a count only indicates the proportion of a genus relative to the other genera in the same sample. As a consequence, many commonly used statistical tools are not directly applicable to the raw counts. To address this compositionality, the count matrix is transformed using the centered log-ratio (clr) transformation.
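The compositional property can be illustrated with a small toy example. This is a minimal sketch in base R; the matrix toy and its sample and genus names are made up for illustration and are not taken from the study.

```r
# Toy counts for two hypothetical samples with the same genus proportions
# but a tenfold difference in total count (made-up values).
toy <- rbind(
  sample_a = c(genus_1 = 10,  genus_2 = 30,  genus_3 = 60),
  sample_b = c(genus_1 = 100, genus_2 = 300, genus_3 = 600)
)

# Divide every row by its total so that each sample sums to 1
rel_abund <- toy / rowSums(toy)

rel_abund           # relative abundance of each genus per sample
rowSums(rel_abund)  # total per sample after scaling
```

Although the raw counts differ by a factor of ten, both samples end up with the same relative abundances, which is why raw counts cannot be compared directly between samples. The clr-transformed count matrix used in the book is loaded below.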
count_matrix_clr <- readr::read_rds("https://github.com/WilliamH-R/BioStatistics/raw/main/data/count_matrix/count_matrix_clr.rds")
count_matrix_clr
# A tibble: 456 × 124
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia Alistipes
   <chr>             <dbl>         <dbl>        <dbl>       <dbl>     <dbl>
 1 SRR14214860           0         -1.83         1.02        1.96      1.38
 2 SRR14214861           0         -2.11            0       -2.98      2.46
 3 SRR14214862           0             0            0           0         0
 4 SRR14214863       -3.41             0         1.43      -0.255      1.09
 5 SRR14214864           0        -0.495         2.31       -3.20      2.38
 6 SRR14214865           0         -1.52         1.94        1.95     -1.08
 7 SRR14214866           0        -0.835         2.53        1.18     0.428
 8 SRR14214867           0        -0.581         2.09        1.39     0.876
 9 SRR14214868           0             0         3.85           0     0.658
10 SRR14214869           0             0         1.39        1.10     -1.10
# ℹ 446 more rows
# ℹ 118 more variables: Anaerofustis <dbl>, Anaerostipes <dbl>,
# Anaerotruncus <dbl>, Angelakisella <dbl>, Bacteroides <dbl>,
# Barnesiella <dbl>, Bifidobacterium <dbl>, Bilophila <dbl>, Blautia <dbl>,
# Butyricicoccus <dbl>, Butyricimonas <dbl>, `CAG-352` <dbl>, `CAG-56` <dbl>,
# `Candidatus Soleaferrea` <dbl>, `Candidatus Stoquefichus` <dbl>,
# Catenibacillus <dbl>, Catenibacterium <dbl>, …
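The clr transformation itself can be sketched in base R. For a sample x, each value becomes log(x_j) minus the mean of the logs, which equals the log of x_j divided by the geometric mean of the sample. The helper name clr_nonzero and the toy counts are hypothetical. The zero handling (zeros are left out of the geometric mean and kept at 0) is an assumption, chosen only because zero counts appear as 0 in the transformed matrix above; the preprocessing scripts in src/ may handle this differently.

```r
# Hypothetical helper: clr of a single sample given as a numeric vector.
# Assumption: zero counts are excluded from the geometric mean and stay 0.
clr_nonzero <- function(x) {
  out <- numeric(length(x))          # initialised to 0, so zeros stay 0
  nz <- x > 0                        # positions with a non-zero count
  if (any(nz)) {
    log_x <- log(x[nz])
    out[nz] <- log_x - mean(log_x)   # same as log(x / geometric mean)
  }
  out
}

# Toy counts, not taken from the study
toy_counts <- c(0, 10, 100, 1000)
toy_clr <- clr_nonzero(toy_counts)
toy_clr
```

Under this assumption the transformed values of a sample sum to zero. Note that a transformed value of 0 is then ambiguous: it can mean either a zero count or a count equal to the geometric mean of the sample.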
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.3 (2024-02-29 ucrt)
os Windows 11 x64 (build 22631)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United Kingdom.utf8
ctype English_United Kingdom.utf8
tz Europe/Copenhagen
date 2024-05-30
pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.1)
broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.3)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.3)
class 7.3-22 2023-05-03 [2] CRAN (R 4.3.3)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.3)
codetools 0.2-19 2023-02-01 [2] CRAN (R 4.3.3)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.3)
conflicted 1.2.0 2023-02-01 [1] CRAN (R 4.3.3)
data.table 1.15.4 2024-03-30 [1] CRAN (R 4.3.3)
dials * 1.2.1 2024-02-22 [1] CRAN (R 4.3.3)
DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.3.3)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.3)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.3)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.3)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.3)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.3)
furrr 0.3.1 2022-08-15 [1] CRAN (R 4.3.3)
future 1.33.2 2024-03-26 [1] CRAN (R 4.3.3)
future.apply 1.11.2 2024-03-28 [1] CRAN (R 4.3.3)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.3)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.3.3)
globals 0.16.3 2024-03-08 [1] CRAN (R 4.3.3)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.3)
gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.1)
GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.3.3)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.3.3)
hardhat 1.3.1 2024-02-02 [1] CRAN (R 4.3.3)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.3)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.3.3)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.3)
infer * 1.0.7 2024-03-25 [1] CRAN (R 4.3.3)
ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.3)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.3)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.3)
knitr 1.46 2024-04-06 [1] CRAN (R 4.3.3)
lattice 0.22-5 2023-10-24 [2] CRAN (R 4.3.3)
lava 1.8.0 2024-03-05 [1] CRAN (R 4.3.3)
lhs 1.1.6 2022-12-17 [1] CRAN (R 4.3.3)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.3)
listenv 0.9.1 2024-01-29 [1] CRAN (R 4.3.3)
lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.3.3)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.3)
MASS 7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.3)
Matrix 1.6-5 2024-01-11 [2] CRAN (R 4.3.3)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.3)
modeldata * 1.3.0 2024-01-21 [1] CRAN (R 4.3.3)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.3)
nnet 7.3-19 2023-05-03 [2] CRAN (R 4.3.3)
parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.3.3)
parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.3.3)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.3)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.3)
prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.3)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.3)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.3)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.3)
readr 2.1.5 2024-01-10 [1] CRAN (R 4.3.3)
recipes * 1.0.10 2024-02-18 [1] CRAN (R 4.3.3)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.3)
rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.3)
rpart 4.1.23 2023-12-05 [2] CRAN (R 4.3.3)
rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.3.3)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.3)
scales * 1.3.0 2023-11-28 [1] CRAN (R 4.3.3)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.3)
survival 3.5-8 2024-02-14 [2] CRAN (R 4.3.3)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.3)
tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.3.3)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.3)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.3)
timeDate 4032.109 2023-12-14 [1] CRAN (R 4.3.2)
tune * 1.2.1 2024-04-18 [1] CRAN (R 4.3.3)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.3)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.3)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.3)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.3)
workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.3.3)
workflowsets * 1.1.0 2024-03-21 [1] CRAN (R 4.3.3)
xfun 0.43 2024-03-25 [1] CRAN (R 4.3.3)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.3.3)
[1] C:/Users/Willi/AppData/Local/R/win-library/4.3
[2] C:/Program Files/R/R-4.3.3/library
──────────────────────────────────────────────────────────────────────────────