Applied Biostatistical Methods

Load packages.

library("tidymodels")
tidymodels::tidymodels_prefer()

Each chapter in this part delves into an area of biostatistics: one chapter covers diversity measures, one covers hypothesis testing, and several cover models such as Ordinary Linear Regression. The chapters share a similar structure. First, a commonly occurring problem is introduced, followed by an explanation of the method and how it solves the problem. Next, the math behind the method is explained, and lastly a visual representation is given. Deviations from this structure may occur, e.g. when a text-based introduction is insufficient.

In the first part of the book, the iris data set was used to exemplify the usage of the tidymodels package. Some of the following chapters reuse the iris data set, whereas others use the similarly common mtcars data set. For the chapters pertaining to diversity measures, a count matrix from a metagenomics study is used instead. The raw data can be obtained from (Cox et al. 2021). Each row is a sample from a person, each column is a genus, and each value is the abundance (read count) of that genus in that sample. Explaining how the count matrix is derived from raw sequences is out of scope for this book. The scripts used for the preprocessing are available in the src/ sub-folder of the GitHub repository from which this book is created.

A snippet of the data set can be seen below:

count_matrix <- readr::read_rds("https://github.com/WilliamH-R/BioStatistics/raw/main/data/count_matrix/count_matrix.rds")
count_matrix
# A tibble: 456 × 124
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia Alistipes
   <chr>             <int>         <int>        <int>       <int>     <int>
 1 SRR14214860           0            11          190         486       272
 2 SRR14214861           0            12            0           5      1158
 3 SRR14214862           0             0            0           0         0
 4 SRR14214863           4             0          505          94       361
 5 SRR14214864           0            45          744           3       794
 6 SRR14214865           0            29          924         933        45
 7 SRR14214866           0            41         1187         308       145
 8 SRR14214867           0            45          648         323       193
 9 SRR14214868           0             0         6973           0       287
10 SRR14214869           0             0          362         270        30
# ℹ 446 more rows
# ℹ 118 more variables: Anaerofustis <int>, Anaerostipes <int>,
#   Anaerotruncus <int>, Angelakisella <int>, Bacteroides <int>,
#   Barnesiella <int>, Bifidobacterium <int>, Bilophila <int>, Blautia <int>,
#   Butyricicoccus <int>, Butyricimonas <int>, `CAG-352` <int>, `CAG-56` <int>,
#   `Candidatus Soleaferrea` <int>, `Candidatus Stoquefichus` <int>,
#   Catenibacillus <int>, Catenibacterium <int>, …

To speed up data processing, and to make results less overwhelming, some examples use a small subset of the count matrix as test data: the first ten samples and the first four genera.

count_matrix_test <- count_matrix |> 
  slice_head(n = 10) |> 
  select(Sample, Actinomyces,
         Adlercreutzia, Agathobacter, Akkermansia)
count_matrix_test
# A tibble: 10 × 5
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia
   <chr>             <int>         <int>        <int>       <int>
 1 SRR14214860           0            11          190         486
 2 SRR14214861           0            12            0           5
 3 SRR14214862           0             0            0           0
 4 SRR14214863           4             0          505          94
 5 SRR14214864           0            45          744           3
 6 SRR14214865           0            29          924         933
 7 SRR14214866           0            41         1187         308
 8 SRR14214867           0            45          648         323
 9 SRR14214868           0             0         6973           0
10 SRR14214869           0             0          362         270

Microbiome data is inherently compositional: the total count in a sample is constrained rather than reflecting absolute abundance, so each count only carries information about the proportion of a specific genus relative to the others. As a consequence, many commonly used statistical tools are not directly applicable to the raw counts. To address this compositionality, the count matrix is transformed using the centered log-ratio (clr) transformation.

count_matrix_clr <- readr::read_rds("https://github.com/WilliamH-R/BioStatistics/raw/main/data/count_matrix/count_matrix_clr.rds")
count_matrix_clr
# A tibble: 456 × 124
   Sample      Actinomyces Adlercreutzia Agathobacter Akkermansia Alistipes
   <chr>             <dbl>         <dbl>        <dbl>       <dbl>     <dbl>
 1 SRR14214860        0           -1.83          1.02       1.96      1.38 
 2 SRR14214861        0           -2.11          0         -2.98      2.46 
 3 SRR14214862        0            0             0          0         0    
 4 SRR14214863       -3.41         0             1.43      -0.255     1.09 
 5 SRR14214864        0           -0.495         2.31      -3.20      2.38 
 6 SRR14214865        0           -1.52          1.94       1.95     -1.08 
 7 SRR14214866        0           -0.835         2.53       1.18      0.428
 8 SRR14214867        0           -0.581         2.09       1.39      0.876
 9 SRR14214868        0            0             3.85       0         0.658
10 SRR14214869        0            0             1.39       1.10     -1.10 
# ℹ 446 more rows
# ℹ 118 more variables: Anaerofustis <dbl>, Anaerostipes <dbl>,
#   Anaerotruncus <dbl>, Angelakisella <dbl>, Bacteroides <dbl>,
#   Barnesiella <dbl>, Bifidobacterium <dbl>, Bilophila <dbl>, Blautia <dbl>,
#   Butyricicoccus <dbl>, Butyricimonas <dbl>, `CAG-352` <dbl>, `CAG-56` <dbl>,
#   `Candidatus Soleaferrea` <dbl>, `Candidatus Stoquefichus` <dbl>,
#   Catenibacillus <dbl>, Catenibacterium <dbl>, …
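
For intuition about what the clr transformation does, the following is a minimal sketch in base R. It takes the logarithm of each count and subtracts the mean of the logs within the sample, which is the same as taking the log of each count divided by the sample's geometric mean. The helper `clr_nonzero()` and the toy counts are hypothetical and made up for illustration; they are not taken from the book's preprocessing scripts. The sketch assumes that zero counts are left at 0 and that the geometric mean is taken over the non-zero genera only, which is consistent with the zeros in the output above but may differ from the actual preprocessing.

```r
# Hypothetical helper (an assumption, not the book's preprocessing code):
# clr of one sample, computed over its non-zero parts; zero counts stay 0.
clr_nonzero <- function(x) {
  out <- numeric(length(x))              # start with all zeros
  nonzero <- x > 0
  if (any(nonzero)) {
    log_x <- log(x[nonzero])
    out[nonzero] <- log_x - mean(log_x)  # log(x_i / geometric mean)
  }
  out
}

# Toy counts invented for illustration: 2 samples x 3 genera.
toy_counts <- matrix(
  c(10, 100, 1000,
     0,   5,   20),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_a", "sample_b"),
                  c("genus_1", "genus_2", "genus_3"))
)

# Relative abundances: each row divided by its total.
toy_props <- toy_counts / rowSums(toy_counts)

# Apply the helper to every row; apply() returns one column per sample,
# so the result is transposed back to samples in rows.
toy_clr <- t(apply(toy_counts, 1, clr_nonzero))
colnames(toy_clr) <- colnames(toy_counts)

# The same transformation applied to the proportions.
toy_clr_props <- t(apply(toy_props, 1, clr_nonzero))
colnames(toy_clr_props) <- colnames(toy_counts)

round(toy_clr, 2)
```

In this toy example `sample_a` maps to (-log 10, 0, log 10), and `toy_clr_props` equals `toy_clr`. The clr values therefore depend only on the ratios between genera and not on the total count of the sample, which is why the transformation is suited to compositional data.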

Session Info

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29 ucrt)
 os       Windows 11 x64 (build 22631)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United Kingdom.utf8
 ctype    English_United Kingdom.utf8
 tz       Europe/Copenhagen
 date     2024-05-30
 pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 backports      1.4.1      2021-12-13 [1] CRAN (R 4.3.1)
 broom        * 1.0.5      2023-06-09 [1] CRAN (R 4.3.3)
 cachem         1.0.8      2023-05-01 [1] CRAN (R 4.3.3)
 class          7.3-22     2023-05-03 [2] CRAN (R 4.3.3)
 cli            3.6.2      2023-12-11 [1] CRAN (R 4.3.3)
 codetools      0.2-19     2023-02-01 [2] CRAN (R 4.3.3)
 colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.3)
 conflicted     1.2.0      2023-02-01 [1] CRAN (R 4.3.3)
 data.table     1.15.4     2024-03-30 [1] CRAN (R 4.3.3)
 dials        * 1.2.1      2024-02-22 [1] CRAN (R 4.3.3)
 DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.3.3)
 digest         0.6.35     2024-03-11 [1] CRAN (R 4.3.3)
 dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
 evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.3)
 fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.3)
 fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.3)
 foreach        1.5.2      2022-02-02 [1] CRAN (R 4.3.3)
 furrr          0.3.1      2022-08-15 [1] CRAN (R 4.3.3)
 future         1.33.2     2024-03-26 [1] CRAN (R 4.3.3)
 future.apply   1.11.2     2024-03-28 [1] CRAN (R 4.3.3)
 generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.3)
 ggplot2      * 3.5.1      2024-04-23 [1] CRAN (R 4.3.3)
 globals        0.16.3     2024-03-08 [1] CRAN (R 4.3.3)
 glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.3)
 gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.1)
 GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.3.3)
 gtable         0.3.5      2024-04-22 [1] CRAN (R 4.3.3)
 hardhat        1.3.1      2024-02-02 [1] CRAN (R 4.3.3)
 hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.3)
 htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.3.3)
 htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.3.3)
 infer        * 1.0.7      2024-03-25 [1] CRAN (R 4.3.3)
 ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.3)
 iterators      1.0.14     2022-02-05 [1] CRAN (R 4.3.3)
 jsonlite       1.8.8      2023-12-04 [1] CRAN (R 4.3.3)
 knitr          1.46       2024-04-06 [1] CRAN (R 4.3.3)
 lattice        0.22-5     2023-10-24 [2] CRAN (R 4.3.3)
 lava           1.8.0      2024-03-05 [1] CRAN (R 4.3.3)
 lhs            1.1.6      2022-12-17 [1] CRAN (R 4.3.3)
 lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.3)
 listenv        0.9.1      2024-01-29 [1] CRAN (R 4.3.3)
 lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.3.3)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.3)
 MASS           7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.3)
 Matrix         1.6-5      2024-01-11 [2] CRAN (R 4.3.3)
 memoise        2.0.1      2021-11-26 [1] CRAN (R 4.3.3)
 modeldata    * 1.3.0      2024-01-21 [1] CRAN (R 4.3.3)
 munsell        0.5.1      2024-04-01 [1] CRAN (R 4.3.3)
 nnet           7.3-19     2023-05-03 [2] CRAN (R 4.3.3)
 parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.3.3)
 parsnip      * 1.2.1      2024-03-22 [1] CRAN (R 4.3.3)
 pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.3)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.3)
 prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.3)
 purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.3)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.3)
 Rcpp           1.0.12     2024-01-09 [1] CRAN (R 4.3.3)
 readr          2.1.5      2024-01-10 [1] CRAN (R 4.3.3)
 recipes      * 1.0.10     2024-02-18 [1] CRAN (R 4.3.3)
 rlang          1.1.3      2024-01-10 [1] CRAN (R 4.3.3)
 rmarkdown      2.26       2024-03-05 [1] CRAN (R 4.3.3)
 rpart          4.1.23     2023-12-05 [2] CRAN (R 4.3.3)
 rsample      * 1.2.1      2024-03-25 [1] CRAN (R 4.3.3)
 rstudioapi     0.16.0     2024-03-24 [1] CRAN (R 4.3.3)
 scales       * 1.3.0      2023-11-28 [1] CRAN (R 4.3.3)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.3)
 survival       3.5-8      2024-02-14 [2] CRAN (R 4.3.3)
 tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.3)
 tidymodels   * 1.2.0      2024-03-25 [1] CRAN (R 4.3.3)
 tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.3.3)
 tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.3.3)
 timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.3)
 timeDate       4032.109   2023-12-14 [1] CRAN (R 4.3.2)
 tune         * 1.2.1      2024-04-18 [1] CRAN (R 4.3.3)
 tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.3)
 utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.3)
 vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.3)
 withr          3.0.0      2024-01-16 [1] CRAN (R 4.3.3)
 workflows    * 1.1.4      2024-02-19 [1] CRAN (R 4.3.3)
 workflowsets * 1.1.0      2024-03-21 [1] CRAN (R 4.3.3)
 xfun           0.43       2024-03-25 [1] CRAN (R 4.3.3)
 yaml           2.3.8      2023-12-11 [1] CRAN (R 4.3.2)
 yardstick    * 1.3.1      2024-03-21 [1] CRAN (R 4.3.3)

 [1] C:/Users/Willi/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.3/library

──────────────────────────────────────────────────────────────────────────────