BioStatistics

Author

William Hagedorn-Rasmussen

Published

July 5, 2026

Preface

This book introduces the concept of Tidy Modelling through the tidymodels package and is designed as a practical guide to the tools and techniques used in the Tidy Modelling framework. It is aimed at readers who are familiar with the basics of R and the tidyverse and want to learn how to use tidymodels to build and evaluate models. In addition, a series of statistical concepts often used in Bioinformatics and Biostatistics are introduced in the context of Tidy Modelling.

The book is divided into four parts. The first part introduces the tidymodels package and its core principles: data splitting, pre-processing, model building, and evaluation. The second part introduces statistical concepts such as hypothesis testing, linear regression, logistic regression, and diversity measures. Each concept follows the same structure: a problem is presented, followed by a description of the method and how it solves that problem; then the math behind the method is introduced, and lastly a visual representation is given. Deviations from this structure are made when necessary. The third part contains commonly applied pre-processing techniques that focus on dimensionality reduction and are therefore useful when analysing data with many variables.

The final part combines the concepts introduced in the first three parts to build and evaluate models for a real-world dataset, taken from a study on the microbiome and disease prediction. The raw data are 16S rRNA amplicon reads used to study multiple sclerosis (MS) (Cox et al. 2021). The samples were pre-processed using the pipeline described in the GitHub repository from which this book is created. Pre-processing yielded a count matrix with 456 observations (samples) and the abundances of 123 genera. For each observation, the sex and age of the person are known, among other information. All of these features are used to build several models that try to predict the disease status of the person.