Preprocessing of Data

Data is rarely ready for analysis as soon as it is collected. Data preprocessing is therefore a crucial step in the modelling process. It involves cleaning, transforming, and encoding data to prepare it for further analysis. There is no single one-step procedure for data preprocessing; it is an iterative process that involves multiple steps and is highly dependent on the model you are building, the data you are working with, and the objective you are trying to achieve.

Some issues that often need to be addressed during data preprocessing are missing values, outliers, scaling, and encoding of categorical data, whereas other issues are more specific to the model you are building, such as dimensionality reduction and data augmentation.

Missing data: The absence of a value in a feature is a common issue and can arise for various reasons, such as a malfunctioning measurement instrument or respondents not answering a question. Many models cannot handle missing data, so it is important to address this issue before proceeding with the analysis. One option is to remove rows or columns with missing values, but this can lead to a loss of valuable information. Alternatively, the missing values can be imputed, e.g. with the mean of the feature, as sketched below.
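As a minimal sketch of mean imputation with tidymodels, the recipe below replaces missing numeric values with the column mean; the small data frame `df` is a hypothetical example.

```r
library(tidymodels)

# Hypothetical data frame with missing values in the numeric predictors
df <- tibble(
  y  = c(1.2, 3.4, 2.8, 4.1),
  x1 = c(10, NA, 12, 14),
  x2 = c(0.5, 0.7, NA, 0.9)
)

# Recipe step that replaces NA with the mean of each numeric predictor
rec <- recipe(y ~ ., data = df) |>
  step_impute_mean(all_numeric_predictors())

# prep() estimates the means, bake() applies them to the data
bake(prep(rec), new_data = NULL)
```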

Outliers: Outliers are values that deviate markedly from the rest of the data. Whether a given value is a true outlier or merely an extreme observation is often debatable. Outliers are typically values that should not be able to occur, for example due to measurement or recording errors, and they can therefore have a negative impact on the model. It is important to identify and remove such outliers before proceeding with the analysis.
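One common rule of thumb, used here purely as an illustration, flags values more than 1.5 interquartile ranges outside the first and third quartiles:

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers
x <- c(2.1, 2.5, 2.3, 2.8, 2.4, 15.0)  # the last value looks suspicious

q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

is_outlier <- x < lower | x > upper
x[!is_outlier]  # keep only the values within the fences
```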

Data Transformation: For some data sets, the data may be skewed, i.e. not normally distributed, while some models assume normality. A solution can be to take the logarithm or square root of the data. In the field of metagenomics, data is usually compositional, which means that the values are proportions that sum to a constant. To move the data from the simplex to Euclidean space, the data can be transformed with the centered log-ratio transformation, which takes the logarithm of each part divided by the geometric mean of the composition.
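A minimal sketch of the centered log-ratio transformation, assuming strictly positive proportions; the abundance vector is a hypothetical example.

```r
# Centered log-ratio: log of each part divided by the geometric mean,
# which is equivalent to subtracting the mean of the logged values
clr <- function(x) {
  log(x) - mean(log(x))
}

# Hypothetical relative abundances of four taxa (sum to 1)
abundances <- c(0.50, 0.30, 0.15, 0.05)
clr(abundances)
```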

Scaling: Many machine learning models are sensitive to the scale of the data. For example, Principal Component Analysis finds directions of maximal variance, and if the values along one axis are much larger than the values along another, the variance will be dominated by the first axis, as showcased in the chapter on PCA (12  Principal Component Analysis). This can lead to misleading results, so it is important to scale the data before applying such models. A common way to scale is standardization, which transforms the data to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation.
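Standardization is available as a recipe step in tidymodels; the sketch below assumes a small hypothetical data frame with predictors on very different scales.

```r
library(tidymodels)

df <- tibble(
  y  = c(1.1, 2.0, 2.9, 4.2),
  x1 = c(100, 200, 300, 400),      # large scale
  x2 = c(0.01, 0.02, 0.03, 0.04)   # small scale
)

# step_normalize() subtracts the mean and divides by the standard deviation
rec <- recipe(y ~ ., data = df) |>
  step_normalize(all_numeric_predictors())

bake(prep(rec), new_data = NULL)
```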

Encode categorical data: Categorical data is often stored as string values, which many models cannot handle. The issue is solved by encoding the data, e.g. with one-hot encoding: a categorical feature with k categories is transformed into k binary features, where exactly one is 1 and the others are 0.
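A minimal sketch of one-hot encoding with tidymodels; the `color` column is a hypothetical categorical feature.

```r
library(tidymodels)

df <- tibble(
  y     = c(1, 0, 1, 0),
  color = factor(c("red", "green", "blue", "red"))
)

# one_hot = TRUE keeps all k binary columns instead of dropping one
rec <- recipe(y ~ ., data = df) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

bake(prep(rec), new_data = NULL)
```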

Dimensionality reduction: Data of high dimensionality, often with more features than observations, can impose challenges for the analysis. Some models, such as ordinary least squares, simply do not work in this setting, whereas others become computationally expensive or prone to overfitting. The more features there are, the more observations are needed to fill out the feature space. Dimensionality reduction reduces the number of features in the data while preserving the most important information. Principal Component Analysis and Uniform Manifold Approximation and Projection are common methods for dimensionality reduction, and simple feature selection also reduces the dimensionality. All of these approaches also reduce the computational cost.
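As a small sketch, PCA can be added as a recipe step; here the built-in mtcars data set stands in for a wider, high-dimensional data set.

```r
library(tidymodels)

# Standardize the predictors, then keep the first three principal components
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

bake(prep(rec), new_data = NULL)
```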

Data Augmentation: When working with small data sets, the model may overfit. Data augmentation increases the size of the data set by creating new data points from the existing data. For image data, for instance, the data set can be enlarged by rotating, flipping, or zooming the images, which can lead to a more robust model.
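As an illustrative sketch only, the magick package can create such augmented variants of an image; the file path is a hypothetical placeholder.

```r
library(magick)

# Hypothetical image file; replace with a real path on disk
img <- image_read("path/to/image.png")

# Simple augmented variants of the same image
rotated <- image_rotate(img, 15)      # rotate by 15 degrees
flipped <- image_flop(img)            # mirror horizontally
zoomed  <- image_scale(img, "150%")   # enlarge by 50 percent
```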

In the two final chapters of this book (15  Linear Modelling and 16  Non-Linear Modelling), the disease status of individuals is predicted based on their gut microbiota. Such data is usually high-dimensional, and some of the data preprocessing steps mentioned above are crucial to obtain a good model. There are usually no missing values given the pipeline the data was generated from, but the data is compositional and needs to be transformed. The data is also high-dimensional, and dimensionality reduction techniques are used to extract the important information from it. Hence, the next chapters cover different dimensionality reduction techniques, which are also available in the tidymodels package.