Written by Vered Smith, Science Writer
A recent review has detailed five common pitfalls that occur when applying machine learning in genomics, and how to address them.
Machine learning (ML) is used to analyse genomic, epigenomic, transcriptomic, and proteomic data. However, the models and evaluations used in ML software are built on assumptions that are often not met in a biological context. This review, published in Nature Reviews Genetics, offers strategies for tackling these pitfalls so that supervised learning can be applied accurately in genomics research.
Pitfall 1: Distributional Differences
Examples used for analysis might not be identically distributed, meaning that the probability of observing one value in a specific sample is not the same as observing it in another sample. In genomics, this may be for a few reasons. Firstly, biological structures are often different. For example, euchromatin and heterochromatin have different epigenetic profiles. Secondly, this may occur because of different biological contexts between two datasets, such as different cell types or different species. Thirdly, different study designs or technical factors can affect the mean and variability of a dataset.
When the training and test sets share one distribution but the prediction set has another, the relationships learnt between features and outcomes may not hold for the prediction set. Distributional differences can be identified by assessing the marginal distribution of the features through visualisation, statistical tests, or model-based techniques that detect outliers and anomalies. How best to deal with distributional differences is an area of ongoing research, with possible strategies including batch correction methods and adversarial learning.
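As a rough illustration of the statistical-test route, the sketch below compares the marginal distribution of a single feature between a training set and a prediction set using a two-sample Kolmogorov-Smirnov test from SciPy. The data are synthetic, and the helper name `differs` and the threshold `alpha` are assumptions made for this example, not something taken from the review.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature measured in two datasets (assumed data).
train_feature = rng.normal(loc=0.0, scale=1.0, size=500)
predict_feature = rng.normal(loc=1.0, scale=1.5, size=500)


def differs(a, b, alpha=0.01):
    """Return True if a two-sample KS test rejects 'same distribution'."""
    result = ks_2samp(a, b)
    return bool(result.pvalue < alpha)


# Hypothetical check: does the prediction set look different from training?
shifted = differs(train_feature, predict_feature)
# Sanity check: a sample compared with itself should not be flagged.
same = differs(train_feature, train_feature)
```

In practice a test like this would be run per feature, with some correction for multiple testing, and combined with visual inspection rather than used alone.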
Pitfall 2: Dependent Examples
Common ML models and cross-validation assume that the values of each example are independent. However, samples are often dependent. In genomics, this can occur in interactions, such as between regulators and genes, or enhancers and promoters. It could even be two genomic loci in linkage disequilibrium.
To recognise whether data contains this pitfall, the reviewers recommended examining it for inherent dependencies before applying ML tools. Using Cytoscape, R or Python to transform tabular data into graphs whose nodes represent biological entities can help, because connected nodes can then be identified as dependent on each other.
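One way to picture the graph-based check in Python is sketched below. It turns a small interaction table into an undirected graph and labels each connected component, treating entities in the same component as potentially dependent. The interaction table and the function name `connected_components` are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical enhancer-promoter interaction table (one row per interaction).
interactions = [
    ("enh1", "promA"),
    ("enh2", "promA"),
    ("enh3", "promB"),
    ("enh4", "promC"),
    ("enh4", "promB"),
]


def connected_components(edges):
    """Map every node to the id of the connected component it belongs to."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    group_of = {}
    next_group = 0
    for start in adjacency:
        if start in group_of:
            continue
        # Breadth-first search labels everything reachable from `start`.
        group_of[start] = next_group
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in group_of:
                    group_of[neighbour] = next_group
                    queue.append(neighbour)
        next_group += 1
    return group_of


group_of = connected_components(interactions)
```

The resulting component ids could serve as group labels when splitting data, so that connected entities are not spread across training and test sets.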
To solve this problem, the best option may be to take dependence into account, and so mitigate overfitting, when evaluating the model. Group k-fold cross-validation, also known as blocking, keeps dependent examples together in the same fold and can handle the simpler cases. Another option is to use a method that models covariance between examples, such as biostatistical mixed-effects models. Lastly, if specific features are causing the dependence, it may be possible to reformulate the problem so that it involves less dependence.
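A minimal sketch of group k-fold cross-validation using scikit-learn's `GroupKFold` is shown below. The data are random, and the `groups` array (12 hypothetical blocks of five dependent examples each) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic data: 60 examples, 5 features, binary outcome (assumed).
n_examples = 60
X = rng.normal(size=(n_examples, 5))
y = rng.integers(0, 2, size=n_examples)

# Hypothetical grouping: 12 blocks of 5 dependent examples each,
# e.g. examples drawn from the same locus or the same interaction cluster.
groups = np.repeat(np.arange(12), 5)

cv = GroupKFold(n_splits=4)
shared = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Count the groups that appear on both sides of the split.
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    shared.append(len(overlap))
```

Each entry of `shared` is zero because no group is split between the training and test folds, which is the property that blocking is meant to provide.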
Pitfall 3: Confounding
A confounder is an unmeasured or artefactual variable that either creates or masks associations with an outcome. This is one of the hardest pitfalls to identify. It happens because the confounder induces dependence between the features and the outcome, but is thought to be insignificant so is not included in the model. When the model is applied to an example where the confounder is missing, or distributed differently, this can lead to incorrect interpretations. There are many confounders in biology that are hard to identify. For example, predicting 3D chromatin interaction data from epigenetic features is confounded by genomic distance between loci.
Cross-validation cannot protect against confounding, because the confounder is present in both the training and the test sets. What can be done is randomising examples with respect to expected confounders, such as experimental batches. If this is not possible, potential confounding variables can be measured directly, or statistical approaches such as principal component analysis, which helps visualise high-dimensional data, can be used to capture unmeasured confounders. Either option then allows the variable to be included in the ML model, which reduces its effect.
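The sketch below illustrates, on simulated data, how a leading principal component can pick up a batch-like confounder and then be added to the feature matrix as a covariate. The batch effect, its size, and all variable names are assumptions for this example. Real confounders are rarely this clean, so this is a toy picture rather than a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 100 samples x 50 features, with a hypothetical batch
# effect that shifts every feature in the second half of the samples.
n_samples, n_features = 100, 50
batch = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features)) + 3.0 * batch[:, None]

# Principal components via SVD of the column-centred matrix.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
pc1 = U[:, 0] * S[0]  # sample scores on the first principal component

# How strongly does PC1 track the (normally unobserved) batch label?
corr = np.corrcoef(pc1, batch)[0, 1]

# One option: append PC1 as an extra covariate for a downstream model.
X_with_covariate = np.column_stack([X, pc1])
```

The sign of a principal component is arbitrary, so only the magnitude of `corr` is informative here.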
Pitfall 4: Leaky Preprocessing
The fourth pitfall is information leaking from the test set into the training set, which occurs when the training data is processed in a way that depends on the test set. This causes dependence between the examples, and is a specialised case of pitfall 2, detailed above. In genomics, this can occur any time a data transformation is applied to multiple examples together. Both unsupervised embedding approaches, such as principal component analysis, and supervised feature selection can result in leakage if they are applied to the full dataset before it is split for cross-validation. It is also problematic for unsupervised ML methods, including clustering.
This can be solved by using ML toolkits to learn parameters only from the training set, and then applying the transformation to the training and test sets separately. This workflow is already supported by scikit-learn’s Pipeline and transformer API, and by the preProcess parameter of the caret package’s train() function.
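A minimal sketch of the scikit-learn approach is given below. Scaling and PCA are wrapped in a pipeline together with a classifier, so that within each cross-validation fold the preprocessing parameters are learnt from the training portion only. The random data, the choice of five components, and the logistic regression model are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 120 examples, 30 features, binary outcome (assumed).
X = rng.normal(size=(120, 30))
y = rng.integers(0, 2, size=120)

# The scaler and PCA are fitted inside each fold, on training rows only,
# and then applied to that fold's held-out rows.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, X, y, cv=5)
```

The leaky alternative would be to call `fit_transform` on the scaler or PCA with the whole of `X` first, and only then cross-validate the classifier on the transformed data.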
Pitfall 5: Unbalanced Classes
Classes are unbalanced when examples are not evenly distributed across values of the outcome. Very few datasets are perfectly balanced, and some datasets in biology are extremely unbalanced. For example, a study predicting deleterious non-coding variants had 400 positive values and 14 million negative values for Mendelian disease. This can result in a model that over-learns the majority class and under-learns the minority class. This is especially important when developing models that must avoid false negatives, such as those detecting disease. Unbalanced data can also affect the performance of regression tasks, where labels are continuous values rather than discrete classes.
Researchers can address this issue by using rebalancing to increase minority class performance. The three basic approaches are undersampling the majority class, oversampling the minority class, and weighting examples. These strategies increase accuracy of the minority class, but decrease it for the majority class. This is often the outcome that is wanted, although it does have certain disadvantages that must be considered.
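Two of these approaches, undersampling the majority class and weighting examples, can be sketched with NumPy as follows. The class sizes (20 positives, 980 negatives) and feature matrix are made up for illustration. The weighting formula used is the common inverse-frequency heuristic, n_samples / (n_classes × class count), which is an assumption of this sketch and not necessarily the scheme the reviewers had in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unbalanced labels: 20 positives, 980 negatives (assumed).
y = np.array([1] * 20 + [0] * 980)
X = rng.normal(size=(y.size, 4))

# --- Undersampling: keep all positives, sample an equal number of negatives.
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
kept_neg = rng.choice(neg_idx, size=pos_idx.size, replace=False)
balanced_idx = np.concatenate([pos_idx, kept_neg])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]

# --- Weighting: inverse-frequency class weights, one weight per example.
counts = np.bincount(y)                        # [980, 20]
class_weight = y.size / (counts.size * counts)
sample_weight = class_weight[y]
```

Many estimators accept per-example weights like `sample_weight` at fitting time, which keeps every example in play instead of discarding most of the majority class.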
These five pitfalls are easy mistakes to make, yet subtle enough to go undetected. It is therefore crucial to implement the strategies discussed to avoid them. As the reviewers wrote, “This is easier said than done, but the trustworthiness of ML in biomedical research depends on these strategies.”