Causality and Machine Learning

Causality and machine learning are two fields that have both made remarkable progress over the last few decades. The progress in machine learning, often also referred to as artificial intelligence, has been particularly visible.

Illustration: Colourbox

The fundamental breakthrough became apparent roughly a decade ago, when an approach often referred to as deep learning started to dramatically surpass the performance previously achieved on complex data such as images. While the underlying workhorse – neural networks trained by backpropagation – had been around for almost half a century, a myriad of improvements in their mathematical setup, in available hardware and in available dataset sizes came together to enable what has been termed deep learning: neural networks learning suitable representations of the data from the data itself. At the same time, it is interesting to note that part of the breakthrough was also to limit what the neural networks were able to learn, through what is often referred to as inductive biases or induced priors. This allowed the networks to better separate noise from what is more likely to represent true patterns of interest in the world, and has given rise to popular components such as convolutional layers and transformers. However, despite these successes, there are still large outstanding challenges in learning models that are interpretable and robust across application settings, especially when very limited data is available.

The progress in causal analysis, sometimes even referred to as the “causal revolution”, has probably gone more under the radar for the general public, but is in many ways equally dramatic. The well-known phrase "association does not imply causation" was for a long time used by scientists as a justification for avoiding the question of whether one thing causes another. In particular, the dominant viewpoint in the field of statistics was that not much more could be said on this topic, and that causal relations could only be learnt through systematic experimentation. Statisticians like Donald Rubin started to challenge this assumption almost 50 years ago, developing and popularizing what is known as the potential outcomes framework. The fundamental insight was that under certain assumptions, which can be reasonable in many scientific contexts, it is indeed possible to disentangle the causal component from an observed association, and thus learn about causal relations from purely observational data. This opened up the possibility of learning about causal effects in settings where experimentation is not possible or not ethical, for instance the effects of drug use in vulnerable groups like children or pregnant women. Another key innovation driving the causal revolution forward took place in the late nineties, when Judea Pearl developed the so-called causal calculus, or do-calculus, within the framework of structural causal models. Do-calculus is simply a collection of three rules that, based on an assumed causal graphical structure, tell us how to compute interventional probabilities, and thus causal effects, from standard conditional probabilities that can be estimated from observational data, whenever this is possible. Building on these innovations and insights, an array of causal inference tools has already been developed, with applications in a wide variety of fields, including economics, biology, medicine and epidemiology.
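
As a concrete illustration of the kind of result this calculus yields (the notation here is standard textbook notation, not taken from the projects described below): if a set of observed variables Z blocks every confounding, or "backdoor", path between a treatment X and an outcome Y in the assumed causal graph, then the interventional distribution follows from purely observational quantities via the adjustment formula P(Y = y | do(X = x)) = Σ_z P(Y = y | X = x, Z = z) · P(Z = z), where the right-hand side contains only conditional and marginal probabilities that can be estimated from data.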

In recent years, there has also been increasing interest in several different ways in which the fields of causality and machine learning may interact. On the one hand, there is interest in how machine learning methods may be used to infer causal effects from observational data under various assumptions about the underlying data-generating process (causal structure learning, double machine learning etc.). On the other hand, and in the opposite direction, there is interest in how the underlying causal structure may influence machine learning (causal versus anti-causal learning, implications for semi-supervised learning etc.). There is also work on how observational (and/or experimental) data from heterogeneous environments may be fused, based on assumptions about the underlying causal structure, in order to successfully answer a query of interest.


Johan Pensar is a researcher in statistics and data science. One of his main research interests is the use of machine learning to learn about the causal structure of a system under study from primarily observational data, that is, what can be said, based on the available data, about the causal relationships between the components (or variables) of the considered system. Geir Kjetil Sandve is a machine learning researcher who is particularly interested in how a priori knowledge about the conceivable underlying causality in a domain may help to learn good machine learning models from less data, as well as to learn models that remain stable when transferred from the setting where they are learnt to a different setting where they are to be applied (e.g. moving from a study setting to clinical application of a learnt biomarker). They are currently collaborating on several projects that combine their complementary expertise in machine learning and causality.

In one project, driven by postdoc Milena Pavlovic, they are exploring the implications that the assumed causal relations in a given domain (adaptive immunity) have for the data efficiency and robustness of machine learning. More specifically, they find that machine learning models for predicting disease state from the DNA sequences of a patient's immune cells should be allowed to exploit biologically stable confounder paths. On the other hand, one should avoid study designs that introduce spurious paths, as these could lead to shortcut learning of predictive biomarkers (machine learning models) that do not carry over to clinical application settings.

In another project, driven by PhD student Ghadi Al Hajj, they have developed a new software framework for simulating data based on a provided causal model. The motivation is the observation that while there is increasing interest in how underlying causal relations and study design choices influence the behavior of machine learning models in complex data settings, current software tools for simulating data from causal models are limited to simple data types and functional relations. To fill this gap, they have developed a new lightweight software framework for simulating data from a defined graphical structure without any restrictions on data complexity or functional relations.
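
To make the underlying idea concrete, here is a minimal sketch in plain Python (using only numpy, with arbitrarily chosen coefficients, and deliberately not using the framework's own interface, which is not described here) of simulating data from the simple causal graph Z -> X -> Y, where the confounder Z also affects Y directly:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n = 10_000

    # Structural assignments (coefficients chosen arbitrarily for illustration)
    Z = rng.normal(size=n)                                  # exogenous confounder
    X = 0.8 * Z + rng.normal(scale=0.5, size=n)             # treatment, depends on Z
    Y = 1.5 * X + 1.0 * Z + rng.normal(scale=0.5, size=n)   # outcome, depends on X and Z

    # Regressing Y on X alone is biased by the open backdoor path X <- Z -> Y;
    # adjusting for Z approximately recovers the causal coefficient 1.5.
    naive = np.polyfit(X, Y, deg=1)[0]
    adjusted = np.linalg.lstsq(np.column_stack([X, Z, np.ones(n)]), Y, rcond=None)[0][0]
    print(f"naive slope: {naive:.2f}, adjusted slope: {adjusted:.2f}")

A full simulation framework replaces such hand-written assignments with a declared graph and user-supplied functional relations, which is what lifts the restriction to simple data types and simple (e.g. linear) relations.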

By Geir Kjetil Sandve, Johan Pensar
Published Feb. 8, 2023 10:10 AM - Last modified Feb. 8, 2023 10:10 AM