Mendelian Randomization

Categories of Data Science

Hernán et al.¹ assert that the core activities of the data scientist are, in order of increasing complexity:

Description: Computing summary statistics, to enable large datasets to be easily comprehended. An example would be producing a chart summarizing the characteristics of heart disease patients.
Prediction: Estimating the distribution of one variable conditional on other variables. An example would be calculating a person's odds of heart disease, conditional on their demographics, genetics, and LDL level.
Causal Inference: Reasoning about counterfactuals. An example would be evaluating the evidence in support of a statement like: "If all patients with high LDL were put on statins, occurrence of heart disease would be reduced by \(X\) percent".

Here, we focus on causal inference.

Methods of causal inference

The randomized controlled trial (RCT) is the gold standard causal inference method. RCTs have the advantage of requiring few assumptions, but the disadvantages of being costly and time-consuming. This motivates alternative causal inference techniques, which require more assumptions, but are cheap and fast. Mendelian Randomization (MR) is such a technique.

Mendelian Randomization

As an illustrative example, consider the question of how LDL affects heart disease. We might notice from epidemiological studies that people with heart disease tend to have higher LDL than people without it. We recall, however, that correlation is not causation. Thus we apply MR, in hopes of determining whether there is a true causal effect.
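The gap between correlation and causation can be made concrete with a small simulation. This is a toy sketch, not real epidemiological data: the confounder `u`, the effect sizes, and the variable names are all assumptions chosen for illustration. Here LDL has no causal effect on risk at all, yet the two are clearly associated because both depend on `u`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved confounder (e.g. lifestyle) that raises both LDL and risk.
u = rng.normal(size=n)

# Assumed toy model: LDL depends on the confounder plus independent noise.
ldl = 1.0 * u + rng.normal(size=n)

# Assumed toy model: LDL has NO causal effect on the outcome in this simulated world.
true_effect = 0.0
risk = true_effect * ldl + 1.0 * u + rng.normal(size=n)

# Naive regression slope of risk on LDL: Cov(risk, ldl) / Var(ldl).
# In this model Cov = Var(u) = 1 and Var(ldl) = 2, so the slope is near 0.5.
naive_slope = np.cov(risk, ldl)[0, 1] / np.var(ldl, ddof=1)

print(f"true causal effect: {true_effect:.2f}")
print(f"naive slope:        {naive_slope:.2f}")
```

The naive slope is far from the true effect of zero, which is the situation MR is meant to guard against.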

The three main MR assumptions

The figure below from Hartley et al.² summarizes MR:

We apply MR to estimate the causal effect of an exposure (e.g. LDL) on an outcome (e.g. heart disease). MR requires a genetic variant ("Genetic Instrument") associated with the exposure.

The validity of MR depends on three main assumptions:

IV1: The instrument must be associated with the exposure.
IV2: The instrument cannot share a cause with the exposure or the outcome.
IV3: The instrument cannot be causally associated with the outcome, except via the exposure.

In MR, natural variation in the genetic instrument plays a role analogous to the random treatment assignment in an RCT. The analogy between MR and an RCT is illustrated in the diagram below, from Zuber et al.³:
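The RCT analogy can be sketched in simulation: a confounder should be balanced across genotype groups (as it would be across trial arms) but unbalanced across groups defined by the observed exposure. The binary carrier variable `z`, the confounder `u`, and all effect sizes below are illustrative assumptions. In real data the confounder is usually unobserved, so this balance cannot be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical confounder of the exposure-outcome relationship.
u = rng.normal(size=n)

# Carrier status for a variant, drawn independently of u (mimics random assignment).
z = rng.binomial(1, 0.3, size=n)

# Assumed toy model: the exposure depends on genotype, the confounder, and noise.
x = 0.5 * z + u + rng.normal(size=n)

# Difference in mean confounder level between carriers and non-carriers.
u_gap_by_genotype = u[z == 1].mean() - u[z == 0].mean()

# Difference in mean confounder level between high- and low-exposure individuals.
high = x > np.median(x)
u_gap_by_exposure = u[high].mean() - u[~high].mean()

print(f"confounder gap across genotype groups: {u_gap_by_genotype:.3f}")
print(f"confounder gap across exposure groups: {u_gap_by_exposure:.3f}")
```

Comparing outcomes across genotype groups therefore avoids the confounding that affects a direct comparison across exposure groups.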

The Fourth MR Assumption

Many texts on MR emphasize IV1, IV2, and IV3. However, these conditions are not sufficient to uniquely determine the causal effect of the exposure on the outcome (see Hernán and Robins, Chapter 16⁴). Another assumption is required. There are numerous possible forms of this fourth assumption, which can largely be grouped into:

Homogeneity assumptions: roughly, the causal effect of the exposure on the outcome is the same for all individuals.
Monotonicity assumptions: roughly, the instrument moves the exposure in the same direction for all individuals.

MR texts tend to ignore the need for a fourth assumption because most practical applications of MR assume that the outcome \(Y\) is a linear function of the exposure \(X\):

\[ \begin{align} Y &= \beta_{Y,X}X +\alpha_1 + \epsilon &\text{ where }\epsilon \sim N(0,\sigma_1) \end{align} \]

Under this linear model, homogeneity holds automatically.
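One way to see why the linear model makes the effect identifiable is the short derivation below. It is a sketch under added assumptions: \(Z\) denotes the genetic instrument (a symbol not used above), and \(\epsilon\) is taken to absorb every cause of \(Y\) other than \(X\), so it may be correlated with \(X\) through confounding but is uncorrelated with \(Z\) when IV2 and IV3 hold. The constant \(\alpha_1\) drops out of the covariance.

```latex
\begin{align}
\operatorname{Cov}(Y, Z) &= \beta_{Y,X}\operatorname{Cov}(X, Z) + \operatorname{Cov}(\epsilon, Z) \\
                         &= \beta_{Y,X}\operatorname{Cov}(X, Z) && \text{since } \operatorname{Cov}(\epsilon, Z) = 0 \text{ under IV2 and IV3} \\
\beta_{Y,X} &= \frac{\operatorname{Cov}(Y, Z)}{\operatorname{Cov}(X, Z)} && \text{well defined since } \operatorname{Cov}(X, Z) \neq 0 \text{ by IV1}
\end{align}
```

Because \(\beta_{Y,X}\) is a single constant, the same effect applies to every individual, which is the sense in which homogeneity holds here.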

Pitfalls in Mendelian Randomization

Horizontal Pleiotropy

A common source of error in Mendelian Randomization occurs when the genetic instrument affects the outcome through a causal pathway that does not involve the exposure. This scenario, which is known as "horizontal pleiotropy", results in a violation of IV3. If the strength of the causal effect through the alternative pathway is large, the results of Mendelian Randomization can be misleading.

The diagram below illustrates horizontal pleiotropy:

graph LR
A[Genetic Instrument] --> B[Protein 1];
B --> C[Exposure];
C --> D[Outcome];
A --> E[Protein 2];
E ----> D;

A genetic variant affects the levels of two proteins. One protein affects the outcome through the exposure, while the other affects the outcome independently of the exposure. If we believe the Omnigenic Model, we should expect this kind of horizontal pleiotropy to be relatively common.

The risk of horizontal pleiotropy is magnified when the connection between the genetic variant and the exposure is complex and indirect. An example would be a study in which the genetic instrument affects neurodevelopment, and the exposure is tobacco use. The risk is reduced when the connection is straightforward and direct. An example would be an MR study in which the exposure is the plasma level of a protein, and the genetic instrument is a cis-regulatory variant for that protein (a cis-pQTL).
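The size of the resulting error can be illustrated with a toy simulation. The sketch below assumes a linear model and estimates the effect as the instrument-outcome covariance divided by the instrument-exposure covariance. The effect sizes `beta`, `gamma`, and `delta` and the helper `ratio_estimate` are hypothetical, chosen only for illustration. Under these assumptions a direct instrument-to-outcome effect `delta` shifts the estimate by roughly `delta / gamma`, so a weak instrument amplifies the bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

beta = 0.3   # assumed true causal effect of exposure on outcome
gamma = 0.5  # assumed effect of the instrument on the exposure
delta = 0.2  # assumed pleiotropic effect of the instrument on the outcome


def ratio_estimate(pleiotropic_effect):
    """Simulate one dataset and return Cov(Y, Z) / Cov(X, Z)."""
    z = rng.binomial(2, 0.3, size=n).astype(float)  # allele count 0, 1, or 2
    u = rng.normal(size=n)                          # hypothetical confounder
    x = gamma * z + u + rng.normal(size=n)          # exposure
    # Outcome: effect via exposure, plus an optional direct path from z.
    y = beta * x + pleiotropic_effect * z + u + rng.normal(size=n)
    return np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]


valid = ratio_estimate(0.0)     # IV3 holds: no direct path
biased = ratio_estimate(delta)  # IV3 violated: direct path of size delta

print(f"true effect:           {beta:.2f}")
print(f"estimate, IV3 holds:   {valid:.2f}")
print(f"estimate, IV3 broken:  {biased:.2f}")
```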

Methods of Mendelian Randomization

The Wald Ratio

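The Wald ratio is the simplest MR estimator for a single genetic instrument \(Z\): it estimates the causal effect \(\beta_{Y,X}\) as the ratio of the instrument-outcome association to the instrument-exposure association, \(\hat{\beta}_{Y,X} = \hat{\beta}_{Y,Z} / \hat{\beta}_{X,Z}\).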
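A minimal sketch of the Wald ratio on simulated data follows, computed as the regression slope of the outcome on the instrument divided by the regression slope of the exposure on the instrument. The data-generating model, the effect sizes, and the helper names `slope` and `wald_ratio` are illustrative assumptions, not taken from any study or library.

```python
import numpy as np


def slope(y, x):
    """Ordinary least squares slope of y on x (with an intercept)."""
    return np.cov(y, x)[0, 1] / np.var(x, ddof=1)


def wald_ratio(z, x, y):
    """Instrument-outcome slope divided by instrument-exposure slope."""
    return slope(y, z) / slope(x, z)


rng = np.random.default_rng(3)
n = 200_000

z = rng.binomial(2, 0.3, size=n).astype(float)  # allele count at one variant
u = rng.normal(size=n)                          # hypothetical confounder
x = 0.5 * z + u + rng.normal(size=n)            # exposure
y = 0.3 * x + u + rng.normal(size=n)            # outcome; assumed true effect 0.3

naive = slope(y, x)       # confounded observational estimate
mr = wald_ratio(z, x, y)  # MR estimate using the instrument

print(f"naive regression slope: {naive:.2f}")
print(f"Wald ratio estimate:    {mr:.2f}")
```

In this toy setting the naive slope overstates the effect because of the confounder, while the Wald ratio recovers a value close to the assumed 0.3.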

Links

Talk on MR by Dr. Jean Morrison

1. Miguel A Hernán, John Hsu, and Brian Healy. A second chance to get causal inference right: a classification of data science tasks. Chance, 32(1):42–49, 2019. URL: https://www.tandfonline.com/doi/abs/10.1080/09332480.2019.1579578%4010.1080/tfocoll.2022.0.issue-teaching-simpsons-paradox.
2. April E Hartley, Grace M Power, Eleanor Sanderson, and George Davey Smith. A guide for understanding and designing Mendelian randomization studies in the musculoskeletal field. Journal of Bone and Mineral Research Plus, 6(10):e10675, 2022. URL: https://academic.oup.com/jbmrplus/article/6/10/e10675/7479111?login=false.
3. Verena Zuber, Nastasiya F Grinberg, Dipender Gill, Ichcha Manipur, Eric AW Slob, Ashish Patel, Chris Wallace, and Stephen Burgess. Combining evidence from Mendelian randomization and colocalization: Review and comparison of approaches. The American Journal of Human Genetics, 109(5):767–782, 2022. URL: https://www.cell.com/ajhg/fulltext/S0002-9297(22)00149-5.
4. Miguel A Hernán and James M Robins. Causal inference. 2010. URL: https://miguelhernan.org/whatifbook.