
LDSC

Linkage Disequilibrium Score Regression1 (LDSC) is a technique for estimating heritability from GWAS summary statistics. LDSC is ubiquitous, but its usefulness depends on the validity of certain modeling assumptions. This page includes both a high-level summary and a detailed derivation of LDSC.

Intuition and High-level summary

The core idea of LDSC is illustrated below:

*(Figure: ldsc-schematic — Wald test statistics plotted against LD scores for a less heritable trait, left panel, and a more heritable trait, right panel.)*

For each SNP, we compute a quantity called the Linkage Disequilibrium Score, or LD score, which measures the strength of the correlation between that SNP and other SNPs. Two key insights are central to LDSC:

  • SNPs with higher LD scores will tend to be more strongly associated with the GWAS phenotype, and thus will have higher Wald test statistics. This follows because if a SNP is not itself causal, it can be associated with the GWAS phenotype through correlation with a causal SNP. A SNP with a higher LD score has a higher chance of being correlated with a causal SNP than a SNP with a lower LD score.
  • The strength of the relationship between LD score and the Wald test statistic depends on heritability. A more heritable trait (right panel) will exhibit a stronger relationship than a less heritable trait (left panel). This follows because for a more heritable trait, SNPs have a greater influence over the phenotype, so that increasing the extent to which a SNP is in LD with other SNPs will on average increase its association with the phenotype by a larger amount.

Using these two insights, we can estimate heritability from the slope of the regression line of Wald test statistics on LD scores.
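To make this concrete: the derivation below arrives at \(\mathbb{E}(\chi^2_j) \approx \frac{h^2 N}{M} l_j + 1\), so once the regression is fitted, the slope converts directly to a heritability estimate. A minimal sketch with made-up numbers (the \(N\), \(M\), and slope here are illustrative, not from a real GWAS):

```python
# Invert the LDSC relationship E[chi2_j] ≈ (h2 * N / M) * l_j + 1.
N = 50_000       # GWAS sample size (assumed)
M = 1_000_000    # number of SNPs (assumed)
slope = 0.02     # hypothetical fitted slope of chi2 statistics on LD scores

h2_hat = slope * M / N   # heritability estimate implied by the slope
print(h2_hat)            # 0.4
```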

Detailed Derivation of Method

Data-generating model

LDSC assumes the following data generating equation:

\[ \phi = X\beta + \epsilon \]

where:

  • There are \(M \gg 1\) genetic variants.
  • There are \(N \gg 1\) individuals.
  • \(\phi\in\mathbb{R}^N\) is the vector of phenotypes.
  • \(X\in\mathbb{R}^{N\times M}\) is the genotype matrix, normalized to have columns with sample mean 0 and sample variance 1.
  • \(\beta\in\mathbb{R}^M\) is the vector of true SNP effect sizes.
  • \(X\beta \in\mathbb{R}^N\) is thus the vector of total genetic effects for each individual.
  • \(\epsilon\in\mathbb{R}^N\) is the vector of non-genetic effects for each individual.

Furthermore, we model \(X,\beta,\epsilon\) as random variables with the following properties:

\[ \begin{align} \mathbb{Var}(\epsilon)&= (1-h^2)I & \text{For some $0 < h^2 < 1$} \\ \mathbb{Var}(\beta)&=\frac{h^2}{M}I \label{betavar} \\ \mathbb{E}(\epsilon)&=0\\ \mathbb{E}(\beta)&=0 \\ \mathbb{E} X &=0 \\ \end{align} \]
  • The rows of \(X\) are independent and identically distributed.
  • The distributions of columns of \(X\) are "not too similar".
  • The SNP \(X_{i,j}\) may be highly correlated with a few other SNPs, but is uncorrelated with most SNPs.
  • \(\beta,\epsilon,X\) are all mutually independent.

Furthermore, define the following quantities related to Linkage Disequilibrium (LD).

  • The LD between SNP \(j\) and SNP \(k\) is denoted by \(r_{jk}:=\mathbb{E}X_{i,j}X_{i,k}\) (which does not depend on the individual \(i\) by our assumption that the rows of \(X\) are iid).
  • The empirical LD between \(j\) and \(k\) is defined to be \(\tilde{r}_{jk}:=\frac{1}{N}X_{:,j}^T X_{:,k}\).
  • The LD score for a SNP \(j\) is defined to be \(l_j:= \sum_k r_{jk}^2\).
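These definitions translate directly into code. Below is a minimal numpy sketch on simulated genotypes (a stand-in for a real reference panel); SNPs 0 and 1 are constructed to be in LD, so their LD scores come out larger than those of the independent SNPs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 6
X = rng.standard_normal((N, M))
# Put SNPs 0 and 1 in LD with correlation ~0.9.
X[:, 1] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9**2) * rng.standard_normal(N)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # columns: mean 0, variance 1

R_hat = X.T @ X / N                  # empirical LD matrix, entries r~_{jk}
ld_scores = (R_hat**2).sum(axis=1)   # l_j estimated as sum_k r~_{jk}^2
```

Note that the sum includes \(k=j\) (with \(\tilde r_{jj}=1\)), matching the definition above; as derived later on this page, these empirical sums are upward-biased by roughly \((M-l_j)/N\).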

Properties of \(\hat{\beta}_j\) and \(\chi^2_j\)

We begin by computing the variance matrix of the genetic effects \(X\beta\):

\[ \begin{align} &\mathbb{Var}(X\beta)\\ &=\mathbb{E} X \beta \beta^T X^T\\ &= \mathbb{E} (X \mathbb{E} ( \beta \beta^T |X ) X^T) & \text{Tower law of expectation} \\ &=\frac{h^2}{M} \mathbb{E} X X^T & \text{by (\ref{betavar})} \\ &=h^2I & \text{By independence of rows, zero mean of $X_{i,j}$} \end{align} \]

We can also compute the variance matrix of the phenotypes:

\[ \begin{align} &\mathbb{Var}(\phi)\\ &= \mathbb{Var}(X\beta) + \mathbb{Var}(\epsilon) & \text{ By independence}\\ &=h^2I + (1-h^2)I\\ &= I. \end{align} \]

Note that the heritability of the phenotype for an arbitrary individual \(i\) is:

\[ \begin{align} \frac{\mathbb{Var}(X\beta)_{i,i} }{\mathbb{Var}(\phi)_{i,i}}&= h^2 \end{align} \]

justifying the use of the symbol \(h^2\).

The expectation of the phenotype vector is:

\[ \begin{align} &\mathbb{E} (\phi) \\ &= \mathbb{E}(X \beta) + \mathbb{E} (\epsilon)\\ &= 0 & \text{Since $\mathbb{E}\beta =\mathbb{E} \epsilon=0$ } \end{align} \]

In a GWAS, it is typical to run a separate single-variant regression for each variant. Denote by \(\hat{\beta}_j\) the single-variant regression coefficient for SNP \(j\). This is given by:

\[ \begin{align} &\hat{\beta}_j \\ &= \frac{\frac{1}{N}(\phi-\overline{\phi})^T(X_{:,j} -\overline{X_{:,j}} )}{\frac{1}{N} (X_{:,j} -\overline{X_{:,j}} )^T (X_{:,j} -\overline{X_{:,j}} ) } & \text{ By OLS formula}\\ &= \frac{\frac{1}{N}(\phi-\overline{\phi})^TX_{:,j} }{ \frac{1}{N}X_{:,j}^TX_{:,j} } & \text{Since columns of $X$ are normalized}\\ &\approx \frac{\frac{1}{N}\phi^TX_{:,j} }{\frac{1}{N}X_{:,j}^TX_{:,j}} & \text{Since $\mathbb{E}\phi=0$ and $N$ is large}\\ &=\frac{1}{N}\phi^TX_{:,j} & \text{Since columns of $X$ are normalized} \end{align} \]
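The final line says each marginal GWAS coefficient is just a normalized inner product, so all \(M\) single-variant regressions collapse into one matrix-vector product. A quick sketch verifying this against the exact per-SNP OLS slope (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 1000, 5
X = rng.standard_normal((N, M))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalized genotype columns
phi = rng.standard_normal(N)
phi = phi - phi.mean()                      # centered phenotype

beta_hat = X.T @ phi / N   # all M marginal coefficients at once

# Exact OLS slope for SNP 0, for comparison.
x = X[:, 0]
exact = ((phi - phi.mean()) @ (x - x.mean())) / ((x - x.mean()) @ (x - x.mean()))
```

Because the columns (and here the phenotype) are exactly normalized, the two agree to floating-point precision; in the derivation above the agreement is only approximate, since \(\phi\) has mean zero in expectation rather than exactly.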

The squared regression errors of the \(j\)th GWAS regression are given by

\[ \begin{align} \lVert\phi- \hat{\beta}_jX_{:,j} \rVert^2 = \lVert\phi- \frac{1}{N}(\phi^T X_{:,j}) X_{:,j} \rVert^2. \end{align} \]

We have assumed that \(M\) is large, that columns of \(X\) are not too similar, and that the components of \(\beta\) are iid. Thus, GWAS regression on SNP \(j\) will not explain any significant proportion of the variance of the phenotype. So

\[ \begin{align} &\lVert\phi- \frac{1}{N}(\phi^T X_{:,j}) X_{:,j} \rVert^2\\ &\approx \lVert\phi\rVert^2 \\ &\approx \mathrm{trace}\left( \mathbb{Var} (\phi) \right)\\ &= N \label{residuals} \end{align} \]

The standard error of the \(j\)th GWAS regression coefficient is

\[ \begin{align} &\mathrm{SE}(\hat{\beta}_j) \\ &\approx \sqrt{\frac{\frac{1}{N}\lVert\phi- \frac{1}{N}(\phi^T X_{:,j}) X_{:,j} \rVert^2}{(X_{:,j} -\overline{X_{:,j}} )^T (X_{:,j} -\overline{X_{:,j}} )}} & \text{Formula for OLS SE}\\ &=\sqrt{\frac{1}{N}} & \text{(\ref{residuals}) + Normalization of $X$} \label{sebeta} \end{align} \]

By the definition of the Wald Test Statistic for the \(j\)th SNP, \(\chi^2_j\), we have

\[ \begin{align} &\chi_j^2\\ &= \frac{\hat{\beta}_j^2}{\mathrm{SE}(\hat{\beta}_j)^2}\\ &\approx N \hat{\beta}_j^2 & \text{by (\ref{sebeta})} \label{wald} \end{align} \]
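We can check numerically that \(N\hat{\beta}_j^2\) is close to the exact Wald statistic \(\hat{\beta}_j^2/\mathrm{SE}(\hat{\beta}_j)^2\) under the stated approximations (normalized genotypes, a SNP explaining little variance). A sketch on simulated null data:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()          # normalized genotype column
phi = rng.standard_normal(N)
phi = (phi - phi.mean()) / phi.std()  # normalized null phenotype

beta_hat = x @ phi / N
resid = phi - beta_hat * x
se2 = (resid @ resid / N) / (x @ x)   # squared OLS standard error, ~1/N
chi2_exact = beta_hat**2 / se2
chi2_approx = N * beta_hat**2         # the LDSC approximation
```

The two differ only by a factor of \(1-\hat{\beta}_j^2\) in the denominator, which is negligible when any one SNP explains a tiny fraction of phenotypic variance.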

Expectation of empirical LD scores

The expectation of the square of the empirical LD between SNPs \(j\) and \(k\) is:

\[ \begin{align} &\mathbb{E} \tilde{r}_{jk}^2\\ &=\mathbb{E}\frac{1}{N^2}\sum_i \sum_q X_{i,j}X_{i,k}X_{q,j}X_{q,k}\\ &=\frac{1}{N^2}\mathbb{E}\left(\sum_{i\ne q} X_{i,j}X_{i,k}X_{q,j}X_{q,k} + \sum_i X_{i,j}^2X_{i,k}^2\right)\\ &=\frac{1}{N^2}\left( \sum_{i\ne q} \mathbb{E}(X_{i,j} X_{i,k})\mathbb{E} (X_{q,j} X_{q,k}) + \sum_i \mathbb{E} X_{i,j}^2 X_{i,k}^2\right) &\text{Independence of rows of $X$}\\ &= \frac{N-1}{N}r_{jk}^2 + \frac{1}{N}\mathbb{E}X_{1j}^2X_{1k}^2 \end{align} \]

We write \(\mathbb{E}X_{1j}^2X_{1k}^2=1+2r_{jk}^2+\nu\) where we have used Isserlis's Theorem to compute the expectation of the product of the squares of two normal random variables, and then added error term \(\nu\) to account for the non-normality of \(X\).

Thus we have

\[ \begin{align} &\mathbb{E} \tilde{r}_{jk}^2 \\ &=\frac{N-1}{N}r_{jk}^2 +\frac{1}{N}(1+2r_{jk}^2+\nu)\\ &=r_{jk}^2 + \frac{1}{N}(1+r_{jk}^2+\nu)\\ &=r_{jk}^2+ \frac{1}{N}(2r_{jk}^2+\nu)+ \frac{1}{N}(1-r_{jk}^2) \end{align} \]

The expectation of the sum of the squares of the empirical LD scores with SNP j is given by

\[ \begin{align} &\mathbb{E} \sum_k \tilde{r}_{jk}^2\\ &=\sum_{k}r^2_{jk} + \sum_{k}\frac{1}{N}(2r_{jk}^2+\nu) + \sum_k \frac{1}{N}(1-r_{jk}^2)\\ &=l_j + \sum_{k}\frac{1}{N}(2r_{jk}^2+\nu) + \sum_k \frac{1}{N}(1-r_{jk}^2) & \text{Definition of $l_j$}\\ \end{align} \]

At this stage we introduce another approximation. We have assumed that SNP \(j\) is highly correlated with only a small number of other SNPs, so \(\sum_k r_{jk}^2 \ll M\). Thus the first \(O(1/N)\) term will be much smaller than the second. We therefore neglect it, yielding:

\[ \mathbb{E}\sum_k \tilde{r}_{jk}^2 \approx l_j + \frac{1}{N}(M-l_j) \]
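In practice this relationship is used in reverse: the naive sum \(\sum_k \tilde{r}_{jk}^2\) overestimates \(l_j\) by roughly \((M-l_j)/N\), and solving the display above for \(l_j\) gives a bias-adjusted estimator \(\hat{l}_j = (N\sum_k \tilde{r}_{jk}^2 - M)/(N-1)\). A sketch (simulated independent SNPs, so the true LD scores are all 1; this is in the spirit of, but not identical to, the adjusted-\(r^2\) used by the LDSC software):

```python
import numpy as np

def adjusted_ld_scores(X):
    """Bias-adjusted LD scores, solving E[sum_k r~_jk^2] ≈ l_j + (M - l_j)/N."""
    N, M = X.shape
    raw = ((X.T @ X / N) ** 2).sum(axis=1)  # naive empirical LD scores
    return (N * raw - M) / (N - 1)

rng = np.random.default_rng(4)
N, M = 2000, 50
X = rng.standard_normal((N, M))            # independent SNPs: true l_j = 1
X = (X - X.mean(axis=0)) / X.std(axis=0)
l_hat = adjusted_ld_scores(X)
# l_hat.mean() should be near 1, while the naive sums are inflated by ~(M-1)/N.
```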

The LDSC Equation

With the preliminaries out of the way, we now compute the expectation of the chi-squared statistic of the \(j\)th SNP:

\[ \begin{align} &\mathbb{E} (\chi^2_j)\\ &=N \mathbb{Var}(\hat{\beta}_j) & \text{by (\ref{wald}), since $\mathbb{E}\hat{\beta}_j=0$}\\ &=N \mathbb{E} \mathbb{Var}(\hat{\beta}_j | X) + \underbrace{\mathbb{Var}\mathbb{E}( \hat{\beta}_j | X )}_{=0} & \text{Law of Total Variance}\\ &= N \mathbb{E}( \mathbb{Var}( \frac{1}{N} \phi^T X_{:,j} |X) ) \label{varphi} \\ &= N \mathbb{E}( \mathbb{Var}( \frac{1}{N} (X\beta + \epsilon)^T X_{:,j} |X) ) \\ &= N \mathbb{E}( \mathbb{Var}( \frac{1}{N}( (X\beta)^TX_{:,j} + \epsilon^T X_{:,j}) |X) ) \\ &= N \mathbb{E}( \mathbb{Var}( \frac{1}{N}( X_{:,j}^T (X\beta)+ X_{:,j}^T\epsilon ) |X) ) \\ &= N \mathbb{E}( \frac{1}{N^2}( X_{:,j}^T X \mathbb{E}(\beta \beta^T|X) X^T X_{:,j} + X_{:,j}^T \mathbb{E}(\epsilon\epsilon^T)X_{:,j} ) ) \\ &= N \mathbb{E} ( \frac{h^2}{MN^2} X_{:,j}^T X X^T X_{:,j}+ \frac{1}{N} (1-h^2) )\\ &= N \mathbb{E} ( \frac{h^2}{M}\sum_k \tilde{r}^2_{jk} + \frac{1}{N} (1-h^2) ) &\text{def of }\tilde{r}^2_{jk} \label{hstep}\\ &= \frac{h^2}{M}(Nl_j + M-l_j) + 1 - h^2\\ &= \frac{h^2}{M}l_j(N-1)+1\\ &\approx \frac{h^2}{M}l_jN+1 \label{main_ld_eq} \\ \end{align} \]

This is the main Linkage Disequilibrium Score Regression equation.
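The main equation can be checked end to end by simulation under the model's assumptions. The sketch below builds an artificial block-LD genotype matrix (half the SNPs independent, half in correlated blocks, so LD scores actually vary), simulates a phenotype with \(h^2=0.5\), and regresses the Wald statistics on empirical LD scores; the recovered \(\hat{h}^2\) should land near 0.5, though any single small run is noisy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, h2 = 2000, 400, 0.5

# Half the SNPs are independent (l_j ≈ 1); the other half sit in blocks of
# four with pairwise correlation 0.8 (l_j ≈ 1 + 3 * 0.64), so LD scores vary.
X_indep = rng.standard_normal((N, M // 2))
Z = np.repeat(rng.standard_normal((N, M // 8)), 4, axis=1)
X_block = np.sqrt(0.8) * Z + np.sqrt(0.2) * rng.standard_normal((N, M // 2))
X = np.hstack([X_indep, X_block])
X = (X - X.mean(axis=0)) / X.std(axis=0)

beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)   # isotropic effect sizes
phi = X @ beta + rng.normal(0.0, np.sqrt(1 - h2), size=N)
phi = (phi - phi.mean()) / phi.std()

chi2 = N * (X.T @ phi / N) ** 2                   # Wald statistics
l_hat = ((X.T @ X / N) ** 2).sum(axis=1)          # empirical LD scores

slope = np.polyfit(l_hat, chi2, 1)[0]             # regress chi2 on LD score
h2_hat = slope * M / N
```

The real LDSC software adds refinements omitted in this sketch, including weighted regression, an intercept term that absorbs confounding, and block-jackknife standard errors.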

Intuition about derivation

The key steps in the LDSC derivation above are between equations (\(\ref{varphi}\)) and (\(\ref{hstep}\)). These steps relate the GWAS regression coefficients to a measure of linkage disequilibrium between SNPs. These steps are only possible because of our distributional assumptions on \(\beta\) and \(\epsilon\).

Critique of assumption

We saw above that the most important of LDSC's assumptions is that \(\mathbb{Var}(\beta)=\frac{h^2}{M} I\). This can be understood as an isotropic polygenicity assumption: it amounts to assuming that the heritability of a trait is distributed evenly across the genome, without correlation between SNP effects.

How plausible is this assumption?

Arguments that assumption is plausible

On the one hand, the discovery that many traits are highly polygenic has been one of the most important findings of the GWAS era.

The following figure from Shi et al.2 is a striking illustration:

*(Figure: shi_schz_chrom_heritability — estimated schizophrenia heritability per chromosome plotted against chromosome length, from Shi et al.)*

The authors applied HESS to schizophrenia, a prototypically polygenic disease. HESS estimates the proportion of heritability attributable to each chromosome. They found that the schizophrenia heritability of a chromosome was approximately proportional to the chromosome's length. This is consistent with uniform polygenicity.

See also this talk by Jonathan Pritchard, which proposes that most traits are "omnigenic", in the sense that they are controlled by vast numbers of small contributions spread across the genome.

So, in a rough sense, assuming that the heritability of a trait is distributed across the genome is not unreasonable.

Arguments that assumption is implausible

On the other hand, the assumption of perfectly uniform polygenicity strains plausibility. For most traits, heritability is concentrated in certain key regions. In autoimmune diseases, for example, heritability is typically concentrated around immunological regions, like the MHC/HLA region.

Extensions to make assumption more plausible

The issue of the implausibility of isotropic polygenicity is partially resolved by Stratified Linkage Disequilibrium Score Regression3 (S-LDSC), an extension proposed by the same authors who devised LDSC. S-LDSC allows the use of a pre-specified functional partitioning of the genome. While heritability is still assumed to be evenly distributed within a given partition, S-LDSC allows it to differ across partitions.

Population structure and LDSC

Besides being the most widely used method for the estimation of heritability from GWAS summary statistics, LDSC is also used in the detection of population structure. In the next section, I will explain this use case.

A model of a mixed population

Let's begin by constructing a model of a mixed population. For illustrative simplicity, we consider an equal mixture of two sub-populations. We modify the model described above as follows.

  • Let \(\mathcal{P}_1\) denote the random set of individuals in the first subpopulation, and let \(\mathcal{P}_2\) denote the random set of individuals in the second subpopulation. Since we are assuming an equal mixture, we have that for any individual \(i\), \(P(i\in \mathcal{P}_1)=P(i\in \mathcal{P}_2)=0.5\).
  • Let \(\mathcal{Q}\in\mathbb{R}^N\) be the random vector of assignments of individuals to the two subpopulations.
  • Let \(f\in\mathbb{R}^M\) denote the random vector of genotype means in subpopulation 1. Thus \(\mathbb{E}(X_{i,j}|f,i\in\mathcal{P}_1)=f_j\). Since we still assume that the population as a whole is normalized to genotype means of zero, this implies that \(\mathbb{E}(X_{i,j}|f,i\in\mathcal{P}_2)=-f_j\). Note also that unlike above, in the model of this section, the rows of the genotype matrix \(X\) are no longer unconditionally independent. This is because knowledge of one row of \(X\) informs us about \(f\), which informs us about other rows of \(X\). Conditioned on \(f\), however, the rows of the genotype matrix are still independent.
  • We assume that \(f\sim N(0, F_{ST}V)\), where \(F_{ST}\) is a constant, and \(V\) is a correlation matrix that is "close to diagonal", in the sense that there are no long-range correlations.
  • We assume that the correlation between \(X_{i,j}\) and \(X_{i,k}\) is constant across subpopulations and equal to \(r_{j,k}\). The authors of the LDSC paper justify this choice by saying that they are assuming that large subpopulation differences have already been appropriately removed by subtracting principal components, so that only small subpopulation differences remain.
  • As above, we assume that the unconditional variance of each entry of the genotype matrix is 1: \(\mathbb{Var}(X_{i,j})=1\).
  • We assume that for any individual \(i\) and any variant \(j\),
\[ \begin{align} \mathbb{Var}(X_{i,j}|f, \mathcal{Q})=1-f_j^2 \label{sub_var} \end{align} \]

Note that this implies that

\[ \begin{align} &\mathbb{Var}(X_{i,j}|f)\\ &= \mathbb{Var}( \mathbb{E}(X_{i,j}|f,\mathcal{Q}) |f ) + \mathbb{E}( \mathbb{Var}(X_{i,j}|f,\mathcal{Q}) |f ) & \text{ By law of total variance}\\ &= \mathbb{Var} (\pm f_j |f) + \mathbb{E} (1-f_j^2|f) & \text{by (\ref{sub_var})}\\ &= f_j^2 + 1-f_j^2\\ &= 1 \end{align} \]

and also

\[ \begin{align} &\mathbb{Var}(X_{i,j})\\ &= \mathbb{Var}( \mathbb{E}(X_{i,j}|f) ) + \mathbb{E}( \mathbb{Var}(X_{i,j}|f)) & \text{ By law of total variance}\\ &=0+1\\ &=1 \end{align} \]

So the variance of the genotypes in the subpopulations is consistent with the variance of the genotypes in the mixed population.

Our goal is to derive an analogue of (\(\ref{main_ld_eq}\)), the main LD score regression equation, for this case of an equal mixture of two subpopulations.

We next compute the conditional expectation of the product of two variants. (Strictly speaking, the within-subpopulation covariance of \(X_{i,j}\) and \(X_{i,k}\) is \(r_{j,k}\sqrt{(1-f_j^2)(1-f_k^2)}\); since \(F_{ST}\) is small, we approximate it by \(r_{j,k}\) below.)

\[ \begin{align} \mathbb{E}(X_{i,j}X_{i,k}|f)&=0.5 \mathbb{E}(X_{i,j}X_{i,k}|f, i\in\mathcal{P}_1) + 0.5 \mathbb{E}(X_{i,j}X_{i,k}|f, i\in\mathcal{P}_2)\\ &=0.5(r_{j,k} + f_{j}f_{k}) + 0.5(r_{j,k} + (-f_{j}) (-f_{k}) )\\ &=r_{j,k} + f_jf_k. \label{cond_prod_eq} \end{align} \]

We next compute \(\mathbb{E} \tilde{r}_{j,k}^2\), the expectation of the squared sample correlation between variant \(j\) and variant \(k\). This quantity will be key to computing LD scores.

We have

\[ \begin{align} &\mathbb{E} \tilde{r}_{j,k}^2\\ &=\mathbb{E} \mathbb{E} \left( \tilde{r}_{j,k}^2 |f \right) \text{ (by Tower Law)} \\ &= \mathbb{E} \mathbb{E} \left(\frac{1}{N^2}\sum_i \sum_q X_{i,j}X_{i,k}X_{q,j}X_{q,k}|f\right)\\ &=\frac{1}{N^2}\mathbb{E} \mathbb{E} \left(\sum_{i\ne q} X_{i,j}X_{i,k}X_{q,j}X_{q,k} + \sum_i X_{i,j}^2X_{i,k}^2|f\right)\\ &=\frac{1}{N^2}\mathbb{E}\left(\sum_{i\ne q}\mathbb{E}(X_{i,j}X_{i,k}|f)\mathbb{E}(X_{q,j}X_{q,k}|f)+ \sum_i \mathbb{E} (X_{i,j}^2 X_{i,k}^2|f) \right) \text{ (Conditional indep)}\\ &=\mathbb{E}\left( \frac{N-1}{N} (f_jf_k+r_{j,k})^2 +\frac{1}{N}\mathbb{E} (X_{i,j}^2 X_{i,k}^2|f) \right) \text{ (by (\ref{cond_prod_eq}) )}\\ &=\mathbb{E}\left( \frac{N-1}{N}\left(f_j^2 f_k^2+2r_{j,k}f_j f_k +r_{j,k}^2 \right) +\frac{1}{N}(1+2(f_jf_k+r_{j,k})^2+\nu) \right) \text{ (Isserlis theorem)}\\ \end{align} \]

To be continued


  1. Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, and Benjamin M Neale. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47(3):291–295, 2015. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC4495769/

  2. Huwenbo Shi, Gleb Kichaev, and Bogdan Pasaniuc. Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics, 99(1):139–153, 2016. URL: https://www.cell.com/ajhg/fulltext/S0002-9297(16)30148-3

  3. Hilary K Finucane, Brendan Bulik-Sullivan, Alexander Gusev, Gosia Trynka, Yakir Reshef, Po-Ru Loh, Verneri Anttila, Han Xu, Chongzhi Zang, Kyle Farh, and others. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 47(11):1228–1235, 2015. URL: https://www.nature.com/articles/ng.3404