Methods development

New technology and growth in the size of the typical data set bring both challenges and new opportunities for genetic analysis. We're developing novel statistical approaches to help us understand the contribution of genetics to the risk of a disease (heritability) and how much overlap there is in the genetic drivers of different conditions (genetic correlation). 

 
 
 

LD Score regression and extensions 

Methodological advances have been invaluable in increasing our knowledge of the genetic contribution to complex traits. Approaches that use GWAS summary statistics are particularly useful, as such data are readily available for a wide range of diseases and traits. One widely successful example is LD Score regression. The method hinges on an elegant relationship between marginal effect sizes and the smeared tagging effects of LD in the study population: the association statistic at each SNP absorbs the effects of the variants it tags, so for a polygenic trait, SNPs with higher LD Scores are expected to show stronger marginal association signals.
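The relationship above can be written as E[chi2_j] = N * h2 * l_j / M + N * a + 1, where l_j is the LD Score of SNP j, N the sample size, M the number of SNPs, and a a confounding term. The following is a minimal sketch of that idea on simulated data, not the published implementation: it uses plain unweighted least squares (the real method uses regression weights and a block jackknife for standard errors), and the function names, the uniform LD Score distribution, and the choice a = 0 are assumptions made for illustration.

```python
import numpy as np


def simulate_chi2(ld_scores, n_samples, n_snps, h2, rng):
    """Draw chi-square statistics whose mean follows the LD Score model.

    Assumes no confounding (a = 0), so E[chi2_j] = 1 + N * h2 * l_j / M.
    """
    expected = 1.0 + n_samples * h2 * ld_scores / n_snps
    z = rng.standard_normal(ld_scores.size) * np.sqrt(expected)
    return z ** 2


def ldsc_h2(chi2, ld_scores, n_samples, n_snps):
    """Unweighted regression of chi2 on LD Score.

    The slope estimates N * h2 / M, so h2 = slope * M / N.
    The intercept estimates 1 + N * a.
    """
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    return slope * n_snps / n_samples, intercept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ld = rng.uniform(1.0, 200.0, size=100_000)  # hypothetical LD Scores
    chi2 = simulate_chi2(ld, n_samples=50_000, n_snps=100_000, h2=0.4, rng=rng)
    h2_hat, intercept = ldsc_h2(chi2, ld, n_samples=50_000, n_snps=100_000)
    print(f"estimated h2 = {h2_hat:.3f}, intercept = {intercept:.3f}")
```

Because the slope carries the heritability signal and the intercept absorbs effects that do not scale with LD, the two can be separated from summary statistics alone.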

A number of extensions have already been developed, including estimating the correlation between the genetic architectures of different traits and partitioning the genome to determine the relative contribution of its constituent parts to variance in phenotype. Just as common SNPs in aggregate explain substantially more narrow-sense (additive) heritability than do genome-wide significant SNPs alone, one might hope to detect additional variance explained by aggregating nonlinear effects and interactions, even when one remains underpowered to detect such effects individually. We are actively working on a number of further extensions to the LD Score framework, including the incorporation of dominance, epistasis, and gene-by-environment interactions. Some of these extensions require additional summary statistics; we are also exploring extensions of other techniques that become available with full access to genotype data, such as the Haseman-Elston / Phenotype-correlation Genotype-correlation (PCGC) framework.

 
 
 

Genetic correlation

Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique, cross-trait LD Score regression, for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia and between anorexia nervosa and obesity, as well as associations between educational attainment and several diseases. These results highlight the power of genome-wide analyses, since there are currently no significantly associated SNPs for anorexia nervosa and only three for educational attainment.
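To illustrate why sample overlap does not bias the estimate, here is a minimal sketch on simulated z-scores. Under the cross-trait model, E[z1_j * z2_j] grows linearly with the LD Score l_j with a slope proportional to the genetic covariance, while sample overlap only shifts the intercept. The sketch uses unweighted least squares rather than the weighted regression and jackknife of the published method; all function names, the simulation design, and the parameter values are assumptions for illustration.

```python
import numpy as np


def ols(x, y):
    """Return (slope, intercept) of an unweighted regression of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept


def genetic_correlation(z1, z2, ld_scores, n1, n2, n_snps):
    """Estimate r_g from the slopes of three LD Score regressions."""
    slope12, intercept12 = ols(ld_scores, z1 * z2)
    slope1, _ = ols(ld_scores, z1 ** 2)
    slope2, _ = ols(ld_scores, z2 ** 2)
    rho_g = slope12 * n_snps / np.sqrt(n1 * n2)  # genetic covariance
    h2_1 = slope1 * n_snps / n1
    h2_2 = slope2 * n_snps / n2
    return rho_g / np.sqrt(h2_1 * h2_2), intercept12


def simulate_pair(ld_scores, n1, n2, n_snps, h2_1, h2_2, rg, overlap, rng):
    """Simulate z-scores for two traits.

    `overlap` is the cross-trait intercept induced by shared samples; it
    enters the sampling noise only, not the LD-dependent genetic signal.
    """
    rho_g = rg * np.sqrt(h2_1 * h2_2)
    cross = np.sqrt(n1 * n2) * rho_g
    gen_cov = np.array([[n1 * h2_1, cross], [cross, n2 * h2_2]]) / n_snps
    noise_cov = np.array([[1.0, overlap], [overlap, 1.0]])
    shape = (ld_scores.size, 2)
    gen = rng.standard_normal(shape) @ np.linalg.cholesky(gen_cov).T
    noise = rng.standard_normal(shape) @ np.linalg.cholesky(noise_cov).T
    z = np.sqrt(ld_scores)[:, None] * gen + noise
    return z[:, 0], z[:, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ld = rng.uniform(1.0, 200.0, size=100_000)  # hypothetical LD Scores
    for overlap in (0.0, 0.3):
        z1, z2 = simulate_pair(ld, 50_000, 50_000, 100_000,
                               0.4, 0.3, 0.5, overlap, rng)
        rg_hat, icpt = genetic_correlation(z1, z2, ld, 50_000, 50_000, 100_000)
        print(f"overlap={overlap}: r_g = {rg_hat:.3f}, intercept = {icpt:.3f}")
```

In this simulation the estimated genetic correlation stays close to the simulated value whether or not the two studies share samples, because the overlap term is captured by the intercept rather than the slope.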

 
 
 

Estimating true genetic effects unconfounded by LD

We are working on novel methods, complementary to LD Score regression, to probe and increase our understanding of the genetic architecture of complex traits. One approach that we are developing uses a Kalman filter based on Fisher's infinitesimal model. By exploiting the approximately banded nature of LD, we can construct a Kalman filter in which the true effect sizes are the hidden states. Such an approach can be used to estimate heritability, as well as the underlying true effect sizes under the model. These estimates, which account for the correlation structure imposed by LD, have a natural application in generating polygenic risk scores, which may explain a greater proportion of phenotypic variation when applied across human populations.

 
 
 

Universal control repository network

A major obstacle to the identification of genetic risk factors for common human diseases is the limited availability of large control samples. Genetic studies often restrict genotyping to cases as a cost-saving measure, which necessitates the use of external control resources. We propose a new tool to facilitate such discoveries: UNICORN (Universal Control Repository Network). UNICORN will enable the entire genetics community to perform association analyses for case collections, without the need for direct access to individual genotypes. UNICORN builds on the observation that genetically described ancestry is essentially a continuous space and that allele frequencies vary relatively smoothly within that space. Based on population ancestry, it is possible to predict what the frequency of a given allele would be in controls drawn from a similar population. With this statistical technique, we can generate well-matched allele frequency information for virtually any case sample, enabling genetic association analysis without access to the individual-level control data. UNICORN will facilitate case-control studies even if the study has characterized only case samples, thereby boosting power to discover risk variants.

We've developed a pilot of this project that demonstrates the utility of the approach, and are now working to expand its functionality so that it may be used by the wider community. 

 
 
 

Mendelian randomization

By examining the associations between genetic loci and multiple correlated traits, we can gain insight into which of these traits play a causal role in increasing risk for an associated disease.

Mendelian Randomization (MR) is a class of statistical methods that aim to provide unbiased estimates of the causal effect of an exposure on an outcome of interest. MR methods leverage the random assignment of genetic variants during meiosis to eliminate potential biases when estimating causal effects. However, MR methods rely on certain assumptions that may be violated in real data. Our group focuses on developing novel MR methods that extend the applicability of MR beyond the limits imposed by the assumptions of current methods.
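As a point of reference, the following minimal sketch shows the standard inverse-variance weighted (IVW) MR estimator computed from summary statistics. It is a generic textbook estimator, not one of the new methods developed by the group, and it is valid only when the MR assumptions hold (in particular, when variants affect the outcome only through the exposure). The function name and the simulated effect sizes are assumptions for illustration.

```python
import numpy as np


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted causal effect estimate.

    Equivalent to a weighted regression of SNP-outcome effects on
    SNP-exposure effects through the origin, with weights 1 / se_outcome**2.
    Returns the estimate and its (fixed-effect) standard error.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    weights = 1.0 / np.asarray(se_outcome, dtype=float) ** 2
    denom = np.sum(weights * bx ** 2)
    estimate = np.sum(weights * bx * by) / denom
    return estimate, np.sqrt(1.0 / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n_variants, true_effect = 200, 0.3
    bx = rng.uniform(0.05, 0.2, size=n_variants)   # hypothetical SNP-exposure effects
    se_y = np.full(n_variants, 0.01)
    # No pleiotropy: variants act on the outcome only through the exposure.
    by = true_effect * bx + rng.normal(0.0, se_y)
    est, se = ivw_estimate(bx, by, se_y)
    print(f"IVW estimate = {est:.3f} (SE {se:.3f})")
```

If some variants also influence the outcome through other pathways, the SNP-outcome effects no longer lie on a line through the origin and this estimator can be biased.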

Our latest work on MR methodology is the MR-PRESSO (Mendelian Randomization Pleiotropy RESidual Sum and Outlier) test. MR-PRESSO targets pleiotropic bias in MR analysis, which is caused by violation of the exclusion restriction assumption. MR-PRESSO provides a global test that can detect pleiotropic bias in MR, identify the source of the bias, and correct for it to provide unbiased causal effect estimates. We applied MR-PRESSO to 82 traits and 4,250 pairwise causal effect analyses. Our results showed that pleiotropic bias is widespread in pairwise MR analyses among commonly studied human traits and diseases. We also showed that MR-PRESSO can control for the pleiotropic bias in most of the biased analyses.

The bioRxiv preprint of our MR-PRESSO method can be found here, and the code to implement MR-PRESSO can be found here on GitHub.

 
 
 

RNA-seq normalization to compare expression across tissues and cell types

Expression profiling can be a powerful tool for examining the role that individual genes and pathways play in different cell and tissue types. However, comparing RNA-seq data across cell and tissue types is difficult because these measures of expression are not absolute, and the question of how best to make such comparisons remains open. We're developing a new approach, called Calibre, that uses a graph-building approach to normalize expression data.

 
 
 

Universal Control Repository Network (UNICORN)

A major obstacle to identifying genetic risk factors is the limited availability of large control samples for association studies. While data repositories offer many control resources, administrative and computational hurdles are substantial. Further, ancestries of controls may not sufficiently match those of cases, leading to spurious associations. We are developing the Universal Control Repository Network (UNICORN), aggregating ~250,000 controls from 23 collections, to provide a global control resource for ancestry-matched allele frequency estimates. UNICORN will enable the entire genetics community to perform well-powered association analyses for case collections, without the need for direct access to individual genotypes.

UNICORN estimates the population allele frequency of each variant conditional on ancestry coordinates. Using hierarchical clustering on PCA, a decision tree that groups samples with similar ancestry is learnt from the control set. Cases are assigned to clusters using the same decision rules, without genotype data changing hands. Finally, each case is paired with an ancestry-conditional allele frequency estimate. These estimates are then aggregated to yield a null distribution for the allele count. The association test then simply compares the allele count in cases with this null distribution.
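The final step described above can be sketched as follows. This is an illustration of the general idea only, not the UNICORN implementation: it assumes Hardy-Weinberg equilibrium within each ancestry cluster (so each case contributes a Binomial(2, p_i) allele count under the null), treats the frequency estimates as known without error, and uses a normal approximation for the summed count. The function name and the simulated frequencies are hypothetical.

```python
import math

import numpy as np


def allele_count_test(case_genotypes, null_freqs):
    """Compare the case allele count with an ancestry-matched null.

    case_genotypes: per-case allele counts (0, 1, or 2) at one variant.
    null_freqs: per-case ancestry-conditional control allele frequencies.
    Returns a z-score and a two-sided normal-approximation p-value.
    """
    genotypes = np.asarray(case_genotypes, dtype=float)
    freqs = np.asarray(null_freqs, dtype=float)
    observed = genotypes.sum()
    expected = 2.0 * freqs.sum()                      # sum of Binomial(2, p_i) means
    variance = 2.0 * np.sum(freqs * (1.0 - freqs))    # sum of Binomial(2, p_i) variances
    z = (observed - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # Two hypothetical ancestry clusters with different control frequencies.
    freqs = np.where(rng.random(5_000) < 0.5, 0.1, 0.4)
    null_cases = rng.binomial(2, freqs)
    risk_cases = rng.binomial(2, freqs + 0.05)  # allele enriched in cases
    print("null:", allele_count_test(null_cases, freqs))
    print("risk:", allele_count_test(risk_cases, freqs))
```

Because each case is compared against the frequency expected for its own ancestry cluster, a difference in ancestry composition between cases and the control pool does not by itself produce a signal in this sketch.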

Preliminary results demonstrate that UNICORN enables powerful and reliable genetic association analyses without access to individual-level control data [1]. We are currently streamlining the previous model to facilitate a scalable implementation. In the meantime, we are using simulations to assess the limitations of the model for lower allele frequencies, for imputed variants, and for individuals with less well-represented ancestry or recent admixture. We will also develop a web-based service for convenient use, which we believe can power the next round of discoveries of novel genetic associations. This work will be presented at ASHG 2017 in an invited trainee session.