Enhanced prediction of gene and missense rare-variant pathogenicity by joint analysis of gene burden and amino-acid residue position
Waring AJ., Harper AR., Salatino S., Kramer CM., Neubauer S., Thomson KL., Watkins H., Farrall M.
<jats:title>Abstract</jats:title><jats:p>Although rare missense variants underlying a number of Mendelian diseases have been noted to cluster in specific regions of proteins, this information may be underutilized when evaluating the pathogenicity of a gene or variant. We introduce <jats:italic>ClusterBurden</jats:italic> and <jats:italic>GAMs</jats:italic>, two methods for rapid association testing and predictive modelling, respectively, that combine variant burden and amino-acid residue clustering in case-control studies. We show that <jats:italic>ClusterBurden</jats:italic> increases statistical power to identify disease genes driven by missense variants, in both simulated data and an experimental 34-gene panel for hypertrophic cardiomyopathy. We then demonstrate that <jats:italic>GAMs</jats:italic> can be used to apply the ACMG criteria PM1 and PP3 quantitatively and to resolve a wide range of pathogenicity potential amongst variants of uncertain significance. An R package is available for association testing using <jats:italic>ClusterBurden</jats:italic>, and a web application (<jats:italic>Pathogenicity_by_Position</jats:italic>) is available for missense variant risk prediction using <jats:italic>GAMs</jats:italic> for six sarcomeric genes. In conclusion, the inclusion of amino-acid residue positional information enhances the accuracy of gene and rare-variant pathogenicity interpretation.</jats:p><jats:sec><jats:title>Author Summary</jats:title><jats:p>Two statistical methods have been developed that exploit the signal in the residue position of missense variants. The first is a rapid association method that tests the joint hypothesis of an excess of rare variants and rare-variant clustering. The method, <jats:italic>ClusterBurden</jats:italic>, is powerful when rare missense variants cluster in discrete pathogenic regions of the protein. It can be applied to exome scans to discover novel Mendelian disease genes that may not be identified by classic burden testing.
The second method is a statistical model for rare missense variant interpretation. Because it is trained on our large case-control dataset, it provides superior predictive performance compared with generic <jats:italic>in silico</jats:italic> predictors. The method represents a data-driven, quantitative approach to applying the hotspot and <jats:italic>in silico</jats:italic> prediction criteria from the ACMG variant interpretation guidelines.</jats:p></jats:sec>