Using Large Scale Genomic Databases to Improve Disease Variant Interpretation

So it’s my pleasure to
introduce Daniel MacArthur, who is from just down the street
and has been instrumental in creating an extremely
valuable resource that, based on the title of his
talk, I think he’s going to tell us about and probably how it’s
changing the world, so great.>> Cool, thank you, Jennifer, and thanks for the opportunity to be
here today; it’s a great pleasure to be here, of course. So Jennifer mentioned I
work just down the road at the Broad Institute. I have a lab, but I also spend a fair bit of time
over at Mass General Hospital. We have two focuses: one is on
rare disease diagnosis, which you’ve heard a lot about and
I’m not gonna talk about today. And the second,
as Jennifer mentioned, is thinking about ways in which
we can build up very large collections of genetic variation
from the general population. Primarily, with
the focus of using these as reference databases to
understand what variants do and which ones are most important
for causation of rare disease. And I think I wanted to
just start with generally framing a sense of why I’m so excited about human genetics
as we stand right now. I think it’s actually the most
exciting field that’s going on at the moment. And it’s the most exciting time
to be working in human genetics. And that’s because human
genetics provides us a way of moving back and forth between
two very different and powerful kinds of information. So we are becoming
increasingly good, as you’ve heard from
other speakers today, at layering information
across the genome. So here in this random
screenshot from the UCSC Genome Browser, we
have some chunk of the genome. And for
each base in the human genome, we have information about whether
that base is present in a gene or not. If so, where that gene
is actually expressed. If it’s in a non-coding region, whether that is also a
regulatory region that may play some role in regulating
gene expression. And this information is
increasing thanks to projects like ENCODE and
the Epigenomics Roadmap and GTEx and other resources. And then another type of
information that we as the broader biomedical community
have become very good at collecting is information
about human phenotypes. We now have millions of people
around the world who have been collected in large,
often national scale bio banks. And these people have
a whole variety of different bits of information
collected about them. And these can range from
information about particular diseases they’ve
been diagnosed with, which may be stored in a more or
less well formatted fashion. So all sorts of electronic
medical record data, prescription registers, quantitative trait data and
questionnaires. And then in many cases, a whole
bunch of interesting and rich biological or functional genomic
data, like transcriptomics or metabolomics, that can give us
some insight into the biology of those individuals. And human genetics is cool, in part, because it provides us
with a way of moving back and forth between these two
kinds of information. So the traditional, forward genetics way of
doing human genetics, is that we would collect people
on the basis of phenotype. So do something like what
Gill just talked about, identify certain families who
have some interesting and very rare disease,
group those people and then see what pieces of
DNA they have in common. That is what variation they have
in common, and use that to zoom in on particular genes that
may underlie their disease. This has been an incredibly
productive approach. We now know more than 4,000
genes associated with Mendelian diseases. And many thousands of regions in
the genome that are associated with common complex disorders
like type 2 diabetes, for instance. But as we start to build
up these big bio banks, and we start to sequence, that is
do DNA sequencing, on more of the people in those bio
banks, it becomes feasible to start thinking about moving in
the opposite direction as well. That is to find people and group them on the basis
of their genotype. So find, for instance, all
the people who have particular disruptive mutations in
a particular gene, group them on that basis and then look to see
what they have in common in terms of their traits, is what
we would call reverse genetics. It’s a very common approach to
doing things in model organisms where we can actually
genetically engineer particular mice to lack a specific gene,
for instance, and then see what happens to them
when we knock that gene out. Now, of course, in humans,
we can’t do genetic engineering. But what we can do is take
advantage of the enormous amount of genetic variation that exists
in the general population. And the amount of variation that
exists is genuinely staggering. You can calculate actually
relatively easily based on what we know about
the distribution of mutation rates across the genome
and how many people exist, that effectively any given
single-nucleotide substitution, that is, a change at a particular
position in the genome to an A, or a C, or
a G, almost certainly exists out there among the 7 billion people
that exist in the population.
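To make that back-of-the-envelope argument concrete, here is a rough version of the arithmetic in Python. The per-site mutation rate is a commonly quoted genome-wide average, and the numbers are order-of-magnitude assumptions rather than figures from the talk.

    # Rough expected number of living people carrying any one specific
    # single-nucleotide substitution, from new mutations alone.
    population = 7e9          # people alive today
    mu_per_site = 1.2e-8      # de novo mutations per site, per haploid genome, per generation
    # Either of a person's two copies, mutating to one specific alternate base:
    p_specific_change = 2 * mu_per_site / 3

    expected_de_novo_carriers = population * p_specific_change
    print(f"~{expected_de_novo_carriers:.0f} people expected to carry any given "
          "substitution from de novo mutation alone, before counting inherited copies")

So that means humanity in effect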
represents an enormous natural experiment where we have tried, certainly not the complete space
of human genomic sequence, but perturbations across many
different regions of the genome. And all we need to do to
understand exactly what effect each of those perturbations
have is sequence the genomes of all 7 billion people on
the face of the planet. Hook those into a big
electronic medical record system that allows
us to go back and recontact people based
on that information. And then systematically collect,
for instance, all the knockouts in
a particular gene and see what they have in common. Now, we are not,
as you guys know, anywhere close to being able to
sequence 7 billion genomes yet, although that will
happen eventually. But what we can do is to start
to accumulate information that has been collected across
now well over 2 million people who’ve had either their
whole genome sequenced or in many cases their exomes. That is a particular technology
that allows us to look just at the protein coding
bits of the genome. So this 2 million
number is a lower-bound, sort of ballpark estimate. These 2 million people have
been sequenced through a whole host of different academic and
commercial ventures and this information is being generated
in sites around the world, collected for a whole variety
of different projects. But just to give you an idea
about how fast that information is growing, curves like this I’m
sure have been presented earlier today, but this is information
just from one place the Broad Institute, again, a
couple of blocks down the road. You can see the growth in the
number of genomes that have been sequenced from 2009 to 2016. This information’s about,
I guess about 8 months, 10 months out of date. Over here we have the number
of exomes sequenced. And the two bits of information
you should be taking from these curves is firstly the number at
the top, which tells us we had at the end of last year
sequenced about 70,000 genomes and over a quarter of
a million human exomes. It’s a staggering
amount of data, but also the shape of this curve. In 2016 we sequenced more
genomes than in all years previously combined. And so the trajectory at which
we are generating this data is truly staggering. Now, ideally what we wanna do
is to take all these 2 million sequenced individuals and house them all together in one
big unified bio bank that would allow us to link between
genotype and phenotype, but this turns out to be difficult
for all sorts of reasons. There’s mundane obstacles
associated with moving data from one place to another. It turns out because these
samples have been collected over many decades, often well before
they were sequenced, the consent that they were collected under
is often inadequate to justify very large scale data sharing. And in fact, a lot of my life is
dealing with inadequate consent and data use permissions. There are objections to data
sharing, so much of this data has been generated in
an industry setting and that is often regarded
as a trade secret. You guys, I’m sure,
will be shocked to learn there are even some academics
that are not fully on board with the idea of sharing data,
so this can be challenging. And then a final problem that we
face is the fact that different centers who generate this
data tend to process and perform variant calling across
that data in idiosyncratic ways. So if we were to just take the
outputs of each of these different centers and try to
merge together this matrix of 2 million sets of genotypes, we would end up with a variant
call set that was dominated by technical differences
between centers, as opposed to interesting biological
differences between people, which of course is not useful. So to resolve at least some of
these problems we have been working I guess for about the
last four years on building up large reference databases
of human variation, leveraging the expertise and the
data that exists at the Broad. So when we started this exercise
back in 2012 there were two large public data sets
of human variation. We had the 1,000
Genomes Project, which contained about two and a half thousand low-coverage,
whole human genomes from a variety of different
populations. It’s a fantastic project. We also have the NHLBI’s Exome
Sequencing Project, which has generated exome sequencing from
about six and a half thousand individuals of either European
or African American descent. And so to cut a very long story
short, we began analyzing rare disease patients using these
resources and rapidly found they were too small, and in many
cases, too old to be good comparison data sets for
our rare disease patients. So we built, over a relatively
painful 18-month period, our first large call
set of human exomes. This is called the Exome
Aggregation Consortium, or ExAC data set,
that was released back in 2014. And that contained information
from just over 60,000 human exomes. And then almost to the day two
years later, in October last year, we released
an updated version of ExAC, this is now called the Genome
Aggregation database, or gnomAD. Because we like
slightly curious and idiosyncratically capitalized
names. And this now contains about
twice the number of human exomes and, for the first time in gnomAD, we’ve also included a whole
collection of human genomes. So providing information about
the distribution of variation across the noncoding, that is the non-protein
coding bits of the genome. So it’s important to note
that we do not sequence these samples. These data all come from
various studies, and I’ll list the PIs and where these
samples come from in a minute. From many different studies,
we aggregate that data and we put it through the same
processing pipeline. Then we do what we call joint variant calling across all
samples simultaneously. And that allows us to get
a harmonized call set across these large numbers of samples. And this is a testament
to the engineering and scaling work that’s been
done at the Broad Institute. So this is brief summary
of the pipeline. I’m not gonna go through
the details about how a variant calling pipeline works. I don’t think it’s hugely
relevant for this audience. I guess it is worth
emphasizing on this slide that we start with raw data. We go through an individual-sample
variant calling and genotyping
approach. We then do joint genotyping
across large numbers of samples. There’s a variant
recalibration step. And then we have
this final call set, this very big variant call file,
or VCF file, that stores all the data
across these samples.
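As a rough illustration of what those steps look like in practice, here is a minimal Python sketch that drives the standard GATK4-style tools: per-sample calling into GVCFs, consolidation, joint genotyping, then variant quality score recalibration. The file names, intervals, resources and thresholds are placeholders, and the real production pipeline is far more elaborate than this.

    # Minimal sketch of a GATK4-style joint-calling workflow (placeholder paths).
    import subprocess

    REF = "GRCh38.fasta"
    SAMPLES = ["sampleA", "sampleB"]   # in reality, tens of thousands of samples

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Per-sample variant calling into GVCFs.
    for s in SAMPLES:
        run(["gatk", "HaplotypeCaller", "-R", REF, "-I", f"{s}.bam",
             "-O", f"{s}.g.vcf.gz", "-ERC", "GVCF"])

    # 2. Consolidate the GVCFs, then joint-genotype all samples simultaneously.
    run(["gatk", "GenomicsDBImport", "--genomicsdb-workspace-path", "gvcf_db",
         "-L", "calling_regions.interval_list"]
        + [arg for s in SAMPLES for arg in ("-V", f"{s}.g.vcf.gz")])
    run(["gatk", "GenotypeGVCFs", "-R", REF, "-V", "gendb://gvcf_db",
         "-O", "joint_called.vcf.gz"])

    # 3. Variant quality score recalibration (resources and annotations abbreviated).
    run(["gatk", "VariantRecalibrator", "-R", REF, "-V", "joint_called.vcf.gz",
         "-mode", "SNP", "-an", "QD", "-an", "FS", "-an", "MQ",
         "--resource:hapmap,known=false,training=true,truth=true,prior=15.0",
         "hapmap_3.3.vcf.gz",
         "-O", "snps.recal", "--tranches-file", "snps.tranches"])
    run(["gatk", "ApplyVQSR", "-R", REF, "-V", "joint_called.vcf.gz",
         "-mode", "SNP", "--recal-file", "snps.recal",
         "--tranches-file", "snps.tranches",
         "--truth-sensitivity-filter-level", "99.5",
         "-O", "final_callset.vcf.gz"])

So the thing I did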
wanna mention, though, is this is a lot of data. So the raw data that’s gone into
this project over the exome data which is on the left-hand side,
and the genome data on the right-hand side is about three
petabytes of raw sequence data. And that condenses down to about
a 40 terabyte variant call file which is actually still a
staggeringly large file that we have to work with. Many people were involved
in generating and analyzing these data. And I wanted to
particularly call out the Broad Data Sciences Platform
for building the tools that allowed us to call
variants in this scale. But also incredibly
importantly a toolkit called Hail, which was
built at the Broad Institute, and Alex here in the room is a
representative of the Hail team. And Hail was absolutely critical
in being able to go from this 40 terabyte VCF and actually do important things
with it like quality control, and assess the variants, and release data in a useful
way to the public. And a huge thanks to all the members of the team
who contributed to this as well.
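To give a feel for what working with Hail at this scale looks like, the fragment below is a minimal sketch, not the actual gnomAD QC code: it loads a joint call set, computes per-sample and per-variant QC metrics, applies some illustrative filters, and writes a sites-only summary. The thresholds and file names are made up.

    # Minimal Hail sketch: basic QC on a joint call set, then a sites-only export.
    import hail as hl

    hl.init()
    mt = hl.import_vcf("joint_called.vcf.gz", force_bgz=True,
                       reference_genome="GRCh38")

    # Per-sample and per-variant QC annotations.
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)   # drop low-call-rate samples

    mt = hl.variant_qc(mt)                               # recompute on remaining samples
    mt = mt.filter_rows((mt.variant_qc.call_rate > 0.95) &
                        (mt.variant_qc.AC[1] > 0))       # drop poor / monomorphic sites

    # Sites-only table of allele counts and frequencies, suitable for release.
    sites = mt.rows()
    sites = sites.select(AC=sites.variant_qc.AC[1], AF=sites.variant_qc.AF[1])
    sites.export("sites_summary.tsv.bgz")

So it’s worth emphasizing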
who’s actually ended up in these databases. So we have, roughly, over
135,000 people across the exomes and the genomes who are
in the gnomAD database. These people are not
necessarily well. They come primarily from
case control studies of relatively common complex
adult onset diseases. The big hitters in terms of
the major diseases included in the database
are type 2 diabetes, early onset myocardial
infarction or heart attack, and neuropsychiatric conditions
like schizophrenia and bipolar. So about half the people are
controls for those studies and about half the people are cases. Now we have as best we can
depleted this resource of people known to have
severe pediatric disease. So the types of underlying
diseases that Gil talked about as well as their first
degree relatives. But of course some individuals
in this cohort will still be ill or will go on
to become ill later on through adult onset
rare disorders. However, we think that the
levels of severe disease within this population are probably
broadly comparable to the general
population as a whole. So this is a reasonable
reference data set if you’re interested in figuring out whether a
variant that is present in your rare disease patient is present
in the general population. And if so, what frequency it
has within that population. And I won’t go into detail
about some of the work we’re doing here. But about 40% of the samples
have some clinical data available. In fact, about a third of
the samples have consent that is consistent with recontact. So for
a subset of these individuals, we can actually go back. If we find interesting
genotypes, we can actually go back and see what sorts of phenotypes
they have, or at least get some limited information
about their clinical status. So I’m showing this slide purely
to indicate that this has been a project that has
involved an enormous number of data donors, that is,
principal investigators who have allowed their data to be
used in this process. I think at last count we’re up
to about 107 PIs who signed an MOU that allows their data
to be used in this project. So a huge thanks to them and,
of course, to our funders for
making this possible. So this is one of the first things we did
across the gnomAD data set. It’s a principal
component analysis, basically just showing the first
three principal components, which correspond,
broadly speaking, to geographical ancestry
across the gnomAD individuals. Here we’ve generated this based
on just over 50,000 well-called single
nucleotide variants. And we have inferred the
population labels using a random forest approach with a set of
about 40,000 people of so-called known ancestry.
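The sketch below shows, in simplified form, how that kind of ancestry assignment can be done: principal components computed in Hail, then a random forest trained on the samples with reported labels and applied to everyone else. The number of PCs, the confidence cutoff, and the file and column names are illustrative assumptions, not the actual gnomAD parameters.

    # Sketch of PCA-based ancestry inference with a random forest classifier.
    import hail as hl
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    mt = hl.read_matrix_table("well_called_snvs.mt")   # ~50,000 high-quality SNVs

    # First ten principal components from HWE-normalized genotypes.
    _, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)
    df = scores.to_pandas()
    pcs = pd.DataFrame(df["scores"].tolist(), index=df["s"],
                       columns=[f"PC{i + 1}" for i in range(10)])

    # Samples with a reported ancestry label form the training set.
    known = pd.read_csv("known_ancestry.tsv", sep="\t", index_col="s")["pop"]
    train = pcs.loc[pcs.index.intersection(known.index)]

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(train, known.loc[train.index])

    # Assign a population only when the classifier is reasonably confident.
    proba = rf.predict_proba(pcs)
    labels = pd.Series(rf.classes_[proba.argmax(axis=1)], index=pcs.index)
    labels[proba.max(axis=1) < 0.9] = "oth"   # low-confidence samples left as "other"

So what we can see in this plot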
is that we have a pretty diverse set of individuals who
have been sequenced here. A big cloud of individuals
of European ancestry, we have clusters corresponding
to Finnish ancestry and Ashkenazi Jewish. Big cloud up here of South Asian
individuals, and we also have clusters corresponding to
individuals of Latino ancestry, African, and African-American
as well as East Asian ancestry. So this is by no means a
comprehensive sampling of human genetic diversity. In fact we’re not even close. We’re missing, for instance, enormous swathes of
diversity in Africa. The Middle East is missing, and this is because of the
ascertainment biases that have gone into generating the samples
that make up this dataset. But what we do have is
thousands of individuals, for each of a set of relatively
major continental groups. And this is really
important if we wanna be able to assign relatively
confidently population frequency estimates for
variants that we find in rare disease patients
from those continents. And if you’re interested
in accessing the data, one of the key principles of
ExAC and gnomAD, actually right from the very beginning was that
we wanted this to be a resource, not just for us, but for anyone who wanted to look up
variants on their patients. So you can go to
gnomad.broadinstitute.org. You can look up
your favorite gene. I’m sure you guys all
have a favorite gene. When you do, you will find these
blue peaks that correspond to the coverage across the exomes. And the green line represents
the coverage across the genomes, for
that particular gene. Below that is a big long
table of variants that tells you every variant that we have
discovered within that gene across both the coding and
the noncoding regions. And you can click on those
to learn more about how common they are in
the population and how confident we are that
they’re actually real. Including for most variants
actually a snapshot of the raw read support, telling us that that variant
is actually a genuine variant. You can also download
a sites VCF that gives you a complete list of every
variant that we’ve called as well as its frequency
across populations. And that’s available for use in
any setting without embargo or any restrictions
whatsoever on use.
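For programmatic access, the downloaded sites files can be queried like any other indexed VCF; the short sketch below uses pysam to pull allele frequencies for one region. The file name is a placeholder and the coordinates are roughly the BRCA1 locus on GRCh37, purely as an example.

    # Query a bgzipped, tabix-indexed sites VCF for one genomic region.
    import pysam

    vcf = pysam.VariantFile("gnomad.exomes.sites.vcf.bgz")

    for rec in vcf.fetch("17", 41_196_312, 41_277_500):   # ~BRCA1 on GRCh37
        af = rec.info.get("AF", (0.0,))[0]                # global allele frequency
        filters = ",".join(rec.filter.keys()) or "PASS"
        print(rec.chrom, rec.pos, rec.ref, rec.alts[0], f"AF={af:.2e}", filters)

So I wanted to talk about some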
of scientific things that we can do once we have a collection
of 135,000 exomes and genomes. One of the things that we spent a lot of time thinking
about is ways that we can look at not just the variants that
are present in this data set. But also the variants that
are missing from the data set. In that sense, I mean, informatively
missing from the data set. And so to do this,
we worked very closely with Kaitlin Samocha, a graduate
student in Mark Daly’s lab. She had developed a statistical
model that allowed her to predict, actually with
quite high confidence, given known mutation rates, how
many variants of a particular functional class we should
expect to see in each gene in the genome in a collection
of 60,000 people, or 135,000 people. And so this allows us to say,
you can see in this plot here for a variant type like
synonymous variants, for instance, which are variants
that are found within coding regions, but don’t change the
amino acid sequence, how many synonymous variants we would
expect to see in each gene. That’s shown here on the x-axis
under Kaitlin’s model. Each dot here is a gene. And on the y-axis, we have the
number of variants of that type that were actually observed
in the 60,000 people in ExAC. And you can see that the fit for synonymous variants
is extremely good, an r-squared of about 0.98. And so that tells us that this model
is actually very well calibrated for variants that aren’t subject
to strong natural selection. But that relationship
begins to break down. In fact, it breaks down very
severely as we start looking at variants that do actually have
a functional impact on genes. So here we’re focusing
on variants that are, what we call,
loss-of-function variants. That is, they’re
predicted to break the function of
a protein-coding gene. And you can see here that the relationship
here is far from linear. Almost all the genes in
the genome have fewer loss-of-function variants than we
would expect to see by chance. And it is not surprising at all. What it tells us is that for
most genes in the genome, loss of function
variants are bad. And so as a result, they tend to be removed from the
population by natural selection. What’s cool about this, though,
is that with 60,000 people and even more so
with 135,000 people, we can actually say for many
of the genes in the genome how far below the line
does that gene fall. And therefore, how strong
is the action of natural selection that’s acting on
loss-of-function variants in that particular gene. So that really gives us a
measure of how critical that gene is and
how bad it is when you break it.
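A much-simplified version of that per-gene comparison is sketched below. It is not Kaitlin’s actual model (the published pLI score comes from a mixture model fit across all genes); it just illustrates comparing observed and expected counts, with made-up column names and thresholds.

    # Toy per-gene constraint: observed vs expected variant counts per class.
    import pandas as pd
    from scipy.stats import poisson

    # One row per gene, with observed counts and model-expected counts.
    genes = pd.read_csv("per_gene_counts.tsv", sep="\t")
    #   assumed columns: gene, obs_syn, exp_syn, obs_lof, exp_lof

    genes["syn_oe"] = genes["obs_syn"] / genes["exp_syn"]   # should sit near 1.0
    genes["lof_oe"] = genes["obs_lof"] / genes["exp_lof"]   # often well below 1.0

    # Probability of seeing this few (or fewer) LoF variants if the gene were
    # evolving neutrally, i.e. observed ~ Poisson(expected).
    genes["lof_depletion_p"] = poisson.cdf(genes["obs_lof"], genes["exp_lof"])

    # Crude stand-in for "LoF intolerant": strong depletion where there was power to see it.
    constrained = genes[(genes["lof_oe"] < 0.1) &
                        (genes["exp_lof"] >= 10) &
                        (genes["lof_depletion_p"] < 0.01)]
    print(len(constrained), "genes look strongly LoF-depleted under this crude cut")

So just to give you two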
examples, this is a gene called DYNC1H1 where we know exactly
what happens when you break it. Heterozygous missense and loss
of function variants in this gene result in some pretty nasty
neurodevelopmental disorders including seizure disorders and
intellectual disability. Kaitlin’s model is very
well calibrated for synonymous variants
in this gene. The gene is missing about
two-thirds of the missense variants and virtually all of
the loss-of-function variants that we would expect
to see by chance. We
would expect to see 161. In fact, we only see four. And of those four,
two of them are artifacts. One is a sequencing error. One is an annotation error. There are in fact two people
who appear to be carrying loss of function
variants in this gene. We don’t yet have updated
data about them and, in fact, they may actually suffer
from these disorders. So that’s when we know what
happens when you break the gene. This is a gene where we don’t
know what happens in humans. And this gene, UBL5, we know is
involved in the ubiquitin pathway. We know if you knock
it out in mice, it kills them during
embryogenesis. But there’s no known human
loss of function phenotype. But it has a profile that’s
extremely similar to DYNC1H1, telling us actually
with very high confidence that heterozygous loss of function
mutations in this gene do something, probably
something pretty nasty. Probably resulting in some kind
of relatively severe human disease phenotype. The fact that we don’t know what
that disease phenotype is yet could mean one of two things. We just haven’t sequenced
the right people yet or it could be that, in fact,
this results in embryo loss. So heterozygous loss of
function results in people dying before they can
actually be born. So overall we identified
more than 3,000 genes with a near-complete absence of loss-of-function variants
compared to expectation. We call these genes
with a high probability of loss-of-function intolerance, or
high pLI. And that includes almost
all of the known genes, where knocking out a single copy
is enough to cause disease. But importantly,
more than 70% of these genes, we have no idea what they do. There’s no associated
human phenotype. In fact, for most of them, we don’t know what
the molecular function is. And so we now prioritize
these genes extraordinarily heavily when we start looking
at disease phenotypes. And this has already proven
extremely useful in narrowing down on genes in autism,
schizophrenia, and a whole range of other very
nasty pediatric diseases. So the last thing I wanted to
talk about was an example of how we can use these large cohorts
to understand the precise impact of individual variants within
a gene on disease risk. This is work that was led
by Eric Minikel; it was a very personal project, for reasons
I’ll describe in a minute, and it happened while Eric
was an analyst in my lab. So Eric came to my lab extremely
motivated to work on a gene called PRNP, which encodes
the prion protein. And his motivation for
this was his wife, Sonia, who’s shown here. Sonia is shown with her mother, this photo was taken at the
beginning of 2010 when Sonia’s mother was completely healthy. Over the course of that year, she underwent a pretty
terrifying cognitive decline, went into frank dementia, and passed away at
the end of that year. So it was about a nine-month
period between the onset of symptoms and her dying. At the time that she died, it was completely unclear
what had actually killed her. But on an autopsy, it was found
that her brain was full of accumulations of a particular
protein called prion protein. This is, in some cases,
a genetic disease. So her prion protein
gene was sequenced. She was found to carry a
mutation that was known to cause, in a dominant way, a very
severe prion disease called Fatal Familial Insomnia. So Sonia had a 50% chance
of actually carrying that same mutation. She was then faced
with a choice of whether or not she wanted to
be tested herself. Her mother had died in her 50s. She was at the time
in her early 30s. She and Eric thought about it
and then decided to go ahead and get tested. And in early 2011,
Sonia learned that she too carries the same variant
that killed her mother. So as far as we know, everyone who carries this
particular variant has a 100% chance of going on to
die from prion disease. There’s many ways of responding
to this kind of incredibly catastrophic information. Eric and Sonia responded in
a way that I, still to this day, find inspiring. They quit their jobs. Eric at the time
was a town planner. Sonia was a lawyer,
a Harvard-trained lawyer. So they abandoned their
respective careers and retrained completely as
biomedical scientists. Initially with Sonia
training in the wet lab, and Eric focusing on
bioinformatics work and that’s what they continue doing to this
day, as I’ll show you later. So when Eric came to my lab,
what he wanted to know more than anything was an answer
to this question. Given that Sonia carries
this particular mutation, how likely is it that she will
go on to get prion disease. This turns out to be
a little bit uncertain, it is a question of what
we call penetrance. That is the probability that a
carrier of a particular disease genotype will actually go
on to manifest that disease. This is known to be
incomplete and variable for many disease causing mutations. So, in BRCA1 for instance,
we know that women who have variants in this gene will go on
to have somewhere between a 60 and 80% chance of
contracting breast cancer. But that has big error
bars associated with it, and not every woman who has a
BRCA1 mutation will go on to die from breast cancer. And in fact, for
most disease-causing variants, even ones that are relatively
well-established, we don’t actually have a good
quantitative estimate of what that penetrance is. And that’s also true for
prion disease. So prion diseases we
know are relatively rare, lifetime risk about 1 in 5,000. For 85% of cases, we have
no idea what causes them; they appear to be sporadic. But as I mentioned,
15% of these cases are genetic, and a notorious 1% of prion
disease cases are acquired. And they can be acquired through
ingestion of contaminated meat, for instance, as was the case for
the variant Creutzfeldt-Jakob disease outbreak in the UK, which
was caused by mad cow disease. The genetic cases are due
to a set of over sixty known dominant gain-of-function
mutations in the gene. And what’s amazing about
this disease is that we have incredibly good
surveillance data. Basically everyone who is
diagnosed with prion disease gets reported to
a surveillance center, and for most of those individuals,
the prion gene is sequenced. And so we have an almost
complete collection of genetic information from diagnosed
prion disease cases across the Western world. And so what Eric set about
doing was collecting a set of information that would allow
us to compare the frequency of these mutations in prion
disease cases versus controls. So we started with a set
of population controls and here we reasoned that completely
penetrant variants in aggregate for a dominant disease should be
no more common in the population than the disease
that they cause. So that means that we can run
analysis even in a set of samples like ExAC, where we
don’t know exactly whether people will go on to
get prion disease. But we can assume relatively
confidently that the frequency of prion disease causing
mutations in that population will be no higher
than the actual frequency of prion disease itself. So we collected our 60,000
ExAC exomes all of whom had effectively perfect coverage for
the prion gene. We were also able to work with
23andMe, to bring in information from about half a million
of their customers, all of whom have
been genotyped for 16 reportedly disease-causing
SNPs within that gene. And then in a series of
feats of pretty remarkable political jiu-jitsu, Eric was
able to collect over 10,500 cases of prion disease, who had
had the PRNP gene sequenced. This constitutes pretty
much every known or probable prion disease case that
has appeared in the US, Europe, Australia, and Japan for
the last 15 to 20 years. It’s the biggest collection of
rare disease cases that I have ever worked with. And probably the biggest
collection of cases that we’ll ever get for
something like prion disease. So this is a really remarkable
data set to do a comparison on. And so we can calculate pretty
easily how many individuals we would expect to see in our
60,000 ExAC individuals, who actually do have real variants
that will cause prion disease. And if you multiply these
slightly hand-wavy numbers out, you end up with a number that
is somewhere between one and two individuals who carry a
mutation that will go on to kill them through prion disease.
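The expected-carrier arithmetic is simple enough to write out; the sketch below just multiplies the rough numbers from the talk (60,706 is the full ExAC sample size).

    # Back-of-the-envelope expectation for pathogenic PRNP carriers in ExAC.
    n_exac = 60_706                 # ExAC individuals with good PRNP coverage
    lifetime_risk = 1 / 5_000       # approximate lifetime risk of prion disease
    genetic_fraction = 0.15         # share of prion disease cases that are genetic

    expected = n_exac * lifetime_risk * genetic_fraction
    observed = 52                   # ExAC carriers of reportedly pathogenic PRNP variants

    print(f"expected ~{expected:.1f} carriers, observed {observed}, "
          f"~{observed / expected:.0f}x enrichment")

So that’s the expected number; the actual observed number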
turns out to be much higher. So we found 16
different mutations. Sorry, sorry, I think it’s 12
different mutations in PRNP that had previously been reported
to be pathogenic and 52 individuals carried them. So that’s 30 times higher than
what we would expect to see, given the frequency of
the disease in the population. There’s multiple possible
explanations here. It could be that prion disease
is actually terrifyingly more common than we think it is. We’re pretty confident
that’s not the case. So by far, the most likely
explanation is that some of these previously
reported disease-causing mutations actually don’t
cause disease in everyone. Or perhaps, in some cases,
don’t cause disease in anyone. They’ve been falsely reported
as being pathogenic. So to explore that, we generated
a plot that looks like this, and as well as a bunch of statistical analysis that I’m
not gonna go through here. This plot is very simple,
basically what it shows on the x-axis is the number of
times we observed a particular reported prion disease-causing
mutation in our cases. So that basically tells us
how common it is in patients who went on to get the disease. And on the y-axis we have
the frequency of the same variants in ExAC, again a set of population
controls broadly representative of the general population. And the variants then fall
into three broad categories. Along the x-axis, we have
variants that are seen in cases, but are seen never or
very rarely in controls. These four variants along this
axis all turn out to be variants with incredibly strong evidence
for actually being pathogenic and disease-causing. And as far
as we can tell, every individual who has ever carried these
variants has indeed gone on to die from prion disease if they
survive into their fifties. And unfortunately that
does include D178N, which is the variant
that Sonia carries. On the y-axis, we have a set of
variants that are at relatively low frequency in cases. They’re also at
relatively low frequency, basically the same
frequency in controls. So there’s no information here
whatsoever to indicate that these variants are actually
associated with prion disease. And in fact certainly for these
variants that cluster up here, we can very confidently rule
them out as having any effect whatsoever on prion
disease risk at all. So these are variants that have
accidentally been associated with prion disease,
because they happened to be in an individual who died from
sporadic prion disease, and as a result they ended
up in the database. And we can now in most
cases throw those away. And then there’s
a set of variants, really interesting
variants in the middle, who are too common in cases
to be completely benign, but also too common in controls
to be fully pathogenic. And these are variants that
clearly are a product of incomplete penetrance, that is, they cause disease in some
people but not in others. And we can calculate with
some confidence the actual, absolute lifetime risk
of prion disease for people who carry each
of these mutations.
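The calculation behind those estimates is essentially Bayes’ rule: the lifetime risk for carriers is the baseline lifetime risk scaled by how much more common the variant is in cases than in population controls. The sketch below shows the idea with made-up allele counts, not the published values.

    # Approximate absolute lifetime risk for carriers of a variant:
    #   P(disease | variant) ~= baseline_risk * AF_cases / AF_controls
    baseline_risk = 1 / 5_000       # rough lifetime risk of prion disease

    def lifetime_risk(case_ac, case_n, control_ac, control_n, baseline=baseline_risk):
        af_cases = case_ac / (2 * case_n)           # allele frequency among sequenced cases
        af_controls = control_ac / (2 * control_n)  # allele frequency among controls
        return baseline * af_cases / af_controls

    # Hypothetical variant: 10 of ~10,500 cases, 25 of ~500,000 controls.
    print(f"{lifetime_risk(10, 10_500, 25, 500_000):.2%}")   # ~0.4% in this toy example

And that ranges enormously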
from a 1 in 1,000 risk for this variant up here, but
still confidently above 0, up to about a 10% risk for
this variant down here. That has profound implications
for genetic counseling. For all three of these variants,
there was information in the literature that suggested that
there was incomplete penetrance, but we had no idea what these
risk estimates actually were. And now people who are working
with families affected by these mutations can go back and give them a relatively well
calibrated estimate of how likely it is that another
family member who carries those mutations will go on
to get the disease. Just to give you an example
about how this information can change lives, there was an opinion piece that
accompanied the publication of this prion disease work
back earlier last year. And in that, Robert Green,
a clinician at Brigham and Women’s, describes a patient
that he was seeing at the time. This patient’s mother had died,
like Sonia’s mother had died, of prion disease. It was unknown what
the cause was. This patient had her
prion gene sequenced and was found to carry
a mutation called E196A. And she had been told as
a result that it was very likely she would go on to die from
prion disease in her 50s. It turns out that E196A is the
mutation for which we are by far the most confident that this has
absolutely no association with prion disease. This is a completely
accidental finding. And Robert was able
to return that data back to this woman. He was also able to say that,
looking back through the literature,
there is no information on segregation of this variant
within prion disease families. There is no family
history in patients who carry that particular variant. And the frequency is absolutely
no higher in cases than it is in controls. So her probability of going
on to get prion disease is no higher than
the population average. So basically, Robert was able
to commute a death sentence in this particular case. And that, of course, can have transformative
impacts in other ways as well. So this information did not
change, as far as we know, Sonia’s risk of going on
to get prion disease. But I think it did, in many
ways, reestablish Eric and Sonia’s commitment to working
as hard as they can to come up with a therapeutic approach to
reducing the levels of prion protein in Sonia’s brain, to the level where it’s actually
possible to ameliorate her risk of going on to get
prion disease. In this case, the therapeutic target is
actually extremely clear. If it’s possible to get
a chemical agent into Sonia’s brain that reduces the overall
levels of native prion protein, it’s almost certain that she will be able to
reduce that risk. And Eric and Sonia are now working at
a lab at the Broad Institute. They’re PhD students at Harvard,
but amazingly have their own
lab at the Broad Institute. They’re working in collaboration
with a number of industry partners to produce compounds
that will hopefully be able to produce exactly that effect over
the next 10 to 15 years, which is the time that Sonia has left
to come up with such a therapy. And of course, we all wish
them the best in that quest. So we have a lot of work to do
to be able to expand gnomAD to be able to answer
similar questions for a whole range of
other disease genes. We’re very interested in
continuing to expand this. We will be expanding this to over 65,000 whole human
genomes with a new call set that will hopefully
be generated later this year, calling structural variants
across the entire dataset. And a set of analyses focused
on protein-truncating variants, as well as looking at constraint
in non-coding regions. And I think the key points
that it’s worth emphasizing in the context of this presentation
are, firstly, that naturally occurring human genetic
variation is really useful for understanding both the impact of
variation in a disease setting, but also in terms of
understanding human biology. That the more humans
we have sequenced and phenotyped, and the more diverse
their ancestry, the better off we are in terms of
understanding this disease risk. That everyone carries
rare genetic variants. And it’s true that most
interesting variants are rare, but it’s also importantly true
that most rare variants are not actually particularly
interesting. And these large scale
reference databases help to distinguish between
those two scenarios. And finally, one of the things
that I think has been a real strength of ExAC and
gnomAD has been, we’ve been very fortunate to
benefit from a human genomics community that’s been willing to
share data openly and rapidly. But there’s still a lot of work
that we as a community can do to make that process better. And it’s really only
with that open and rapid data sharing that we can
build resources like this and make them available
to the world. So with that, I’ll finish by thanking
everyone who’s been involved in generating and analyzing
this important resource. Again, including the analysis
team, production team at Broad, the individuals who built the
gnomAD website, and once again, the genomics platform and
the Hail team for all of the critical work
in building this data out. Thanks to you guys for
listening.>>[APPLAUSE]
