Database and Data Analytic Systems

>>Welcome to the Database and Data Analytics session. This is of course
a very important area, but not an entirely new area. But today, we have
three excellent speakers and they’re
going to talk about some of
the very new directions that the field is taking. So, one common trait among
them is the intersection of this field with some of the techniques in
Machine Learning. The first two talks
that we are going to hear are going to be around leveraging some of these
Deep Analytic techniques for some of the classical database
operations and challenges. The last talk is going
to focus on how Data Analytics platforms,
which have, of course, traditionally
been used to understand
what the data is and what the insights in the data are, can more easily adapt to the challenges of doing Machine Learning on
these platforms. So, with these three talks, you’re going to see
both dimensions of the work that’s going on in
the field which is quite new. So with that, we’re
going to start with the first speaker
who is Andy Pavlo. Professor Andy
Pavlo is at CMU, where he is the leader of the Database Group and the Parallel Database
Laboratory there. He’s a winner of the SIGMOD 2014 Jim Gray Dissertation Award, as well as the Sloan Research
Fellowship Award for 2018 and ACM SIGMOD
Best Paper Award. In addition to
doing excellent and innovative work in
database systems, Andy is one of the most
entertaining speakers, and he has promised
to kick off this session on a high note. With that, Andy, take it away.>>I promised you that I
wouldn’t get you fired, that’s what I promised you.>>That promise is also very important and my manager
is in the audience, so I have to be
particularly careful.>>Right, okay. All
right. Thanks, Surajit. So, I want to talk about
research we have been doing at Carnegie Mellon in
the context of how to build autonomous or
self-driving database systems. So, by autonomous, I mean a database that’s
able to configure, tune and optimize itself automatically without
any human intervention. So, there’s another talk
I could give about the Machine Learning methods and algorithms we use to
make this happen. Instead, in this talk,
I want to talk about the things you need
to have when you’re building your database system in order to make it amenable to being tuned by an autonomous component
or planning framework. So, a way to think about
this talk is: these are the engineering
problems we have to overcome in order for the
Machine Learning to be able to take control of the system. It doesn't matter how great your Machine
Learning algorithms are, how much training data you have, if you actually cannot expose
the right interfaces or API to those planning
frameworks and then take the actions that
they suggest and apply them, then all of it is for nothing. So, today’s talk, I’m going
to break it into three parts. So first, let’s talk about some background of what
work’s been done in the context of
autonomous databases for the last 40 years. Then, we'll talk about more of the engineering side of things, again, how to support
autonomous operation. Then, what kind of
database talk would I give if we didn’t finish off
with a rant about Oracle? So, we'll finish up
with that. So, the idea of autonomous
databases is almost as old as databases themselves. When the first relational databases came out in the 1970s, people immediately
recognized that there was a need for these
systems to be able to adapt themselves and optimize
the performance of storage and indexing without having a human
tell it everything. So, back then, in the 1970s, this line of work was called
self-adaptive systems. It did things like
picking indexes, which is probably
the most common one. So, at a high level,
these early self-adaptive tools worked a lot like
they work today, where you have some human prepare a workload trace
of previous queries that the application has executed, and then you feed them
into your tuning algorithm, which is going
to crunch on them and compute some internal statistics about the problem
you’re trying to solve. So, if you’re trying
to pick indexes, you’ll figure out what columns are accessing those
often in queries. Then, you use
this internal cost model to evaluate a bunch of candidate actions or
candidate indexes and then weight them accordingly. Then the tool would spit out a recommendation to
a human who then had to make the final
decision about how and when to apply
it to the database. So, at a high level again, all of the self-adaptive tools work essentially the same way. They handle things
like index selection, data partitioning, sharding
and data placement. So, to just give you an idea
how old this problem is, there’s a paper in 1976 on doing self-adaptive
systems for index selection. This paper was
actually written by my advisor’s advisor
who is now dead. So again, people
have been thinking about this for a long time. Now, the next major chapter in
autonomous systems came along in the late
1990s and 2000s. This was the era of what I'll call self-tuning databases. Again, at a high level, all these tools are essentially doing
the same thing. You take some workload
trace, crunch on them and they spit
out recommendations. So, all the major
vendors at this time had their own proprietary tools to solve these types of problems. I’m not just saying this because
we’re here at Microsoft, but it’s my opinion that the AutoAdmin project by
Microsoft was at the vanguard, really at the forefront of this. There’s a lot of great groundbreaking work
that we’re actually building on today that
came out of this project. The seminal paper would be
this 2007 paper from Surajit. It talks about the decade's worth of work they did on this project. So, the other thing we saw in the self-tuning
systems around this time was the need to
do knob configuration. A knob is a configuration parameter
you can set in the system to control the runtime behavior
of the database, for example, how much memory to
use for your buffer pool. The reason why there was now a need for automatic tools
to be able to configure these things is
because the number of knobs that these
systems actually supported kept increasing and things were getting more complex. To give you an idea
how bad it is, one of my students went and took two major open-source
databases MySQL and Postgres and went
back 15 years and looked at all the
different releases and counted the number of
knobs that they had. In over a 15-year period, Postgres increased
the number of knobs by five x and MySQL increased the number of knobs
they had by seven x. At this point, this is well beyond what any human
can reason about. So, there’s automated tools that are needed to
make this happen. So, now, in
the 2010s, we entered the era of Cloud-managed databases. This is exactly what we
just saw in the keynote, where it is not so
much about how to tune individual databases; now,
at the service-provider level, it is essentially
a bin-packing problem: how to place tenants on the various
hardware that you have available in order to maximize the performance by minimizing
your hardware costs, and you may need to actually
migrate things over time. So, this is again, this is doing
autonomous databases, but at the service
provider level, or the operator level, where you're not actually tuning
individual databases, you’re just looking at
thousands and thousands of tenants and trying to decide what’s the best way
to lay them out. So, now, given that I just
spent the beginning of this talk talking
about the last 40 years of autonomous databases, why is all this work that
people have done before insufficient for achieving a fully
autonomous system? I’ll say there’s three
reasons why this is not entirely what we need. So, the first is that a lot of these tools still
require a human to make a final judgment about the recommendations
or suggestions from these tuning tools. So, a human had to
look and say well, “It’s asked me to
build this index. Is that actually the right
thing for me to do?” Then the human actually
had to then decide when it was the right time to deploy that suggestion and
how should they deploy it. So they had to know
that three AM on a Sunday is when I have
the lowest amount of demand, so that’s when I go ahead
and start optimizing things. The next problem is
that these are all reactionary measures,
meaning they look at
the workload trace from the past, identify problems, and try to craft suggestions to
avoid them in the future. They're not looking down the road at various trends
and saying, “Well, this is what my workload
is going to look like a week from now
or a month from now”, and prepare the
system accordingly. This is essentially what
humans are doing now. They do capacity planning
based on what they think that the workload is going to look like in the future. The last problem is
that they’re not, not be able to transfer any of the knowledge
that they learned from tuning one particular database and apply it to another database. So what that means is that,
say I run my tuning tool on my single database instance
and then I come along with another application I want to tune the database for. Unless that application has exactly the same workload
on the exact same hardware, you can’t reuse anything you learn from the first
time you ran it. So, these are the
reasons why I think all the existing work is not enough in order to have
a fully autonomous system. So, now you may be
asking why am I so confident that it’s going
to be different this time. So, I would say
that the reason why I’m optimistic about
our ability to have a fully autonomous
system is essentially the same story why Machine
Learning is hot now. We have the storage capacity
to actually store a lot of training data, we have better frameworks like
Torch and TensorFlow to crunch on it,
and we have hardware accelerators like
GPUs and CPUs available to us, to be able to take large amounts
of this training data and derive the
models that we need. So, this is why I’m
bullish on this area, and I think that there’s a lot
of interesting techniques from Machine Learning and
the AI community that we can apply to our databases
to essentially complete the circle from what people
have done in the past. So, to give you an idea what some of these
projects look like, I want to talk about
two things we’ve been building at Carnegie Mellon
for the last couple of years. The first is this thing
called OtterTune, which is a knob configuration
tuning service. This is designed to work on existing systems like SQL
Server, MySQL, or Postgres, treating them as a black box, and trying to figure out
what information we can derive from them to help
us tune their knobs better. Then we have another database
has been called Peloton, we’ll be burning
this from scratch, where the idea is to take a clean-slate approach
at designing a database system
in order to make it be managed by
an autonomous system. So, I’ll briefly go over
each of these one by one. So, as I said, OtterTune is a database knob
Tuning-as-a-Service. The idea is, you come along with your database installation, connect it to the service, and upload some metrics; we crunch on them and spit back a knob recommendation; then we apply that, observe the changes, and see
whether that helps. With this feedback loop, you get better and better. The key difference
between OtterTune and all the previous work is that we can actually
re-use the information, or the data we’ve collected from previous tuning sessions to help speed up tuning
newer sessions. So, that means if
you come along with an application that we've
never seen before, we collect some metrics, and we see that
it actually looks very similar to some other application we tuned in the past; then, since we knew how
to tune that one, now we know how to tune yours. So, just as a quick show of
what performance we can get, so we did a simple
experiment where we took MySQL and Postgres and ran the TPC-C benchmark on
a small instance on Amazon. We're going to compare
what OtterTune can do against the default
configuration you get for these systems, versus some open-source tuning scripts, versus the configuration you get from Amazon when you run on RDS. Then lastly, we're also
going to compare it against very expensive human DBAs tuning these two
systems manually. So, in the case of MySQL
what you see is that OtterTune and the DBA actually performed better than
all the other approaches. Now, the reason why the DBA
actually beats us here, is because this is the top
MySQL DBA from Facebook, and there’s some knob
in MySQL that affects whether you flush every right when each
commit a transaction, or how aggressive
you flush things. At Facebook, they were
okay with turning that feature off so that it allows you get a little bit
better performance. This is actually
an important point when we talk about
the engineering side, because a judgment
actually has to be made by a human about whether it's okay to reduce the durability
of your transactional data. So, we blacklisted
this knob from OtterTune. So, we purposely said it’s
not allowed to tune this, and then that’s why the DBA
was actually able to beat us. In the case of Postgres, we see that OtterTune
actually does better than all the
other approaches, and this is because it's able
to find the sweet spot for this particular workload and hardware configuration when
tuning the buffer pool size and the log file size. Again, we're up against
a very seasoned DBA, and we are able to beat them. Again, the main thing here is that OtterTune is doing as well as, if not better than,
very expensive human DBAs, with only training
for about an hour. Now, the next project is Peloton, and we're pitching this as a self-driving database system, meaning that we want
this thing to be entirely autonomous. So, we’re building this
from the ground up to be entirely controlled by this self-driving framework
we’ve been building. The idea here is
that we want to see, what advantage do we have when we have complete control
of the entire system, and how that allows us to do things that we cannot easily
do with OtterTune. What I’ll say, is that we
actually considered using MySQL or Postgres
initially for this project, but what we found is
that there are just way too many things that require
the system to restart, plus other problems
that make them not amenable to
self-driving operation. So, we decided to bite the bullet and write a system from scratch. There’s a whole other talk I can give about the trials and tribulations of building
a database system from scratch in academia, which has been quite a journey. So now, I want to talk about what I'm calling design considerations for
autonomous operation. So again, the idea here is: how should you build
a database system so that it can be
controlled by a planning framework that wants to run entirely
autonomously? For this, the recurring theme that
we see over and over again, is that these
engineering decisions we can make are going to help us reduce the complexity of the solution space for our autonomous
planning components. Because, otherwise
we need a lot of training data in order
to train these models, and it’ll take
a long time for them to converge to a good solution. So, there’s some tricks we can do to reduce the number
of choices we have to consider and maximize the reuse of the training data that
we want to build upon, and that allows us to get better optimizations
more quickly. So, I’ll first start talking about configuration
knobs and metrics, so the idea is here’s
what API or data you expose to the planning framework, and then we’ll talk
about how do you take the actions that
the planning framework suggests, and apply them in
an efficient manner without slowing down the rest
of the system. All right. So, the most important thing you have to have for
your configuration knobs is the ability to mark which
ones should or should not be tuned by your
autonomous components. We saw this before when we
talked about the Facebook DBA. There’s some flag that Facebook allowed
their DBA’s to set, that we decided that
we weren’t going to set in our autonomous
components because a human had to make
a final value judgment about whether it was
okay to do that. So, some things are obvious, like file paths, network
addresses, and port numbers; if you don't set
these things, then the system doesn't boot properly at all. But things like durability
and isolation levels, these are actually judgments
that have to be made by the company or organization about whether that is
an okay thing to do. Is it okay to turn off fsync when you commit a transaction? Are you okay with losing the last 10 to 20 milliseconds'
worth of transaction data? In some places that's okay; in some places that's not okay. But the machine learning algorithms aren't
going to do that. There’s some other more nuance
things like harbor usage, like how aggressive
you want to be doing compaction if
you have an LSM. Because this will cause you
to wear down the device more quickly and have
to buy new hardware. Recovery time is another one, like how long are you
willing to let the system take to recover
if there is a crash? We can't control
these things, or we can't know these things, in our Machine Learning
models. So, we have to have a human be able to tell us, and therefore we need to blacklist them. The next one is
that we now need hints about how we should
actually tune these knobs. Again, this is necessary
to reduce the search space, or solution space, of the actions that we could apply
in our system. So, the most obvious one, is our min-max ranges
for a parameter: how much memory should I be able to allocate to a buffer pool? Another thing we
see is that any time you have a knob that
can be disabled, meaning turned on and off, you should always
separate that flag from the actual knob that
controls its behavior. So, sometimes we see
knobs where you can set how fast they're going to write data to the disk in
terms of kilobytes. But a lot of times, you'll
use a special value like negative one or zero to
disable that feature. The problem is, the
Machine Learning algorithm will find that setting the amount of data
you're writing out to the disk to zero kilobytes
makes it go faster. Of course, that now
means you’re not writing data to disk, and you’re
going to lose data. So, if you just have a separate
boolean flag to control it separately from
the actual parameter, that makes things easier. Another more nuanced one
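As a minimal sketch of that separate-flag idea (the knob names and bounds here are made up for illustration, not any real system's configuration API):

```python
# Hypothetical knob definitions: instead of one rate knob where a magic
# value like -1 or 0 means "feature disabled", expose an explicit on/off
# flag plus a bounded numeric knob whose range has no disable sentinel.
from dataclasses import dataclass

@dataclass
class BoolKnob:
    name: str
    value: bool          # the on/off decision a human may blacklist

@dataclass
class NumericKnob:
    name: str
    min_value: int       # lower bound the tuner may try
    max_value: int       # upper bound the tuner may try
    value: int

# Risky design: setting the rate to 0 silently disables flushing.
risky = NumericKnob("flush_rate_kb", min_value=0, max_value=1 << 20, value=4096)

# Safer design: disabling is its own knob, so the tuner can be told never
# to touch it, and 0 KB/s is simply not a legal rate for the numeric knob.
flush_enabled = BoolKnob("flush_enabled", value=True)
flush_rate = NumericKnob("flush_rate_kb", min_value=64, max_value=1 << 20, value=4096)
```

With the safer design, the tuning algorithm can explore the whole numeric range without ever accidentally turning durability off.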
is the ability to control how you actually
modify the values of knobs. So, in most cases, in the systems we've seen, you can set absolute values
for a certain parameter, like how much memory you want to use for your buffer pool; you can set it from 0 to 32. We think the better way
to do this is to just have increments that
are based on deltas that go up and down
for these parameters, and that cuts down the number of actions you have to consider. But now the problem is, if you have a large range
of values you could choose, then incrementing
by a fixed amount throughout that entire range
causes problems. So, what you really want is
non-uniform deltas like this. So, say I want to
set the amount of memory I'm using in
my buffer pool: if I'm between one kilobyte
and one megabyte, maybe just increment by
100 kilobytes at a time. But if I'm up at
one terabyte of DRAM, I don't want to increment
by one kilobyte, because there's not going
to be a big difference from one setting to the next. So, you can expose the deltas
to the tuning algorithms, and say only increment by these values when you’re
within these ranges. Now, the next thing we’re
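A sketch of what such a non-uniform delta schedule might look like (a hypothetical tuner interface with made-up step sizes, just to illustrate the shape of the idea):

```python
# Non-uniform deltas for a memory knob (sizes in bytes): within each
# range, the tuner may only move the knob by that range's step, which
# keeps the action space small at every scale.
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

# (range_low, range_high, step)
DELTA_SCHEDULE = [
    (1 * KB, 1 * MB, 100 * KB),     # small values: 100 KB steps
    (1 * MB, 1 * GB, 64 * MB),      # mid-range values: 64 MB steps
    (1 * GB, 1024 * GB, 8 * GB),    # huge values: 8 GB steps
]

def step_for(value: int) -> int:
    """Return the step size the tuner should use at the current value."""
    for low, high, step in DELTA_SCHEDULE:
        if low <= value < high:
            return step
    return DELTA_SCHEDULE[-1][2]

def candidate_actions(value: int) -> list[int]:
    """Only two actions for the tuner to consider: one step up or down."""
    step = step_for(value)
    return [value + step, max(1 * KB, value - step)]
```

So at any knob value the planner only ever weighs two candidate actions, instead of the full absolute range.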
going to talk about is, how do you expose metrics or information to
the outside system, and we are going to use these
to build our cost models, and do reward calculations for
our actions we can choose. So, the most important
thing we find, is that anytime your database
system supports the ability to tune
individual sub-components, you are going to
make sure you expose individual metrics for
the sub-components. So, an example
would be in my DB2, you can set the number of
buffer pools you want, and for each buffer pool you
can tune them separately. You need to make sure that
for each buffer pool, all of the stats you care
about are exposed to you, and not so aggregated together. The biggest offender we find is actually for this, is RocksDB. So, in RocksDB, you can set, for each table you can have
multiple column families. You tune each column
family individually. So, they do expose some metrics about each
column family, but the problem, is the most important thing
you need to determine whether your performances
get better or worse are the number of reads and
writes per column family. But they’re not included in
the subcomponent metrics. Instead, you can only get this from a global metric table, where it’s been
combined together. So, now you need a lot
of training data in order to determine
whether the changes you’re making to
one particular column family is benefiting or hurting you, because you have to
then extract that from the aggregated metric. So, now we are going to talk
about how do you actually take the actions that the album select and
apply them efficiently. So the most important
thing I would say, and part of the reason why with Peloton we decided to build
a system from scratch, is that we find that
many cases changing a knob or applying
an action requires you to restart the system in order
for that to take effect. Now the commercial
systems are much better than this than
the open-source guys, but even the SQL
server Oracle and DB2, there’s enough knobs in there that require you to
have to restart. The reason why that’s
problematic is because now you need to take
into consideration, in your cost models, the time it takes to
restart the system. Or you have to ask a human whether it’s okay to restart
the system right now. So, we think that there should be never any knob that will require you to have
to restart and this includes scaling up the hardware, as well as applying changes that modify how it actually
uses the hardware. Another very important thing is that you want to have
the ability to have your replicas diverge and have
different configurations, in order to get more training data about the different configurations
you could have. Alright, so a very common setup
and databases is that for high availability
you have a master and some replicas, right? And anytime the master
goes down then you elect one of the replicas
to become the new master. So, normally you would
have these things have the exact same
configuration because anytime to master
goes down you want the replica to be able
to come up right away. But instead what you can do
to get more training data, you could have the master had the best configuration
you ever seen, but in your replicas you’re trying out other
configurations see whether they’re actually better than what’s on the
master right now. So in this case here,
the replicas could be trying out different
index selections, and seeing whether
they’re actually improving performance. Let’s say for this top guy here, this is actually doing
better than we can recognize that it’s okay for us to
deploy this on the master. Right? The reason why
we want to do this in the replica is we
don’t want to slow down the master node because we need to meet
our SLAs and SLOs. So now the problem is though, let’s say that this bottom guy here we’re trying out
a configuration that actually is bad and
therefore we’re falling behind the workload
that’s on the master. Right? So, we recognize now we are too far behind
we want to go ahead and kill that configuration and restart the system because we don’t want to have to crash, if the master crashes, we don’t want to replica have to take five minutes to
get caught backup because now that’s downtime we have to include in our system.>>[inaudible]>>Yes. Perfect. So,
there’s a bunch of other stuff you have to
consider for this like can you, if you’re in a Cloud environment, can you scale out additional BMs to collect more
training data. How do you generate
the workloads, so you’re the input sequence to the replicas are different? Right? So, there’s a whole bunch
of assistance questions to actually make this work
and I think it’s interesting. So, to finish up let’s
talk about Oracle. You might have seen
this announcement last year, where Oracle came out and said, "We have the world's first self-driving Database
Management System”. Larry Ellison got on
stage and he said, “This is the most
important thing that the Oracle Corporation has worked on in the last 20 years”. So, the other thing about
this announcement is that we had a paper in January, a few months before, where we talked about our self-driving
database system, and at no point did
Larry Ellison mention our work. So, I decided to shoot him an email and
I’m like “Hey look, what’s up with that, right?” Now, you may be
thinking it’s puerile for a professor to email
somebody and say “Hey, why didn’t you cite our work?”, But Larry Ellison and I go pretty far back
and so I thought it was okay for me to
email and let him know that I was slightly
displeased with this. Now, he didn’t respond
to this email. He never responds to
any of my emails, but that’s okay we
can still break it down and see what
he’s actually done. So, their self-driving database claims that they have for key features: They do
automatic indexing, recovery, scaling,
and query tuning. Now for the first
three items here, these are essentially
the same things solving the same kind
of problems that the self tuning databases we’re solving in the 2000s, right? And so in talking with
the Oracle developers, and reading their
marketing material, as far as I can tell these
are just the same tools that they’ve been selling before
as DBAs for the DBAs, but now just running them
in a managed environment. So that means they have
the same limitations or problems that I talked about before where they
are reactionary, and they’re not able to
do knowledge transfer. So, for this reason,
I don’t think this is actually really
self-driving databases. They’re just running your stuff automatically and playing tricks, the colors to make it look like it’s fully adaptive
and self-driving. So now, for the last one
Automatic Query Tuning, ff you’re familiar with the
work in academic literature, this is usually referred to
as Adaptive Query Processing. The idea is if you have
a very expensive Query, you run it through the optimizer
when you first start, you generate a query plan and if you notice over
time as you’re running it things it’s not as good
as you thought it was, then you run to
the optimizer again. This is not unique to Oracle. Microsoft announced last year, in Sequel Server 2017, that they have the same feature, but it’s even older
than that, right? If you go back into
the early 2000s, IBM DB2 had a project called
Leo the Learning Optimizer, where they were essentially
doing the same kind of thing, but it’s even older
than the IBM project. It actually goes back to
the 19070s with the first, one of the first relational
databases Ingress, where they are
essentially, they are running the query optimizer, over and over again for every single Tuple you would examine, and they did this not
because they were trying to be sophisticated, they were doing this because they had a really primitive
query optimizer. So, again in this case here
there’s nothing really unique about what Oracle
is claiming they can do, and I wouldn’t say that this
is self-driving at all. What I will say
though, that Oracle is in talking with developers, they are working on
a newer version of this that is certain to incorporate some of the things that I’m
talking about here. So to finish up, I think that autonomous databases are
achievable in the next decade. I think it’s a lot
of engineering work, a lot of systems research, and a lot of machine
learning stuff we can apply to make this
thing all work. I would say for anybody that’s
working on a large system, wherever it will be used, whether in academia
or in industry: anytime you add a new feature, you should not think about how a human is going to tune it; you should really be
thinking about how a machine can tune it in
an efficient manner. Am I exposing the right
information, and do I have the right controls that are necessary to make
this thing work? So with that I'll finish up, and I don't think we have
any questions right? Because we’ll do
a panel afterwards. Okay? Thank you.>>Thanks Andy. Just
hang in for a second. Let’s see if there
are a couple of pointed questions that anybody in the audience has for Andy, before the memory of the talk fades a bit with
the next two talks. So if you have any
questions though, raise your hand there
are runners who will get you a microphone because
these talks are being recorded.>>Yes. I’ll repeat the question
and it makes it easier.>>Yeah.>>Andy, thanks for
the great talk. So, you mentioned a bunch of
driving factors like there being like available
new hardware and available new interesting ways of doing more powerful models
such as Steve nuts. But from from
my experience one of the big differentiators
is also the availability of new data or of more data
that allows us to do better training and therefore build better and much
expressive models. So, I’m wondering where do you
get your data from and you have any interesting
insights of how to do better and more
targeted data collection and selection for training
better models in this context?>>So, the question is that Johannes is arguing that having a lot of training
data is the key thing to
having models that we can actually derive use from. So, what are some things
we can do to improve the collection of
replicas to try out different configurations this is collecting more training data. But this this is helping you to converge to see whether
you’re actually getting benefit from
the choices you make. In terms of the actual
metrics I think that, the systems are doing
monitoring now, we just make sure
that we push them out and build our models
based on them. So, there’s not really
anything in terms of what metrics you collect, not really anything
in there that’s different than what
existing systems do now. In the case of Auto-Tune, we don’t actually need to look at the queries or
the actual data itself, we just take the metrics, the internal performance
counters that the system generates
and spits them out. The other thing I’ll say
though is a lot of times developers just add whenever metric they think off, right? So, there’s a lot of data that it’s actually
not really useful. So, we have
statistical methods to sort of prune out the
things that actually don’t matter and focus on
the main ones and they find the signal and
the noise in that way.>>Carlos?>>Yes.>>Hey. [inaudible]
can you hear me?>>I can hear you.>>Okay. Can you
repeat the question.>>I will.>>So [inaudible]
the question is, I think there is a mental
advantage for a Cloud provider. Millions of databases.
So, how do you see a way for academia to collaborate
with our providers since, we will have the data, and
we’ll obviously want to do this kind of research because all of the advantages
you described. But, how do you give academia a shot basically to help us innovate
on this, right? Giving out data is
obviously very hard. Do you have any ideas of how, beside interns, we
can make this work? Otherwise we end up like
web search, in which there are one or two companies in the
world that do it and everyone else can talk about
something else?>>So, which company
are you working for, or did you join academia recently?>>I think the advantage is for us. I'm just wondering how we can->>All right. I think Carlos' question is, how should academics engage with major Cloud vendors
like Microsoft, and try to apply some of these techniques in
your environment? So, this is why I divided my work into two projects: taking existing systems versus building the system from scratch. For the kind of tuning that goes beyond OtterTune, you need to have the source code, you need complete control of everything. So, I think for you guys it will be hard to do this, plus SQL Server is a huge code base; I think that is something you guys have to do in-house. With OtterTune, since we're just treating the database system as a black box, I think interns are probably the right way to do this. I think that we've learned enough about SQL Server now and other systems
that we could come to you guys and say, "Hey, if you add these features." But the things we're going to ask you for are going to take longer than you can do in a summer internship. So, I don't know what
the right approach is for that.>>So, I would just add
one small thing that, it is true that we get
a lot of telemetry data.>>Yes.>>There is no question.
But as you know, we are not allowed to look
at the data or the workload. So, even though there is a lot of data somewhere in our Cloud service, we're not entitled to look at it, the query string or the actual data, unless the customer has raised a ticket, and then only for a limited time and audited, where somebody looks at it. So, it comes with this other balancing constraint on our ability to look at all the data in the classical way; what Bing or Google can do, we can't do. I think, just wanted to add that.>>So, as an aside, I
will say that again, I’m not just here
because he’s here. Just the way that AutoAdmin was the leader in this area
in the 2000s, in my visits with companies, I think Microsoft again,
in the Cloud area, is leading this effort. I’m actually surprised
at how primitive some of the things out there from major Cloud vendors in this area are.>>Well, last question. Thanks, Andy. Then,
maybe the last question.>>I can ask a question.
So Andy, you mentioned->>We can’t hear you, but I think the microphone is not on. I think that microphone.>>Can you hear me now?>>I can hear you,
I’ll just repeat it.>>Right. So,
my question was that, you mentioned that there was
an increase in the number of configuration knobs in
these various database engines, by what, a factor of
five over the years. But it’s likely that
most of the features don’t really affect the
performance as such, in the sense that the same three or four
important parameters that were there ten years back, would probably be the same
ones that you have right now. So, actually only the DBA needs to look at
those parameters, and the rest of them are, just as you have mentioned, noise amongst the signal. If you are using
machine learning techniques, it’s likely that
you will introduce many new features that
are actually not relevant towards the final goal of optimizing the
system for performance.>>So, his question is that in this graph here I’m showing
that there are hundreds of knobs. But in actuality, most of them
are probably not something that a DBA is going
to tune actively in their daily job. Then, you are saying that if we now start interfacing the database system with machine learning components, that is going to
introduce more knobs and more complexity, and
how do we handle that? So, for the first one, I would say that what we
do in OtterTune is that we basically run a simple Lasso regression and figure out the ranking of the knobs that actually matter the most
for a variety of workloads. You are right, it
is about 25 or so. But I would say that it’s more than just the
number of knobs, it is the dependencies
between them, and that just makes the
problem much more difficult. The second thing you
said was that adding in machine learning parts makes this thing even harder to tune. I would say, yes. My lofty research goal is that I think we can eliminate all of us. I don't think we actually
need humans to do very, very much other than to give
us a credit card number and some initial hints
about how to tune things. I don’t think there is anything that we would need a human
to tune these things for. So, that’s my goal, whether we get there in 10 years, I like to think we
can, but we’ll see.>>Well on that high note,
thank you, Andy.>>Thank you.>>So, our next speaker is
Professor Tim Kraska from MIT. Tim has been in our field of database systems for a long time, and his research has focused in the past, and I guess in the present as well, on hybrid human-machine database systems and big data management. Most recently though, he has been taking a very
bold agenda which is to go and look at the
database systems architecture, which is not exactly young, and say that maybe using data and machine learning techniques
we can take a hard look at these components and see how they can be architected. How can they be even better than they have been traditionally? I think we are going to
hear some of that today. I also wanted to say that
he has had many awards, I will mention some of them: the IEEE Data Engineering Best Paper Award in 2013, the Sloan Fellowship Award in 2017, and most recently
he is the winner of the 2018 VLDB Early Career
Research Contributions Award. With that, Tim take it away.>>Thank you.>>Thanks.>>Thanks. Can you all hear me? In the back as well?
It’s all good? Perfect. So, what I am going to talk about is
actually a paper we published this year in SIGMOD called The Case for
Learned Index Structures. If you have maybe seen the talk before, particularly the Microsoft people, you know half of it already. However, I also added some new stuff on multidimensional indexes for the second half. So, you can fall asleep for the first 10 minutes. The work we published in SIGMOD actually
went viral last year. It all started with a tweet by Christopher Manning
saying that “Machine Learning Just Ate
Algorithms In One Large Bite”. I just want to give
a disclaimer here up front, we by no means believe that machine learning has eaten the whole field of algorithms
and data structures. However, we do believe maybe
it has taken a little bit from it and there's a lot more interesting
work to be done. So, why are people
so excited about it? The fundamental building
blocks, essentially, of all systems, and in particular of data management systems, are data structures and algorithms. We have a whole range of these really fundamental data structures: different sorting algorithms, HashMaps, different
types of indexes, priority queues and so on. All these data structures
have in common that they essentially make
no assumptions about the data. So, now let me give you
just a very simple example. Let’s assume that in
your database system, you want to store and query all integers from 900 to 800 million. You are interested in range queries over them; you are looking into scanning, for example, all the integers from 100 to something and their corresponding records. So, very simple example. In this particular case, probably nobody here would
actually use a B-tree for it, because you know that you have all the integers in that range. Instead, what you could do is form an index structure using just the lookup key itself as an offset. So what you do is just look up into your datastore: the lookup key minus 900 gives you the offset, and you can then immediately start reading, because you know something about
the data distribution. So, now assume that you store all even integers from 900 to 800 million; the same trick still works. You take your lookup key, minus 900, divided by two, which again gives you the offset, and you can look it up immediately. So, what we did here is essentially: we had the B-tree before, which has a log(n) lookup time, so you need to traverse it, and we transformed it into something that is essentially O(1). This still holds true for other data distributions, as long as you know the empirical data distribution and can compactly represent it. So, the key insight
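The direct-offset trick can be sketched in a few lines of toy Python (my own illustration; the small data range and helper names are assumptions, not code from the talk):

```python
# Toy sketch of the "lookup key as offset" idea: if you know you store
# all even integers starting at 900, the key itself determines the array
# position, so the "index" is O(1) arithmetic instead of a B-tree.
data = list(range(900, 1000, 2))  # small stand-in for the full range

def lookup(key):
    # Known distribution: position = (key - 900) / 2.
    return data[(key - 900) // 2]

def range_scan(lo, hi):
    # Range queries become one contiguous slice of the sorted array.
    return data[(lo - 900) // 2 : (hi - 900) // 2 + 1]
```

With a different known distribution you would only swap the arithmetic, which is exactly the role a learned model can take over.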
is that if you take this black box of the data and now open the black box by saying, okay, if you know something about your empirical data distribution, about the data you store, you can build much, much better data structures for it, and in some cases even transform the complexity class. Now obviously, it's not efficient to build one system from scratch for every single use case you have. This is not economical, but this is exactly where machine learning comes in, because it provides us the toolbox to learn these types of models and then take advantage of them afterwards. So, here I will use
B-trees as a main example, and just for
simplicity right now, I assume that we have
an in-memory immutable B-tree. There are no inserts and no paging, just for simplification; I will talk about inserts and paging later on. So, if you think about what a B-tree actually does: given a key, the B-tree finds you the position inside the sorted array. Normally, we page the sorted array, meaning that we put it into different pages, and then, for example, you only index the first key inside each page, just for efficiency reasons, so that the B-tree doesn't blow up entirely. So, in that sense, what a B-tree does is: given the key, it finds me the page. Then, I need to search inside the page to find the exact key. So, it's already
an approximate data structure: given a key, you get the position inside the sorted array, and then you have to search inside the page. So, there is a guarantee that you will find it there. In that sense, I can actually replace the B-tree with a model: as long as, given a key, the model predicts me the position where this key might be inside the sorted array, I'm good. Now, the question is
just like before: how do I still get the guarantee that I can find my key inside the sorted array? Where is it? In the previous case, the page defined the min-max error of where to find the key. First of all, it's actually surprisingly simple. Let's assume I have a monotonically increasing model. What that means is, I can run every key through the model, look at how far it is off, and simply remember the maximum over- and under-prediction, which gives me a strong error bound. In the end, it turns out
but we don’t even need that because given that the whole
thing is sorted anyway, I get a prediction and then
I can do something like exponential search to actually find the key, wherever it is, without needing to search everything. If I have the bound, I can do binary search within it; if I don't have the bound, I can do something like exponential steps. So, it turns out what
we're actually doing here with the index is modeling the CDF of the data. Given a key, it finds me the position inside the sorted array, which is nothing else than estimating the probability mass of all keys equal to or less than the key I'm looking up, times the number of keys I have in total, which gives me the position inside the sorted array. This insight is
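As a hedged toy sketch (my own illustration, not the paper's code), the CDF view plus the min-max error bound might look like this:

```python
import bisect

# Toy sketch: an index as a CDF model. The position of a key in the
# sorted array is approximately P(X <= key) * N; remembering the worst
# over-/under-prediction turns the approximation into a bounded search.
keys = sorted(range(0, 2000, 2))  # 1000 even keys: 0, 2, ..., 1998
N = len(keys)

def cdf_model(key):
    # Hand-fitted linear "model" of this toy data's empirical CDF.
    return min(max(key / 2000.0, 0.0), 1.0)

def predict(key):
    return min(N - 1, int(cdf_model(key) * N))

# Maximum over-/under-prediction across all keys: the strong error bound.
err = max(abs(predict(k) - i) for i, k in enumerate(keys))

def lookup(key):
    # Binary-search only inside the model's error window.
    lo, hi = max(0, predict(key) - err), min(N, predict(key) + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < N and keys[i] == key else -1
```

Here the linear model happens to be near-perfect, so the search window is tiny; on real data, `err` is the bound the talk describes.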
interesting because this allows us to
take advantage of this whole literature out there on how CDFs can be modeled. Even more interesting, is that a B-tree itself can also be
seen as a model already. It’s a form of a regression tree, and it’s just like a special
form of a special structure, but it's also a model. So, everything we do right now you can actually see as models; I'm just saying you can
use other things as well. So, what does this mean? This potentially makes some people here very happy: database people were actually the very first to do large-scale machine learning. Right? Because we were
the first to build large B-trees, which is nothing else
than building a model. So, what might be
the potential of using other models than the normal
regression trees or B-trees? There are a whole bunch of them. First of all, they might lead to smaller indexes if for example, in the case before, if we know
we have all the integers, I can build a much more
compact representation of it. The only thing I need
to select the offset. Maybe an intercept, and
the slope, and I’m done. It’s much more space efficient. I might get faster lookups. I might get more parallelism, and that is mainly
because for many models, I’m transforming the
very heavy if statements B-tree needs to traverse into multiplications
and additions. I might be able to take advantage of
hardware accelerators, and we heard in
the morning already that there’s a lot of
excitement going on in building FPGAs especially
for Machine Learning. So, now, by transforming them, I might have been advantaged to actually leverage them
also for database systems. There’s also a chance of
having cheaper inserts; I will talk about that later. So, we tried that out very early on, after having
this observation in TensorFlow. So, we built a TensorFlow ReLU, two layer fully connected,
32 neurons wide, ReLU activated. We trained it over 200 million
web-server log records; we wanted to index by timestamp, and essentially the goal of the index was: given a timestamp, it gave me the position
inside the sorted array. So, a cache-optimized
B-tree for this task, roughly takes 250 nanoseconds. Any guesses how long
TensorFlow takes? The first attempt: 80,000 nanoseconds. So, we successfully made it almost three orders of magnitude slower, which is not very successful. The reasons for that are manifold. First of all, TensorFlow is designed for very large models, not for these tiny
things which are supposed to run in
the range of nanoseconds. The search doesn't take advantage of the fact that we actually get pretty close with the prediction, even though our min-max error is very, very large; most of the time we are actually close by, so a binary search over the full error range throws you off. Then, B-trees are very cache efficient; if you have a large model, you lose that as well. Plus, B-trees are great at overfitting. In this case, overfitting, as long as you only have lookups and no inserts, is actually a good thing, because the more you overfit, the better you can look up the data you have. Here, I am only
addressing the last two, because I need them later on to explain how multidimensional
indexes work. So, to overcome the problem, given that we need a very efficient way to overfit to something, we came up with this structure which we call the recursive model index. You can best think of it as an expert model, where one model, the top one, picks another model which is an expert for a certain range of the data. Right. So, we have one model at the top which takes the key and makes a prediction; based on the prediction, you pick another model which knows that area better, which might pick another
model. All right. So, the nice thing about that is, it can be very
efficiently implemented. Let's assume I have a top model f0, and the models on stage two are just an array of models. So, what I'm actually doing is: I execute the first model, which gives me an index position into the array of models of the second stage, which I then execute, which gives me the final position, if I have a two-stage model. If I now don't use neural nets for everything, but let's assume just linear regression, the whole lookup, as you can see down here, takes me two multiplies, two additions, and one array lookup. This is the execution of
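A hedged sketch of that two-stage lookup (my own toy code; the class name and the toy data are assumptions, not from the talk):

```python
# Toy two-stage recursive model index (RMI): the top model picks a
# second-stage "expert", and that expert predicts the position. With
# linear models, the lookup is two multiply-adds plus one array lookup.
class TwoStageRMI:
    def __init__(self, top, stage2):
        self.top = top        # (slope, intercept) of the root model
        self.stage2 = stage2  # array of (slope, intercept) expert models

    def predict(self, key):
        a0, b0 = self.top
        # First multiply-add: index into the array of stage-two models.
        i = min(len(self.stage2) - 1, max(0, int(a0 * key + b0)))
        a1, b1 = self.stage2[i]
        # Second multiply-add: the expert's position prediction.
        return int(a1 * key + b1)

# Toy data: keys 0, 2, 4, ... stored at position key / 2; two experts.
rmi = TwoStageRMI((0.001, 0.0), [(0.5, 0.0), (0.5, 0.0)])
```

The expert array replaces the B-tree's chain of branching comparisons with plain arithmetic, which is what makes the lookup so cheap.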
the index, that’s it. There’s some code necessary
for boundary checks, but it turns out you
can also avoid that. You just need to change
the learning a little bit so that the boundaries do the checks only when necessary. Furthermore, you can, in some cases, even replace some of the models using normal B-trees. Right. So, if you say, "Okay, in a certain area the normal linear model, or neural net, or whatever you are using (everything is fair game) doesn't work very well," you just replace that piece of it using a normal B-tree, and you get the normal guarantees for it. The nice thing of doing that is that the worst-case performance in the end is the one of the B-tree, because the worst case just degrades to a B-tree. We started out with that, but it turns out that for most data sets right now, we never use that anymore, because the complexity of mixing different model types normally doesn't pay off. It's easier to just
to be neural-nets. The answer is clearly, no. You can use whatever fits and this is essentially what
we are doing right now. We normally use
very simple linear models, sometimes multi-variant models. In particular cases, there are more complex
data is required, sometimes neural-nets,
but very simple ones. The neural nets’ are actually
the most complicated things are the activation function, because they are
often very expensive to execute on the CPU. So, it doesn’t work. What we did here, is just like we compared to a cash optimized, and read optimized Be-tree tuned out to get
the best lookup performance, and compared it to our learned indexes, and what you see is that overall we get up to a factor of two faster and save up to one order of magnitude or even more in main memory. We also compared against other state-of-the-art index structures, for example FAST, lookup tables, and fixed-size B-trees with interpolation search, and here it plots the alternatives in regard to lookup time versus size. The lower left corner is where you want to be, and you can see that the learned index is the dominant solution. So what about
our assumption I did in the beginning about
'Inserts' and 'Paging'. First of all, inserts might not be such a big problem. So let's assume that your inserts roughly follow the data distribution that you have learned. In that sense, you don't need to rebalance the model, right? You don't need to change it. You can just reuse the model and make space wherever you want to insert something. This is particularly easy to see if you have mostly append-like inserts. Let's assume you have timestamps from IoT devices. The timestamps normally increase. They might increase with the same patterns that you have
observed in the past, so the model simply generalizes. With a B-tree, you would have to rebalance all the time. Of course, the big
question becomes if it doesn’t follow exactly
the data distribution, how do you still rebalance? But there’s something like robust machine learning which
takes care of that, so there are
new approaches there. The second question
is paging. Often what we have is that we want to page the data, particularly if it's on disk. Here, one simple solution is actually that you use the model itself to determine the pages. Normally, a B-tree dictates what goes into a page, right? If you have an index-organized table, the index tells you what you should put on the page. You can do the same thing here, and then you get around most of the problems you would otherwise have with paging. So we started with B-trees, but it turned out
like the same idea also applies to many, many other data structures: joins, sorting, hash maps, Bloom filters, priority queues. For example, one of my colleagues, Mohammed, is working on new learned scheduling algorithms for data analytics. There's work on
cache policies by Google. We are looking into query optimization, what
we can do there. There’s a whole range
where you can apply very similar ideas. Most of the time, it’s
all about learning the CDF of your data and
taking advantage of that. So here, I just want to mention something else which is
multidimensional indexes. So, our big hope when we
started that actually was not to get more efficient index structures for
a single dimension. Our hope was always like, okay, the moment we will
go multidimensional, we assumed we would see
the bigger benefits there, because with machine
learning in general, most of the models are really, really good at capturing all the complex relationships the data has, right? So, machine learning was just predestined, so to say: "Okay, it should perform very well on multidimensional data." The biggest problem we faced in the beginning is just
that unfortunately, even if we learn
all these complex correlations, in the end, there is only one order on disk, which is one-dimensional. Of course, you can make this more complex, but in the end, you have one order for scanning. This is a big fundamental limitation, because at some point, it doesn't matter how you do it, you need to transform what you have in the multidimensional space into one order on disk, and this is in the end what matters, or the one order you
have in main memory. So how can you do that? Let's assume those are the different data points you have, and let's assume that this is maybe some order data. You have the Order Amount and the Order Zip Code, and those are the two dimensions you make queries over. One approach: you could say, okay, if the two attributes are equally important to you, you project them just through the middle, which you can see as a form of PCA because you give each of them equal weight, and then you have one projection into a one-dimensional space, and this is the sorting order you use on disk. You can immediately
see that in some cases this is fine, and in others, this will never work very well. For example, if you know that most of the queries you do are for the Order Amount and not for the Order Zip Code, the best thing you can actually do is order all your data by the Order Amount and disregard the Zip Code. If you mostly look up by Zip Code, you would do the opposite: you order everything by Zip Code but you ignore the Order Amount. So, there's a trade-off
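The weighted-projection idea can be sketched with toy Python (my own illustration; the data and weight values are made up):

```python
# Toy sketch: projecting 2D points (order_amount, zip_code) onto one
# sort order. The weights encode how important each attribute is to the
# workload; equal weights resemble the PCA-like "through the middle"
# projection, while weights (1, 0) sort purely by order amount.
def sort_key(point, weights):
    amount, zip_code = point
    w_amount, w_zip = weights
    return w_amount * amount + w_zip * zip_code

orders = [(120, 10115), (80, 94103), (200, 10115), (50, 60601)]

by_amount = sorted(orders, key=lambda p: sort_key(p, (1.0, 0.0)))
by_zip = sorted(orders, key=lambda p: sort_key(p, (0.0, 1.0)))
```

Choosing the weights per block, rather than globally, is what the mixed layout described next amounts to.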
in how you actually serialize the data and how you lay it out. The interesting thing is, we found that you can actually even mix them. For example, you can say: okay, I project onto Zip Code, then I build larger blocks, and within a block, maybe I want to use a different projection, in this case Order Amount. So, I first split up by Zip Code and then project down to Order Amount; every block is now internally ordered by the Order Amount. And again, I can trade these off, and what's best highly depends on your workload. If you never touch the Zip Code, you should probably not order by it because it doesn't
provide you any benefit. This led to our final design. Right now, what we have for
multidimensional indexes, is we start out with
the data points we have. We have the first model which
is called the 'Projector', which, based on the workload, projects the data into the order we will have later on disk. What we do there is exactly the trick I showed you before: we allow columns to split it down into blocks. So we start with the root node, which figures out the general direction of the data depending on what queries are most common, then it projects everything onto this dimension. Then we repeat the process and make direction projections for all the different blocks separately again. Often they're orthogonal to each other, just because of how the data is laid out. So we get, for example, these projections down here; you can see that in every block, we now actually have a different order, and therefore models of models again to define or find some point in that space later on disk. Also interesting, this can be recursed, so you can make
it as deep as you want. And this model itself here is the same data structure
that I showed you before for the RMI. This is again the expert model at the top which picks another expert which organizes things differently. This, if you have only linear models, can be executed, again for two-stage models, with two multiplies, two additions, and one array lookup. So it's extremely fast. However, if you
have the Projector, it still doesn't tell us where to find something in a dense array, because there might still be space in between. So what we do in order to get to the dense array is nothing else than what we had before: we just build a 'Locator', again using an RMI model. So essentially, we have everything in one-dimensional space, and now we build the same B-tree-like model we had before. So, this is pretty new, but some very early initial
performance number shows it’s pretty promising. So, for one million
data points right now, you get roughly 200 nanosecond
lookups for point queries. This is a speedup of two to 10x over R-Trees, depending on how you configure them, while you have an order of magnitude in space savings. Of course, there's more to come. We need more workloads, and essentially, this lookup here is without considering the workload itself yet, because we haven't done that; it just assumes everything is uniformly likely, which is actually
the worst case scenario for us. Also interesting to note: this data structure, in contrast to an R-Tree, doesn't grow with the number of dimensions you have. Instead, it should grow with the rank of the data. So, if you, for example, add the same attribute again, just as a full duplicate, it wouldn't make any change to our layout or the model complexity. So, for future work, there's much more to be done; we are heavily looking
into sorting and joining. So essentially, what we want to figure out is: can we improve every single component of a database system using models, and what would it look like if, in the end, you put everything together? The core insight is: if you have an efficient model of your data distribution, how would you actually build your data structures or algorithms with that? You have an oracle which tells you the probability mass for a given key in your data set. Provided this oracle, you would design many of the algorithms and data structures you have in a very, very different way. The other thing is, it's all
about continuous functions. If you compare that now, for example, to order-preserving hashing, perfect hashing, locality-sensitive hashing, all the different variants you have, they normally don't take advantage of the fact that with continuous functions you can much, much better approximate the rough position. So it's like a form of interpolation search, but often done with regression or more and more complex shapes. If you do that, you have data structures which learn, which automatically adapt to the data you have, and which can potentially run on TPUs or GPUs or FPGAs, whatever your preferred machine learning platform is. In some cases, this can even lower the complexity class for storage as well as lookup time; you saw that, for example, in the case of the direct array lookup, and I can show you similar things for the multidimensional cases now as well. However, just a warning ahead: this is not an almighty solution. There are of course datasets where a traditional B-tree might still be better, and there is a lot to be done. But it's, I think,
a very exciting direction. I’m also very happy to
announce that we are starting DSAIL, the Data Systems for AI Lab, at MIT, and that Microsoft is one of our major sponsors for it. It involves half systems people, half ML people, and the key questions we are investigating are: how can we use machine learning for systems, and how can we build systems for machine learning? With that, I think I'm
more or less in time. I’d like to conclude. So, what I showed you is
a new approach to indexing. I think it's a framework to rethink many existing data structures and algorithms, and under certain conditions it might change the complexity class for a particular instance of the data (that's always important to say), and the idea might have implications within and outside database systems. Thank you.>>Any questions?
There is a question. Peter just, let me
get you a microphone.>>I can also repeat
it, it’s fine.>>Thank you. So, can you explain a little bit
about what it looks like when you’re
actually learning these models versus the process
of building a B-tree? So, if you're dynamically maintaining these things on the fly, it's a little unclear from your talk what that looks like. If someone gives you a big blob
of data, what do you do?>>Right. To create an index: what I mainly talked to you about was only read-only data. So, the simplest thing is, you sort your data and then,
if you use for example, the two-stage linear model, the top model you can train
in one pass over the data. Actually you don’t even
need the full data, you can skip pieces of it because the sample is often
more than enough. So, it’s just one pass over it because
it’s a closed form, there’s no gradient
descent or anything. Then, what you use is,
you use the top model, go over all the data
points you have, put every data point
into the bucket of what expert it belongs
one level down, this is our greedy
training approach and then for every bucket
you repeat the process. So, you go over the data
as many times as you have stages. So, if you have two-stage models, you have two passes over the data; if you have three-stage models, three passes over the data, and so on. It goes extremely fast. I mean, a B-tree you can build
in one pass over the data if everything is sorted, but a B-tree doesn't work on just a sample. For inserts, there's another paper
by Carsten and some of my collaborators
at Brown where we look into how we can do inserts more efficiently. The easiest thing is, you have a delta index: instead of applying the inserts every single time and retraining everything, you put them in a delta, and then at some point you figure out
whether to flush it, by retraining, or merging, or whatever.>>There's a question
back there at the end, and then Jen has a question.>>So, Tim, you're mentioning that you'd like to know the CDFs of the single-column distributions in advance. Isn't that already given to you by the histograms that most database systems already maintain for these columns? So, I didn't see
you talking about the interaction
between histograms and the machine learning process.>>Right. So histograms are in the end still
very coarse-grained. So, the question is: if you would create a histogram with the granularity we need for actually looking the data up, then it's a question of how you search the histogram itself. And if you solve that problem, what you actually get is a B-tree. So, I can show that you can transform one into the other, because you need the lookup structures over it. So, histograms themselves
don’t help you that much. There’s something
like lookup tables, which are like the thing in between, but I had numbers for them; they also don't
help you that much. The interesting thing is now: let's assume I build my better model for it, say a two-stage model for the index. I can use the same model as my histogram. So, instead of having a histogram, which I cannot use as an index, I can use my index as a histogram, and the performance penalty is almost non-existent. Even better, I can use the same histograms to plug into my query optimizer, and I can do sorting with them. Essentially, if I have one CDF model, I can use the same model in many, many different places of my database system: joining, sorting, indexing, query optimization, and other things. The key challenge
though, is that it's very simple for one dimension and a little bit harder for multiple dimensions; then there are conditional probabilities and so on to deal with.>>Does it work? Back
to Peter’s question, how about other issues
such as locking, logging, compression
and serialization?>>Logging, locking,
compression and serialization. Let me start with compression because
it’s the best story. So, let’s assume that your index, given the key, gives you the position where
to find that key. Let’s assume my model is in a way built that I can
ask the reverse question. Given a position, tell me what
key is most likely there. If my model would be perfect, I don’t need my keys anymore. It compresses my data, essentially, because
given a position, I know which key I expect there. Why should I store
the keys anymore? If the model is not
perfect, what I can do is delta-compress against the expected key
at that position. So again, the model becomes
my compression technique. So, that’s compression. Locking, there’s a big question
about how you actually make B-trees serializable and everything
else. ARIES is the paper everybody hates and still everybody uses. But, in this way, because you deal with the inserts in a delayed or
approximate fashion anyway, my assumption is that much of this complexity of locking
is not actually needed. I think Microsoft is heavily
looking into that, I heard. So, they are very, very
excited about that. Logging, like
transaction processing, is still
something you will need, because you still want serializability guarantees
and everything to be consistent, so I doubt that Machine Learning, at least I cannot
think of an easy way right now to improve that.>>Well, with that, Tim thanks very much
for an exciting talk.>>Thanks. Our last speaker for
this session is Matei Zaharia, and he’s a faculty member
at Stanford University, and previously, he
was on the MIT faculty. During his PhD, as you know, he started the Apache Spark
computing engine as well as worked on a number of open source projects,
such as Mesos. Apache Spark led to
the founding of a company
called Databricks, for which he is the Chief
Technologist as well. He has had a number of awards: ACM Doctoral
Dissertation Award, the VMware Systems
Research Award and the Daytona Graysort
world record. More recently, he
has been working on asking the question, to have machine learning
and the data scientists be well-supported on the data analytic platform,
what needs to happen? What are the gaps? And he’s going to talk about that today. So, after having
finished the two talks, looking at how ML or deep analytic techniques could help the database internals, we’re going to shift gears and listen to him to see
what we need to do, beyond
the querying and other capabilities that
analytic systems already give, to make it easy for the data
scientist. Matei.>>Yes.>>Take it away.>>All right. Thanks, Surajit. Let me know if you can’t hear me. Okay. So, yeah, as Surajit said, I’m going to talk
about infrastructure for usable machine learning, and I’ll actually start by talking about a bunch of research that we’re doing in a lab on this at
Stanford called DAWN. Microsoft is one of the generous sponsors of
DAWN, so we appreciate that. And then, I’ll also
briefly put on my industry hat and
talk about a type of system that basically
hundreds of organizations are building in industry and that there isn’t a lot
of research about. So, I think it’s something
that we should look at as researchers as well, although I am looking
at it in industry. So, it’s really the golden age
of machine learning. There are incredible advances
in image recognition, natural language,
planning, yada yada yada. It’s a great story and
they’re starting to have society-scale impact.
But if you look at it
a little bit more carefully, this statement comes
with a caveat. So, it’s the golden age of machine learning for
the best-funded, best-trained engineering teams. If you are a company with
tens of thousands of engineers, it’s awesome, you’re
happy, you can do it; but it’s still very
difficult for anyone else to actually use machine
learning and have any kind of real-world impact. So, building machine
learning products is still very difficult
and very expensive. All the major successes, the things that actually make it into products that
make a difference, things like Siri, or Alexa, or Autopilot require hundreds
to thousands of engineers just working on them continuously to build and
maintain those systems. And the interesting thing, if you look at it is, what are those engineers doing? Are they all sitting
around a whiteboard doing integrals and drawing
neural networks and stuff? They’re actually not doing that. They’re not really doing the stuff you learn in
a machine learning class. Most of the stuff they’re
doing that’s really expensive is data preparation, quality assurance, debugging
and productionization. So, that’s what is really
needed to feed these systems, and it’s not modeling. And so, just a domain expert in a specific area can’t easily build machine
learning products. So, if you look at this from
a research perspective, it’s actually
a really interesting thing to look at because it means that the problems that
people have in production, any, say, enterprise user
you talk with, you ask,
machine learning? What’s hard?” The problems
that they face are not the problems that machine
learning researchers primarily work on. Those researchers
work on settings where you already have a dataset, you already have
a target metric that everyone agrees is the
right one to optimize, and you don’t even care about
productionizing the model, you just care about coming
up with a training method; and that is not what
anyone actually using this stuff
for real has to do. So, is this really
a big mismatch where, as a systems or
data management researcher, you can go in and have an impact. We’re not the only
ones to say this. Another example of a paper that says this is actually
this one from Google, where they talk about what’s involved in building these
systems and they say, “Only a fraction of the real-world ML systems
are the ML code.” All this other stuff around that is what takes a lot of time, is expensive, and is hard to do. So, in the Stanford DAWN Project, it’s a project with
three other PIs: Peter Bailis, Chris Ré, Kunle and myself. We’re looking at
specifically this problem: all this stuff needed
around the machine learning algorithms to build
production applications to enable domain experts
to build them. And we want to make
it easy to build them without a PhD in
machine learning, but also importantly without
being an expert in systems, and data management and
hardware to get all that other infrastructure
around it to work. So, in this talk, I’m
going to talk about a few pieces of what
we’re building in DAWN, just to give you examples. I won’t have time to go
into a lot of detail, but just what are these problems outside of traditional
machine learning. And then, I’ll also
talk, as I said, about some stuff happening
in industry in this space. So, in DAWN, we are building a
software stack that spans all these phases from data acquisition
to production. And the first thing I’ll talk
about to give you a sense of what these problems
can look like, it’s actually one of Chris
Ré’s projects called Snorkel, which is at the data
acquisition phase. And this is about acquiring
or using training data. So, training data is obviously the key to
machine learning, and the places where
it’s been most successful are the places where it’s very easy and cheap to obtain high volume
training data. So, image search for images
that occur on the Internet, not for things like
medical images. There’s so much data, you can look at it, you can build really good models. Speech: people are
talking all the time. I’m talking right now,
you can record me, you could probably build a model of what I’m going to say. Games: you just play some video games
against yourself, that’s good, you’ll
get lots of data. On the other hand, a lot of the business applications
that actually matter, are ones where getting labeled data, especially, is
quite expensive. So, medical images,
you can’t just google them and see lots of x-rays and start
learning from those, because there might
not even be that many patients that have
a certain disease. Document understanding,
you need, say a lawyer, to sit down and read
a piece of text and tell you what the law
would interpret it like, and that person is going to cost you thousands
of dollars per hour. You can’t just get
people to click on things for one cent
an hour and so on. So, the question in the
Snorkel project is, how can we leverage data that’s expensive for humans
to label at scale? And this labeling can
easily be 90 percent of the cost in many business applications of machine learning. So, Chris’ project, Snorkel basically, looks at
a different interface for labeling which is called
Labeling functions. So, instead of asking humans
to just give you Labels, look at all these examples
and give me a zero or one for each one based on
your expertise as a human. Instead, Snorkel asks them to provide labeling functions, which are short programs from a computer
science standpoint that can give you a guess at a Label but they
may not always be accurate. So, for example, one thing you could do is
you could sit down with say medical doctors and show them some notes about patients and instead of
asking for each one, is this heart disease or not? You can ask them, hey, what do you look for that lets you think that
this is heart disease? Then you can turn that into a little Python function like searching for X or
something like that. Then Snorkel is a training system that can take a bunch of these functions and
it simultaneously learns when each one
is right and wrong, how noisy it is and the target
model for the data. So, it can incorporate this uncertainty from
these functions. But the benefit of
this interface is that just by sitting down and coding a bunch
of these functions after you interview
a domain expert, you can then apply
them to millions of unlabeled examples and documents. So, for example,
Chris’s group sat down with researchers in
the medical school at Stanford who are
doing these projects to automatically
understand document or case notes or things
like that and in just four hours of
writing stuff with them, they would match
basically months or years of hand labeling,
where poor grad students in these fields would, as part of their PhD, sit down for two years labeling documents
to build an automatic system for understanding them. So, that kind of impact
obviously saves a lot of time and money, and across
a variety of datasets, Snorkel can match
basically systems that have been trained
on labeled data just by using a bunch of these
functions and millions of examples that were not labeled
by humans at all before. So, this is
a very brief overview. Chris actually has
a whole research program around this called data
programming, which is about automatically preparing training data as a first-class concept, and
it’s also all open source. So, you can find out about it online, but this is an example of the type of problem that real users of machine
learning will face. As another example
of a point problem, I’ll talk about production and in particular
serving and I’ll talk about this project
NoScope which is a joint project between
myself and Peter Bailis. So, machine-learning models
are very accurate, especially the new deep
learning ones; in many cases, the larger the models
are, the more accurate they are, and so people
want to deploy them. But many times
the actual inference step is the bottleneck
and is going to be the most expensive part. This is why you see
projects like Brainwave and so on that try to
accelerate that inference. So, in this project
we looked at using CNNs to do queries on video, something a lot of
people want to do. CNNs are really good at recognizing and labeling
objects now, but the problem is these really good CNNs
are also expensive. So, they can only process
one video stream in real time on a large server-class GPU. That’s what’s
considered real time inference in the computer
vision community and of course if you
have, say, millions of hours of video or
lots of video streams, you don’t want to stand
up a huge GPU cluster to monitor them, and even for
a modest-sized building this would cost
millions of dollars. So, in this project, NoScope
looks at how you can get orders-of-magnitude faster video queries with
a minimal loss in accuracy, given these black-box CNN models that you know are really
accurate but expensive. So, we use a bunch of
techniques to do this, but one of the key techniques
is model specialization. So, basically, we’re given this model that’s good
at recognizing things. You don’t really
necessarily need to know what it does internally and
we’re also given a query. For example, I just want to count cars that are driving by my building and we’re also given a bunch of data
to run this query on, which in this case is
the specific video frames from the camera
outside my building. So, what we do with this, is we use this big model. We label a few samples from the data with it
and we use it to train a much smaller specialized model and this model is
specialized to my query. It doesn’t need to
recognize pandas or motorcycles
or things like that, it just needs to recognize cars. It’s also specialized to
my data distribution. For example, maybe my camera is looking down at the street, so I don’t need to worry about cars that are flipped upside
down and things like that. So, these are reasons why from a machine learning
standpoint you can have a smaller capacity
model that is pretty accurate for this
particular query and the other cool thing
about this model is we can train it to also
output a confidence score. So, when it’s not sure
about the label of a frame, we can always call
the original model. It turns out that by tuning
basically the size of this model and
the confidence thresholds, you can actually capture most of the accuracy of the big model and still save 99% of
the invocations to it. So, we also designed
a cost-based optimizer that can solve this problem
of exactly which specialized model to use
and which thresholds to set. Basically, the results
from this are that, depending on
your accuracy target, you can get large speedups
often an order of magnitude or more over these original models and still capture
a lot of the accuracy. So, we evaluated seven streams in the first paper
we had on it, and basically in the best case you could go
thousands of times faster and keep 95 percent
of the accuracy. Even in the worst case,
you can go about an order of magnitude faster. So, it’s a pretty powerful
technique to use for inference and it’s easy to apply
probably beyond video as well. Just as an example of extending this further and ongoing work
which is on arXiv, we’re also extending this to more complex SQL-like queries where we see a query
and we use a mix of techniques including
model specialization and approximate
query processing to design a very fast inference
pipeline for that query that meets
a certain target accuracy. So, you can read
about that as well to see how you can combine
these different things. It does a little bit more than just model
specialization and that. So, that’s an example on
the inference side of the problems that are needed to actually use these
models in production. The next step I’ll talk about
and the final DAWN example, is
software development, and the approach we have
several projects looking at here is designing
end-to-end compilers that let productivity programmers get
excellent performance and basically build
a production-grade application. So, I’ll talk specifically about one of my projects here, which is called Weld. So, if you look at the data science machine
learning space, there are a ton of different processing steps you want to do with data and there’s
a huge ecosystem of libraries, and also you want to
experiment really quickly. So, programmer productivity
is super important and the main technique we
have for programmer
productivity in general is composition. It’s the only way that we’ve made it more efficient to
develop software. So, ML app developers will compose functions from
all these high-level libraries, thousands of Python packages, R packages, Spark
packages, et cetera. So, for example, you
might use something like Pandas in Python to
clean up some data, you might use NumPy to do some operations on
it, to normalize it. Then, you might use
a machine learning library like Scikit-learn and so you’ll be
composing these functions. Now, the interesting thing
about this is that it creates an important problem for optimization
which is even though each individual function
you call might be highly optimized
and they often are, these numerical routines
are very carefully tuned, your end-to-end pipeline
could be extremely inefficient, and that’s bad. It means that you’re
training will be slow and then your production
serving will be slow. In many companies actually
the data scientists have to throw their
code over a fence to a production software
engineering team that rewrites it to make it
efficient and that’s obviously not going to be cheap or and it’s not
something we want to do. So, the one reason this
happens is just because the traditional interface
for composing these libraries
doesn’t give you a lot of room to optimize. In particular,
most libraries have these interfaces written in
terms of function calls, that’s the abstraction in
every programming language and to write a library
that operates on big data, you give your function a pointer to data that’s sitting
somewhere in memory. That’s usually how you
pass things still. So, this is an example of a bunch of functions
I’m calling in Python. I’m going to parse and
then filter some data and then compute an average and then this is what
actually happens. So, these functions in Pandas and NumPy are actually all written in C; they’re well optimized. But, what’s really
going to happen is I’m going to call
each one separately and scan through the
data to read it out of memory and then write it back each time I go through it. So, for example, parse CSV, the interface is I
give you a pointer to the input string and I give you a pointer to an output buffer
and you parse it. So, it has to do basically two scans through
the data. Then dropna: again, the interface is,
I give you an array, you give me a new array, so it’s doing two passes over the data
again, and finally for the mean, I’m going to do a single pass over the array. Of course, memory access time dominates, especially
for big data. So, in a real application, this can lead to a significant
slowdown, and we actually measured it on some
example workloads in popular frameworks: even though each
operator is optimized, they use dozens of
operators and you get these very large overheads from just data movement across them. So, in the Weld
project at Stanford, what we’re doing is we’re
basically designing a common intermediate
representation and a common run-time for data analytics libraries similar to like a language
virtual machine, but designed for
data-parallel code, and the idea is these libraries submit the computation they want to do to this runtime engine; it lazily collects
the computation from different libraries
and then it can optimize it for
different backends, and this can lead to
significant speed-ups as well. So, this is an example of
a Pandas and NumPy pipeline combining operators
that are all written in C, but if you turn on Weld, you get a solid
speedup on one thread. If you turn on cross-library
optimization in Weld, you get another large factor, and then Weld’s intermediate
language is actually a functional language
that’s data-parallel. So, it’s trivial to
safely parallelize it. So, you can also turn on multi-threading underneath
these existing libraries and get a large speedup. We have a few papers on the
system and also on how we optimize underneath it
for real workloads. Then again, so just to give you a sense
of some ongoing work: this shows that optimizing across libraries is
useful and gives real speedups, and we tried to make
this as lightweight as possible to give users a really clean and simple intermediate language
they can write stuff in, but it’s a bunch of
work for them to instrument and modify
all these libraries to use it. So, actually in this ongoing work where we’re doing
“Weld without Weld,” we have an abstraction called
splitability annotations, and this is a way to enable
most of the optimizations that Weld gives you on
unmodified black-box functions. We don’t even need the source
code of your function. So, for example in
MKL, the Intel Math Kernel Library, there are all these hand-optimized routines
for linear algebra. One of them is actually adding a bunch of
vectors and they’re really fast and if
my program just needed to add two vectors,
this would be awesome. But my program
actually needs to do hundreds of linear
algebra operations and these functions
are designed to take the whole thing at once.
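The memory-traffic issue here can be made concrete with a small sketch. This is a hypothetical illustration in plain NumPy, not the actual splitability-annotation system (the function names and chunk size are made up): it contrasts composing whole-array operations, where every call scans full arrays and materializes full-size temporaries, with running the same element-wise work on cache-sized chunks.

```python
import numpy as np

# Hypothetical workload: element-wise sum of several large vectors,
# standing in for a program that chains many linear algebra calls.
vectors = [np.random.rand(1_000_000) for _ in range(4)]

def add_all_whole(arrays):
    # Whole-array composition: every add scans full operands and
    # materializes a full-size temporary, so data keeps leaving cache.
    out = arrays[0].copy()
    for a in arrays[1:]:
        out = out + a  # one full pass over memory per call
    return out

def add_all_chunked(arrays, chunk=64 * 1024):
    # Split the same element-wise ops into cache-sized chunks and
    # pipeline them, so intermediate results stay hot in cache.
    n = len(arrays[0])
    out = np.empty(n)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        acc = arrays[0][start:end].copy()
        for a in arrays[1:]:
            acc += a[start:end]
        out[start:end] = acc
    return out
```

Both functions compute the same result; the chunked version touches each intermediate while it is still in cache, which is the kind of pipelining that splitting element-wise calls enables without changing the library code.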
So, it’s not that good. So, what we can do and
this might be my program, it’s adding lots of
vectors together. What we can do with
splitability annotations is we add an annotation
to each function that says how we can split up the data to call the function
on little chunks of it. This is very similar to vectorized execution in databases with little batches. It’s just adding that annotation to say how we can split this up. In this case, it means these are arrays that we’re
adding element-wise, so you can split them
any way you want into little chunks as long as the chunks have
the same shape across them. So, we have this type system here that can express more
complicated ways of splitting, and once you do that, you can actually get
significant speed-ups basically from pipelining underneath
these functions. So, in this case, even though MKL itself
is multi-threaded, it’s well optimized and so on, we got in this case a factor
of eight speedup; this is actually in
a simple workload, but we also saw it in a bunch
of real workloads by pipelining the execution
of them essentially and it’s pretty cool
because we never had to change the code of MKL. So, this is still
work in progress but the interesting result
from this is it can often get competitive performance with systems like Weld or XLA, or other compiler-based systems, without rewriting these libraries. So, it’s a super
lightweight way to add stuff to an existing library. So, that’s an example of making programmers
more productive. Okay, and then the final
bit I’ll talk about now putting on my industry
hat for a bit, is this new type of system for machine learning at
industrial scale, which is machine-learning
platforms, and what’s really interesting
about this is basically, I think hundreds of companies are building these platforms. It’s a real need, with actual engineering teams
building these things, but there’s not a lot of research about it, and I think it’s an important system
to know about if you want to look at
the research in this field, there’s also a lot of unknowns
and ways to innovate here. So, basically the question
here is if you believe in machine learning will be
a key part of future products, what should be the software
development process for it. So, we see this in a lot
of the companies that I talk with for example at Databricks as well
as at Stanford. They said okay, we had one or two projects that
were successful with machine learning but
they took a lot of time and they were
expensive to do. How can I have hundreds of teams in my company using
machine learning? How can I make it not
just a one-off like sort of experiments but something that every team
can reliably use? And today unfortunately,
just machine learning out of the box is the development
for it is very ad-hoc. So, there are a few problems
that make this difficult. So, one problem is just
tracking experiments. So, when you do machine learning, you’re going to be running
many variations of your code. I’m sure in the
previous two talks, there were many versions
of the models that didn’t work, and then finally the ones that did work, and today, there’s
no first-class support for this in the systems where you do software development in general. You have a linear set of commits. You never have like
hundreds of branches that you need to track and
compare with each other. So, every data scientist has their own way of
managing experiments. It’s really important to reproduce the results
and it’s also very difficult because you have
dependencies on datasets, on code, on parameters, all over the place and again, there is no automatic way to
capture them. It’s also difficult to share and manage models; actually, for
some companies, just tracking
which model is deployed and where it came from is a non-trivial problem because there could be
thousands of models. So, we need sort
of the equivalent of software development
platforms if you think of something like building a web application or
a mobile application, there’s so much infrastructure
around testing, IDEs, load balancers, quality assurance that it’s
just there out of the box. You don’t need to worry
about it, but again, machine learning is
different because of the wide range of
parameters and experiments. So, many companies are actually building a new class
of system to do this that I’m just going to call ML platforms because that’s
what they named them. Some of the more prominent
examples that people wrote blog posts about are
the large web companies: Facebook’s FBLearner, Uber’s
Michelangelo, Google’s TFX, but there are also many of
these at smaller companies and also even when I talk to individual machine learning
grad students or researchers, they say, “Oh, yeah, I have a Python library that tries to record
experiments for me.” That’s what they’re doing.
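A homegrown tracker of the kind these researchers describe can be tiny. The following is a hypothetical sketch (the file name and record layout are invented, and this is not MLflow’s API), just to show the core idea of logging runs and querying them later:

```python
import json
import time
from pathlib import Path

# A minimal ad-hoc experiment tracker. Everything here (file name,
# record layout) is made up; real ML platforms offer far richer
# APIs for the same idea.
LOG = Path("experiments.jsonl")

def log_run(params, metrics):
    """Append one training run's parameters and metrics as a JSON line."""
    record = {"time": time.time(), "params": params, "metrics": metrics}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

def best_run(metric):
    """Scan all logged runs and return the one with the highest metric."""
    runs = [json.loads(line) for line in LOG.open()]
    return max(runs, key=lambda r: r["metrics"][metric])

# Example: log two hypothetical runs and query the better one.
log_run({"lr": 0.1, "depth": 4}, {"accuracy": 0.81})
log_run({"lr": 0.01, "depth": 8}, {"accuracy": 0.88})
print(best_run("accuracy")["params"])  # the lr=0.01 run wins
```

Real platforms add exactly what this sketch lacks: shared storage, model artifacts, reproducible environments, and deployment.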
So, these platforms, the way these usually
work is there’s an engineering team
that says, okay, here are the 10
algorithms that you can use for machine learning and
if you use one of these 10, we will handle deployment, we’ll handle learning experiments tuning parameters and so on, so we’ll handle
the whole life cycle. So, it’s good because
they standardized the life cycle you know that if you work within this platform, it will remember
what you did and let you deploy it and let
your whole backend stuff. But just having
everyone build their own is also disadvantageous. First of all, they’re
each limited to whatever frameworks or algorithms that company decided to support, and engineers have
to add any new ones, and second, they’re also tied to this internal infrastructure. So, you can’t really reuse and share these things across companies. So, one of the projects that I’m working
on at Databricks, is actually an Open Source, Open-Interface machine learning
platform that we launched back in June this year. It’s called MLflow. And I’ll just give you
a sense of what it can do, but you can read about this
and the other ones online, just to give you a sense
of the problems. So, one of the things that it
does is, by Open-Interface, I mean that it’s designed to work with whatever infrastructure you have and whatever algorithm, as opposed to like, “here are ten algorithms that we support.” So, one of the things it does is, it lets you package
reproducible projects. So basically, you can
declare what dependencies, what environment
your machine learning code needs in order to run, and you can also
declare parameters that other people from outside
can use to call it, without having to
know the details. And then in your code, you use these parameters
and you do your work. So, that means you can
package your project and run it again and get
the same results later. Second component it has,
is experiment tracking. In your code, you basically use this REST API to log information, like here are my parameters, here are my metrics, here is the model I built. There’s also a way of
representing a model. And then, MLflow basically
builds a giant database of all the experiments
that have been on for a specific goal, and you can go back and
re-run a specific one, or compare them, or pull out the models and do
something with them. And then, the third part
in this handle is deployment, which is, once I’ve got a model, it actually represents models
as basically a function, very similar to
a serverless function. And we can then deploy the same
model to many different inference tools, such as batch and streaming
inference. Basically, it’s a pretty simple
kind of workflow system, but it’s also something
that has a bunch of interesting data management
and systems challenges. How exactly should I represent
projects and dependencies? How should I represent the experiments? How
should I query them? I mean, I can imagine
queries like, “I had a thousand different
versions of a model, and I have a new data point, and I want to know which one
does best on it.” There are a lot of
interesting problems to look at within this space. And if you want to read about
this or the other ones, I think it’s an interesting
thing to look at, because again, people are
actually building these, like hundreds of companies have independently decided that
they need to build a platform, but I don’t really see, when I look at a database class
or something like that, I don’t actually see anyone talking about
this class of system. And there’s probably
a lot that can be done to make these things better. Many open questions left in
designing these platforms. Okay, so that was kind of my brief foray into what’s
happening in industry. To conclude, the limiting factors of machine learning adoption, are the development
and productionization tools for many users,
many real users, especially the ones with really high impact
business-use cases, and they’re not
the training algorithms. I think this is a really great opportunity
for researchers, including data
management researchers, because a lot of it
is a data problem, and it’s very unexplored. You can follow the research we’re doing in DAWN on our blog, and I’m also happy to
chat with anyone about problems in this space. Thanks.>>Of course, we
are running late, but let’s ask if you have
any questions for Matei, let’s take a few.>>Hey. Thanks for the talk.
I love everything.>>Yes.>>I have lots of
questions, but in the ML platform side
of things, right, I think especially as
we start applying this to optimizing systems, it becomes
particularly problematic, because all of a
sudden, SQL Server stops running correctly.>>All right, yes.>>Because something
somewhere happened. And now, I think that beyond
what you’re describing, there seems to be missing sort of the equivalent of
software engineering practice.>>Right, yes.>>In software engineering,
we have like “Find Bugs” and you know
“Code Reviews” and all sorts of practices that
kind of protect us from the worst mistake and socialize our understanding of what the
system is supposed to do. Do you have any ideas, beside
tracking the experiments, like how we get into.>>That’s a great question. Yes, that’s an awesome question. And yes, I think it
is missing and people are trying to come up
with best practices, so just as an example. So some of the things we want to enable with MLflow
is to tell people, “Hey, if you follow this process, write your code in a project, submit it this way, pull out the model this way, you will get the
following benefits.” Here are some examples
I think I’ve seen. One thing I’ve seen
at one large company, is they actually have
leader-boards for common tasks. Let’s say you’re building a better language model or
a better image classifier, there’s a single leader-board
in the organization, where you can see how
others have done over time, you can submit yours
and you can compare it. And that’s a way that many
people can work on a problem, and you can see how it’s doing. There’s a lot of
interesting stuff with testing. Is my model doing worse than, say, a naive linear model? Is my data drifting in some way? Are my predictions drifting? Is the distribution
of the features I see in production different? For example, a common thing
is actually you have a bug, where you didn’t load one of
your features into training, or it didn’t get
transformed the same way. That’s another practice. The other one, which is actually hard to talk about, is defining the metric you want to optimize and making it easy to swap metrics. One of the things that we
want to do in MLflow is, if you trained a bunch of models already and now you want to
change the evaluation metric, how can we easily let you
switch that and basically add a column to that table
and compare them on the new metric. Yes.>>Other questions? I don’t think we have the time that we had set aside for the panel, but this is related to Carlos’ question, and I would ask you, Andy, and Tim to comment on it if you have any thoughts. This is really the software engineering question, which comes up, for example, in the kind of work Andy and Tim are pursuing. Among old-fashioned database or systems people, there is always the fear that something will go wrong in a robustness sense. Yes, the average case is great, but surely we are going to be exposed to variance. This is possible in any system, but typically in a distributed system or a database system, there are guard rails. You know when you’ve tripped that fuse, and we basically say, “Okay, things are not going well.” Of course, you don’t want to trip the fuse all the time, and you need some discipline around it, but is this manageable in the world that we are entering? And this is also true when we’re building something like MLflow: is there any support we get from the system for understanding this or experimenting with it at large scale? This is the one question that I’d like all three of you to comment on briefly if you have any thoughts. Matei, you can go first, you’re right here at the podium.>>Yes. The question is how to have guardrails for
machine learning, or getting [inaudible] behavior. I think it’s a good question,
and I think so far, I’ve just seen people do
application-specific stuff. But basically, I think the most important
things I’ve seen are having a fallback, that is, maybe it gives
you lower quality results, but it’s robust and you can
predict stuff about it. And then also having some kind of monitoring that tells you if
things are going haywire. But the way you do that can
be different in each system. Yes, I don’t think
there’s a guarantee.>>I understand that we can go to a safer alternative.>>Yes.>>But detecting when we should do that.>>Yes.>>Is sometimes the most challenging part, right?>>Yes, it can be. I think, again, data drift is probably the easiest of
these things to capture. Even that is not super easy. But like saying, “Hey, my predictions look
super different, or my input data look
super different.” Yes.>>And I think it’s even harder. It’s not just falling back, it’s falling back gracefully, because if machine learning gives us something 20x better than where we were before, falling back to before is unacceptable.>>Yes, definitely.>>We should fall back smoothly.>>Yes. Just in case you didn’t hear, Carlos’ point was that just having a very step-function-like alternative to fall back on is probably not desirable; the system should have some way to degrade gracefully to the extent possible. Tim and Andy, if you have
thoughts, come up here. We’ll enable the microphones, unless you have given it up. And that will allow
you to speak here. Are you guys still tied up?>>I don’t know. Yes,
I’m still hooked up.>>Can you turn these both on?>>Sure. They’re on.>>Okay, so the only thing I would add to what Matei said, and I think he’s right, is that you have to have a clear objective function to determine when things are going bad. The other thing I would add is exploitation versus exploration: I think the algorithms should be mindful of what is expected of them at a given time of day, week, or month. That way, at nighttime, when your demand is low, you can be more aggressive about trying different things. But during the day, when you have SLAs you have to meet for your transactions or queries, you want to be a bit more conservative and not try something crazy. Maybe just not do any tuning at all, right.>>Tim?>>My perspective is
a little bit different, because we mainly
want to use machine learning for improving
the systems, which is a different domain
than if I want to build a model for
detecting images. And I think, for example, for the indexing case in particular, we are now looking into techniques where we can find bounds on the errors we can get in the worst case. There’s this whole field of work about smoothed analysis, which allows you to get guarantees for certain types of models. Then you can say, “Okay, in the worst case, I get the following behavior,” right? And, “My model also behaves like a B-tree.” The B-tree gives you a strong guarantee, because the moment it goes out of balance, you re-balance the whole thing to get the same lookup guarantee again. There are some machine learning models where you can do something similar. There’s hope. For the neural net stuff, there’s not much you can do right now. You need to restrict the types of models you use. In some domains, you can do that, and in some others, you can’t.>>Any thoughts, questions
from the audience? Well, thank you all. Thank you very much for
a very entertaining session.
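The guardrail pattern the panel converged on (monitor for drift, then degrade gracefully toward a simple fallback rather than switching abruptly) can be sketched as follows. This is a minimal illustration in plain Python; all names, the drift measure, and the thresholds are hypothetical, not from MLflow or any system discussed above.

```python
from statistics import mean

class GuardedModel:
    """Serve a learned model, but degrade gracefully toward a simple
    baseline as the input distribution drifts away from training."""

    def __init__(self, model, baseline, train_mean, soft=1.0, hard=3.0):
        self.model = model            # learned model (callable: x -> float)
        self.baseline = baseline      # robust fallback (callable: x -> float)
        self.train_mean = train_mean  # feature mean observed at training time
        self.soft = soft              # drift level where blending begins
        self.hard = hard              # drift level where only the baseline is trusted

    def drift(self, batch):
        # Crude drift signal: distance of the batch's feature mean from training.
        return abs(mean(batch) - self.train_mean)

    def predict(self, batch):
        d = self.drift(batch)
        # Weight on the learned model: 1.0 below `soft` drift, 0.0 above
        # `hard`, with a linear ramp in between, so the fallback is a
        # smooth transition rather than an abrupt step function.
        w = min(1.0, max(0.0, (self.hard - d) / (self.hard - self.soft)))
        return [w * self.model(x) + (1 - w) * self.baseline(x) for x in batch]

def learned(x):   # stand-in for a trained model
    return 2.0 * x

def baseline(x):  # stand-in for a safe, predictable fallback
    return 1.0

guard = GuardedModel(learned, baseline, train_mean=0.0)
print(guard.predict([0.5, -0.5]))  # drift 0: pure learned model -> [1.0, -1.0]
print(guard.predict([2.0, 2.0]))   # moderate drift: blended -> [2.5, 2.5]
print(guard.predict([5.0, 5.0]))   # heavy drift: pure baseline -> [1.0, 1.0]
```

In a real deployment the drift signal would be richer (per-feature statistics, prediction distributions, comparison against a held-out window), but the blending structure is what gives the graceful, rather than step-function, fallback the panel asked for.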
