Differential Privacy for Growing Databases

>>So, hello and welcome, everyone. It's my great pleasure to host Rachel Cummings. Rachel is a professor in the Industrial and Systems Engineering department at Georgia Tech. She works on differential privacy, game theory, and connections between the two. She's a recipient of multiple awards, including a [inaudible] graduate fellowship award and a doctoral prize award [inaudible]. So, she's going to tell us about differential privacy for growing databases.

>>Thank you. Thanks, guys, I'm very excited to be here. I was told this was a very casual environment, so feel free to jump in and ask questions; it's much more important that you understand than that I get through all the slides. I also have a mild speech impairment, so if I pause briefly, I promise I'm not just daydreaming during my talk. So, I'll talk about joint work with some colleagues and collaborators: Sara Krehbiel, a former Georgia Tech grad student who is now faculty at the University of Richmond, and Kevin Lai and Tao Tantipongpipat, current grad students at Georgia Tech. I'll be talking today about how we can extend privacy guarantees to the case where our database is going to be growing and changing over time.
What do I mean by that? In the olden days, data science meant that you collected data once, you analyzed it, and then you went home and that was the end of it. This looked like the census, which gets new data every ten years, or a survey that was mailed out, where you had to first build the survey, mail it out, and then wait three months to get the responses back. Or, if you were very fancy, maybe there was a call center and you had to train people to call and ask very specific questions; they collected the data, and you analyzed it at the end.

Modern data science looks very different. It happens online and it's changing all the time. We're now collecting everyone's health information, their tweets, their social behavior online, and their maps data, so we have to do continuous analysis of a growing database, because we're acquiring more data over time. We also can't wait until we accumulate all the tweets in the world and then decide what's trending; I have to tell you, at every moment, about the current database, the current tweets, and the current trends. So, we need new analysis tools for the case of growing databases. But in these examples, it's also highly sensitive data: this is your health information, your social data, your GPS data. So we also need some kind of privacy guarantee, because we're now collecting much more sensitive, much more personal information about people.
So, some background. Differential privacy was defined in 2006 by Dwork, McSherry, Nissim, and Smith, and it informally bounds the maximum amount that one person's data can change the output of some computation performed on a very large database. Formally it says this: an algorithm M that maps from an n-tuple of types, so think of that as data collected from n different people, to an arbitrary range is epsilon-differentially private if, for all neighboring databases, meaning databases that are the same except for one person's data, it outputs approximately the same thing. By approximately, I mean we compare the probability that the output lands in any set S on one database with the probability that it lands in S on the neighboring database, and these should be multiplicatively close, up to this e-to-the-epsilon factor.

In pictures it looks like this: I have some blue database and it produces some blue curve over there; I then change person i's data from t_i to t_i prime, and I get the red curve instead, and these two curves are close. So I can bound the impact of person i's data. Really I'm saying: whether you provide your data or you don't, whether you're truthful or you lie, I'm going to learn about the same thing anyway, so your portion of my dataset has a very small effect on the thing that I learn at the end of the day.

Think of this set S as the collection of all bad outcomes. Maybe I'm going to analyze your health data to decide insurance rates and decide whether smoking causes lung cancer. This is saying that your insurance price will rise with about the same probability whether you share your data or not. So think of S as the collection of all the bad outcomes that may happen as a result of the analysis of this data; we're saying those things happen with about the same probability anyway, so I'm really not relying on your data. It's also a very strong, worst-case guarantee: it's worst-case over all pairs of neighboring databases and over all collections of possible bad outcomes, and so this is very, very strong. Any questions so far?
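For reference, the definition on the slide is the standard one; written out:

```latex
% Epsilon-differential privacy (Dwork, McSherry, Nissim, Smith 2006).
% M : T^n -> R is a randomized algorithm; D and D' are neighboring databases,
% i.e. they differ in at most one person's entry t_i.
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S]
  \qquad \text{for all neighbors } D \sim D' \text{ and all } S \subseteq \mathrm{Range}(M).
\]
```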
Okay. There are a couple of nice properties that make differential privacy desirable as a tool for designing algorithms. First, it's robust to post-processing. If I have some differentially private algorithm M and I apply any other function to its output, the post-processed result still has the same differential privacy guarantee. So really this says no adversary can take the output of my differentially private algorithm, go into a corner, think really hard, run some additional computations on it, and then reverse engineer my data. I'm promising that no adversary can break this privacy guarantee. Note also that there are no requirements on the computational power of the adversary; rather, this is an information-theoretic guarantee.
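Stated compactly, this is the standard post-processing property:

```latex
% Post-processing: if M is epsilon-DP and f is any (possibly randomized) function
% that never touches the raw data, then the composition f o M is still epsilon-DP.
\[
  M \text{ is } \varepsilon\text{-differentially private}
  \;\Longrightarrow\;
  f \circ M \text{ is } \varepsilon\text{-differentially private}.
\]
```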
It also composes adaptively. This says that if I have k different private mechanisms, each epsilon_i-differentially private, and I compose all k of them, then I can just add up my epsilons and that gives me my final privacy guarantee. So if I want to run a whole bunch of different epsilon-differentially private computations on the same data, the epsilons simply add up. There are more advanced versions of this that I won't include today; for example, the guarantee still holds if you choose the M_i's adaptively, where you first run M_1, see its output, and then decide which M_2 to run next.

There's also a relaxation where you have an additional delta term as an additive slack: I have the multiplicative parameter epsilon and an additive slack delta. I won't talk about it much today, but there will be some brief mentions in passing of this (epsilon, delta) version, so just keep in mind that the delta is an additive slack term. There is also a more advanced composition theorem where the epsilons don't just add up; instead the total looks like the L_2 norm of the epsilons, so you get much better composition in the epsilons at the cost of an additive loss in the delta. So you can do better than this, but this is the simplest version.
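As a quick reference, the basic and advanced composition bounds mentioned here are the standard ones (the advanced bound is usually stated as follows):

```latex
% Basic composition: running M_1, ..., M_k on the same data, each eps_i-DP,
% yields a mechanism that is (sum of the eps_i)-DP.
\[
  \varepsilon_{\text{total}} \;=\; \sum_{i=1}^{k} \varepsilon_i .
\]
% Advanced composition: k runs of an (eps, delta)-DP mechanism are
% (eps', k*delta + delta')-DP for any delta' > 0, where
\[
  \varepsilon' \;=\; \sqrt{2k \ln(1/\delta')}\;\varepsilon \;+\; k\,\varepsilon\,(e^{\varepsilon}-1).
\]
```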
Okay. So, back to why differential privacy is great. Because we have this
parameterized privacy notion, I can now move smoothly between informational extremes. I no longer have to say that I'm either going to give you my data or not; I can actually quantify, through my epsilon, how much information I'm leaking by performing this particular computation on my dataset. We saw that it has these post-processing and composition guarantees, which means that if you want to design differentially private algorithms, you only have to reason about very simple private subroutines, and then you can just add up the epsilons in your entire larger, more complicated algorithm. It's being used in practice and in theory; it's being used here, and thanks, guys, for inviting me. It's being used in practice by Apple, Google, Microsoft, Uber, and the Census Bureau. We now have a rich literature of differentially private algorithms that are both private and have some usefulness guarantee. So it all seems great.

The problem is that nearly everything that's been done operates on a static database. We have great tools for static databases; we have virtually nothing for growing databases. Previous work only allowed us to count bits in a stream, and that's all we could do privately. So today we're going to do a lot more privately on a growing database. First, we'll see a very specific tool for answering adaptive linear queries, based on the private multiplicative weights algorithm, and we'll see why it can't just extend naturally to the growing database case. We'll show how to modify it to make it work, and then we'll talk about a much more general-purpose result: how we can extend existing accuracy and privacy guarantees from the static case to the growing environment. Finally, this is a first paper on this topic, and I think there are lots of other types of growth to be analyzed, so I'll talk more at the end about questions raised by this work.
Any more questions before I jump in? Cool. Okay, so the first part is on private multiplicative weights for growing databases. This is a great algorithm for answering exponentially many adaptively chosen queries, so it's a super powerful tool, but it doesn't naturally extend to growing databases. Just one slide of preliminaries first.

For this part we're going to store our database as a histogram, so think of a database as a vector where the i-th entry is the fraction of the entries in the database that are of data type i. There are N total types of data, and I'm just counting how many of each type I've seen. For example, if this is binary data and I want to count whether or not you have a disease, there are two types, you're either a plus or a minus, and that's it, so this is a vector with two entries. Maybe I want to count how many people there are of each age, and then N is maybe 120, if I'm being generous. Things like that.

In this case we're going to think about answering linear queries, where each query is also described as a vector that assigns a weight to each type of data x_i, and we assume these weights are bounded between zero and one. An important thing that will follow us through the talk is the sensitivity of a real-valued query, which is the maximum change in that query f if I change one person's data: in the worst case over all neighboring databases, how much can the value of this function f change? For these query classes, Delta f will be of size one over n for a database of size n.
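As a small illustration of these preliminaries (an illustrative sketch of my own, not code from the paper), a histogram database, a linear query, and its sensitivity look roughly like this:

```python
import numpy as np

# A database over a universe of N data types, stored as a normalized histogram:
# x[i] = fraction of the n records that are of type i.
def histogram(records, N):
    counts = np.bincount(records, minlength=N).astype(float)
    return counts / len(records)

# A linear query is a weight vector f in [0, 1]^N; its value on the database is <f, x>.
def linear_query(f, x):
    return float(np.dot(f, x))

# Changing one of the n records moves 1/n of the histogram's mass from one bin
# to another, so a linear query changes by at most 1/n: sensitivity Delta f = 1/n.
records = np.array([0, 2, 2, 1])      # n = 4 records over N = 3 types
x = histogram(records, N=3)           # [0.25, 0.25, 0.5]
f = np.array([1.0, 0.0, 1.0])         # "what fraction are of type 0 or type 2?"
print(linear_query(f, x))             # 0.75; here the sensitivity is 1/4
```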
Finally, we want algorithms that are accurate, and our accuracy notion is (alpha, beta)-accuracy. We say that an algorithm M is (alpha, beta)-accurate with respect to a query class F, so here think of the class of all linear queries, if for any database, the answers produced by my algorithm are within alpha of the true answers for all queries in that class, with high probability.
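Written out, this is the usual (alpha, beta)-accuracy notion:

```latex
% M is (alpha, beta)-accurate for a query class F if, for every database x,
\[
  \Pr\Big[\, |M(f) - f(x)| \le \alpha \ \text{ for all } f \in F \,\Big] \;\ge\; 1 - \beta .
\]
```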
>>Your definition of linear queries looks like you're not normalizing, but the sensitivity looks like you are. So you are?
>>Yeah, she's normalizing.
>>Yes, I am normalizing.
>>Yeah, [inaudible] where it's one.
>>Yeah, and in fact it's important here that it is normalized. So, yes, thank you.
>>Okay.
>>There's a slight abuse of notation here: I'm imagining that I can apply the query to the output produced by my mechanism, so just grant me that for ease of notation.

Okay. So, here's the PMW algorithm. It's by Hardt and Rothblum from 2010, and it answers exponentially many adaptively chosen linear queries. It works by maintaining a public histogram y, which contains all the information that we know so far at each point in the algorithm. Queries arrive online, and the algorithm first decides: is my y so far pretty good at providing an answer to this query? If the answer is yes, then it classifies the query as easy, and I can just output f(y) without touching my true database x. If not, then it classifies f as hard, outputs a noisy answer to f, and then updates y accordingly. The key idea is that you only incur significant privacy loss on the hard queries, because for the easy ones I really just have to check whether my answer on y is close to my answer on the true database, so I'm not using much information about my database. The work is in showing that, as time goes on, this y converges to x, so I see fewer and fewer hard queries over time, and therefore I'm not accumulating very much privacy loss overall.

In pictures, here's how this works. I have a public histogram y; say it starts out uniform. Then I get a query, say this is my query f_1, and here is the vector corresponding to f_1. Imagine that this query over-predicts, so its value is larger on y than on x. It's going to be classified as a hard query, and I'm going to down-weight y according to this vector: for example, because the weight on types one and two is too large, I decrease the weight placed on those types and then normalize, which pushes more weight onto type four. The opposite can happen for a hard query too: maybe my f_1 looks like this and I under-predicted, and then I do the same multiplicative weights update in the other direction to adjust my public histogram y. The algorithm just takes in queries over time, and whenever it sees a hard query, it
performs this update.
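To make the easy/hard mechanics concrete, here is a heavily simplified sketch of one round of this style of mechanism. This is my own illustrative code with made-up noise scales and learning rate, not the actual algorithm or constants from Hardt and Rothblum:

```python
import numpy as np

rng = np.random.default_rng(0)

def pmw_round(f, x, y, n, alpha, eps_round, eta=0.5):
    """One illustrative round of a PMW-style mechanism.

    f: linear query in [0,1]^N, x: true histogram, y: public histogram,
    n: database size (so the query's sensitivity is 1/n),
    alpha: accuracy target, eps_round: privacy budget spent this round,
    eta: multiplicative-weights learning rate.
    Returns (answer, updated public histogram, whether the query was hard).
    """
    true_ans = float(np.dot(f, x))
    public_ans = float(np.dot(f, y))
    # Noisily test whether y already answers f well (the real algorithm does this
    # with sparse-vector-style noise on the threshold as well).
    noisy_gap = abs(true_ans - public_ans) + rng.laplace(scale=1.0 / (eps_round * n))
    if noisy_gap <= alpha:
        return public_ans, y, False               # easy: answer from y, no update
    # Hard: release a Laplace-noised true answer and move y toward x.
    answer = true_ans + rng.laplace(scale=1.0 / (eps_round * n))
    if public_ans > answer:                        # y over-predicts: down-weight f's support
        y = y * np.exp(-eta * f)
    else:                                          # y under-predicts: up-weight f's support
        y = y * np.exp(eta * f)
    y = y / y.sum()                                # re-normalize back to a histogram
    return answer, y, True
```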
The guarantee of the static algorithm is as follows. It takes in a database, a query class, a privacy parameter, accuracy parameters, and the size of the database, and it is epsilon-differentially private, and it's also accurate for K queries and for this alpha. The important thing to note is that there's this one-third power, because that's what's going to change in our extension of- oh, actually, now that I think about it. Yeah?
>>What does the alpha in "accurate" mean?
>>Alpha is the additive error.
>>Yeah, I already defined that one. Okay.
>>Yeah. The analysis relies on a potential argument that says, well, eventually my y converges to my x, so these things are going to be very close. I have to show where this entropy starts, that it goes down fast as I answer more hard queries, and that it's bounded below; it's a standard potential argument. For the fixed database case it looks like this. The potential starts at most log N, because the algorithm initializes y to the uniform histogram. We can also quantify, as a function of the learning rate, how quickly this potential decreases, and it turns out that for every hard query it goes down by this term alpha squared over four, if you tune the learning rate optimally, and it's bounded below by zero. So great. (What happens if the database is growing, do these things still hold? We'll see shortly.) In the static case, this gives at most four log N over alpha squared hard queries, and then just by composition, I can guarantee that my epsilon loss is going to be small.
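Spelled out, the counting behind that hard-query bound is just this, using the relative entropy between the true and public histograms as the potential:

```latex
% Potential: Phi_t = RE(x || y_t) = sum_i x_i log(x_i / y_{t,i}).
% Start:  Phi_0 <= log N          (y_0 is uniform over the N data types).
% Drop:   each hard query decreases Phi by at least alpha^2 / 4
%         (with the optimally tuned learning rate).
% Floor:  Phi_t >= 0.
\[
  \#\{\text{hard queries}\}
  \;\le\; \frac{\Phi_0}{\alpha^2 / 4}
  \;\le\; \frac{4 \log N}{\alpha^2}.
\]
```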
>>In terms of privacy loss, this theorem is stating that instead of N-epsilon privacy loss, you're only getting log-N-epsilon privacy loss?
>>Yes.
>>Yeah.
>>This N is the size of the data universe. But if you look at it, there's also this alpha squared term here.
>>There's a log K there.
>>Oh yes, yeah. Good, yeah.
>>So you can actually, okay.
>>Yeah.
>>The composition theorem would give you K epsilon, but now you're saying log K.
>>Yes, exactly. This is why it allows you to answer exponentially many queries: once alpha hits one, you're not really learning anything anymore, so I can have a K that is exponential in my N before my accuracy loss gets too large. The real crux is that you don't actually pay for every query, because you only answer a small number of them on the true data.
>>In fact, it's a property of the multiplicative weights update algorithm.
>>Yeah. Okay, so this
all works for a static database. It's very powerful, this is an amazing algorithm, but it doesn't work if you allow more data to arrive. So we have this picture of our database of size n, and then let's say new data arrive. Well, my entropy can go back up as I add more entries, and in fact that's going to be the major problem. We can still show that the potential goes down a significant amount for these hard queries, and that it's still bounded below, but the size of this increase can be a really big deal, and in fact it can look like this: as your database doubles in size, you can have this log N over alpha growth, which is going to be- yeah?
>>Just about the previous theorem, to check: at the end, there are almost no hard queries because the public database has revealed the private database? All the secrets are out, there are no more secrets?
>>That's one way of looking at it. The other way of looking at it is that I have found a database that shares the same statistical properties as my original database, but may or may not be exactly the same. So it enables you to answer all the queries in the query class that you care about, but I'm not guaranteeing that it's the same database.
>>You said the synthetic database is good for the static query class, but it may be different from the true one.
>>Yeah.
>>The entropy- if the relative entropy to x is small, then the databases are really very similar.
>>Yeah. In fact, at a certain point, once this many hard queries have been answered on a static database, you can say, "I don't even need to run this anymore, because I know that with high probability my public histogram y is going to be good at answering all the queries in my class," and I can then just publish that as synthetic data. That's just post-processing, and because it's post-processing of the output of some differentially private algorithm, it's still differentially private itself.
>>Or actually [inaudible].
>>Yeah, correct. So this growth can be really large, and that's a bad thing. It's sort of- yes?
>>What is the model of this growth? Do batches of entries arrive, or?
>>Yes, good question. Here, the queries arrive online and the data points can also arrive online. What we're showing here is the really bad case, where I have used up a bunch of my hard queries and then a whole bunch of new data arrive, and it's then going to take me a whole bunch more hard ones to get back to the same place I was before. Because you start with the entropy higher than it was before, even beyond this alpha term, this means that you can have an unbounded number of hard queries. So I can no longer make the claim that, because I have this bounded hard-query budget, at some point I'm going to be done and I can say that I'm accurate. So really think of- yeah, yeah?
>>So it's because you're only looking at adding. If I allow some multiplicative slack then it shouldn't be a problem.
>>Yes, depending on-.
>>Like, if I only care about a one-plus-epsilon multiplicative factor, plus something like whatever. Because you're saying that at some point, if I insert too many entries, then the histogram can completely change.
>>Yeah.
>>You can basically start all over again.
>>Yeah, yeah. So what happens is: imagine there's a starting point in the middle, and imagine we project the space of all databases I could have onto a 2D plane. I started in the middle at the uniform distribution, but the true x is over here, so I slowly move over here through the process of a lot of hard queries, and then some adversary can dramatically change my x to way over here. So I have to do a whole bunch more hard queries, more than it even took to get there in the first place. This can go back and forth basically forever.

So we fix it. Really, as soon as I get new data, I'm going to re-weight my y to be closer to the uniform distribution. So in this example, instead of sitting over here, having x move over there, and then having to answer a whole bunch of hard queries because my database grew that dramatically, I automatically shift back toward the center, or close-ish, and then start moving through hard queries from there. Concretely, I take my current y and re-normalize: imagine that I have n points and I gain one new point; then I re-normalize y to a convex combination of my old y and the uniform distribution. This bounds the potential growth in my entropy. So we're good again.
that is like PMWG, where here G is for growing is really the exact
same thing as PMW, only we do this update
as soon as we receive new data entries and because the update doesn’t depend on the content
of the data, they don’t incur any privacy
loss from doing that.>>Sorry. There’s
this N. What is that?>>That’s the size of
the data universe.>>What is the n?>>That’s the size of
the starting database. So it says that I can be epsilon differential private
alpha-beta accurate. For this alpha, at each time T, and before I had
a fixed query budget, I could only answer K questions. But I can now enter
this thing that is growing exponentially in time. So as my database grows, I can now answer
exponentially more queries by harnessing the new data
that have arrived. There’s also
asymptotically no loss in accuracy compared
to the static case. We have a constant that’s
too ugly to print. So there is a constant loss but asymptotically, it’s fine. In the paper, we send this into epsilon delta
differential privacy and through these
stronger composition, it guarantees we can in fact improve alpha at
some small loss in our privacy.>>So the other thing you make to the public histogram that’s
completely invariant of what the new data is because
you’re only answering HARD queries or things where
the sensitivity is bounded.>>Yes.>>Okay.>>Yeah. It’s actually
critical that the update doesn’t depend on the data
because if you looked at the data and then
decided how to update.>>Yeah.>>Now, count towards
your privacy budget.>>Got it.>>Yeah.>>Clarifying question. So the algorithm
description at least, it seemed like the only source of randomness was coming in
from the query stream, I mean the kinds of
queries that are coming to the algorithm. Algorithm
is deterministic.>>Yes. I lied a
little bit there.>>Okay.>>In fact, I’ll point out where. But that’s a great catch.>>No, you lied noise,
I can show that. So, this probably said
how you add noise here, which is that you
provide a noisy answer to the query if it’s HARD. The part that I lied about it is that you
actually have to decide in a privacy preserving way whether a query is going to
be Easy or Hard. So, you have to do this randomly, and so you’re going to have some laws because
there’s going to be some queries that you say, “They’re EASY” and then you just publish like
a wildly wrong answer, but that’s a low
probability event. It’s also going to be
times we have a query that you’re correct on, you’re going to burn
through your HARD query budget incorrectly, but again that’s
a low-probability event.>>We didn’t lie,
just added noises.>>There you go. It was
a differential private truth. Tell that to our president.
Okay. Next slide. All right. So, that was
all that I had for this. Are there any questions
on this before I move on?>>It’s 30 plus
but then more it’s times this and this K
exponential root, and this is what you lose extra?>>That’s what we gain, an extra. Yes, the loss is
really just in it’s constant and you gain the ability to answer
more queries and the trick-.>>Can you go back to the-.>>Yes.>>Okay.>>The original results?>>Yes.>>This K. There’s
was no both of them? This has no constant.
So, going from K to K square root or?>>Exponential is true.>>Yeah.>>I mean there’s definitely
some constant K, right?>>Yeah. I think
the constant is-.>>Small.>>Yeah, it’s like
eight or something very reasonable like that. Yeah.>>What’s your result?>>My result is-.>>Do you have the sight?>>Yeah. All right.>>So, what is T?>>What was T?>>Capital T.>>Yeah, yeah for all times T.>>So, at timestamp, you
get K into the routine.>>Yes. These end here are a
bit misleading because we’re adjusting for this because like this K is
sort of incorporating some other things into
here and so the end is actually going to bounce
out to be the same, but it looks misleading. So, in fact this says, you can now have like
on every timestamp T.>>So, if I want to instantiate this static database
what should I set to T? Just HARD time.>>So, then you would
just set T equals N and then you would run
it once and you’re done. So, here we’re going to start at time N for more
database is of size N.>>I see.>>Yeah.>>This is not epsilon-delta?>>Right. We have
a different version that is epsilon-delta. Actually, the algorithm
is the same, but just the analysis of
it is different through these different
composition theorems.>>The relation to
the improvement you get when you relax it. epsilon-delta like is there
an even better privacy loss?>>I think it’s something like this K becomes a square
root K, something like that. You have a weaker privacy
guarantee and then constants.>>Where it is quantified
how the database is growing?>>I thing that is
what she means.>>Yeah. So, imagine that I get one new data point to
each time and so at time T, I have a database of size T.>>In each time you were
saying, yeah T square.>>So, we compare T to K to
the root 10. You’re right.>>So, you lose. I
mean their result is K like T to the power of T, exponential in T, but you
lose square root of T so it’s likely less than
what they could have.>>I mean they just have
like a fixed constant T.>>All right.>>Yeah. I’m sorry, they
have a fixed constant of-.>>It’s still a constant,
but I’m saying is you lose, like Morris result,
I can think of Ke to the power of T. But you
lose e to the power T squared. Because they can answer
exponential queries.>>At the same database size.>>Yeah.>>Same size, N>>You’re saying, we have
to be aware of that. So, what that says that
with error of one percent, you can answer
exponentially made queries.>>Yeah, that makes sense.>>Yeah. So, let’s take it offline because I have a part
two, let’s talk about it. Cool. So, now let’s talk
about more general things. It was great that we that we can expand this very
powerful private algorithm, but I might want to do
more things than just answer adaptive linear queries. So, in this part, we’re going to talk about
a general framework for being able to reduce in the dynamic setting
to the statics setting. So, really I’m saying I’m going
to take it as a black box some private accurate
static algorithm and they then want to use that as a subroutine in the case of a growing database to
solve the same problem. So, here’s the easy K’s. Maybe I have like N data points arrive and then I have K queries arrive and I want to answer these K queries on this database. Then it’s really easy
to see when I should run my static algorithm, it should be after I
have enough new data and enough new queries and I’m going to plug that in to my favorite algorithm
and rerun it. Great. But that’s not how
data growth actually looks. Maybe your data arrived in this very bursty way and perhaps you’re
like queries due to. It’s like at what point should a rerun my seems
static algorithm as a function of how I
grow my database, and of my queries
arriving online. I’ll say again at time T, think of a database of
size T. I’m going to have like one new point arriving
at each time step. So, my T is going to count both my time and
the size of my database. So, this is the same accuracy
definition as before, but I now need to define
this somewhat strange thing. That’s going to be important
in the in any analysis. As a PG black-box and he really just going
to like I decompose my alpha guarantee into
something that depends on Beta Epsilon and
N and raised to some power p and then there’s just gonna be
some garbage out front. I think of this as some like our
problem-specific parameters. Maybe it’s a like
dimension of my data. Maybe it’s my constants. Maybe it’s some like you
know convexity parameter. But there’s going to be
solid like things out front. Our accuracy guarantees
are going to depend on on this power P and
so that’s why it’s important that we pull that out. So, if you see a PG black box, think of I’m dislike
a partitioning my Alpha into things
that depend on Beta, Epsilon and N and then some other like our
problems specific parameters. To make this more concrete, before I tell you
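Schematically, that decomposition has this shape (this is just the form described here, not the paper's exact statement):

```latex
% A (p, g)-style black box: the static algorithm's error decomposes as
%   alpha  =  g(problem-specific parameters) * ( h(beta, epsilon, n) )^p ,
% where h collects the dependence on beta, epsilon, and the database size n.
% The SmallDB instantiation discussed next has power p = 1/3.
\[
  \alpha \;=\; g \cdot \big( h(\beta, \varepsilon, n) \big)^{\,p}.
\]
```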
To make this more concrete, before I give you the general result, I'll show you one PG black-box and how you can reschedule runs of it as a function of your data growth. We have the SmallDB algorithm, by Blum, Ligett, and Roth from 2008, which is a differentially
private algorithm for creating synthetic data. It takes in a database x of size n, a data universe of size N, and a query class F, and it produces a database y whose size depends on the log of the query class size over my desired alpha squared. It samples y with a bias towards databases that closely approximate my true database in terms of the query class F. So, think of this as a non-adaptive version of the PMW algorithm, where I have to specify up front the query class that I care about, and I then publish some y that gives approximately correct answers on that class with high probability. And here's the SmallDB theorem: it is epsilon-differentially private, and it's (alpha, beta)-accurate for this alpha. This is pretty ugly, but let's decompose it into our PG black-box. P is our power: we have this term depending on beta and epsilon raised to the power one-third, so my power p is one-third, and then I have this other stuff; the constant turns out to be 64, up to the cube root. So I have a power p and some other stuff that depends on my problem, and that's the PG black-box for these parameters. So, let's see how we might rerun this at various times as
my database grows. Really, I'm just going to rerun the same static algorithm at different times: at times t_i = (1 + gamma)^i * n, for some gamma less than one, and each time I produce a database y_i. What's happening here? I'm exponentially delaying the runs of my static algorithm, and this parameter gamma tells me the base of that exponent. If gamma is one, I rerun every time my database doubles; as gamma shrinks toward zero, the reruns become more frequent. So this gamma tunes how quickly I rerun the algorithm as a function of my growth. I produce this stream of outputs y_i, and I use the current y_i to respond to all the queries that arrive until I rerun SmallDB the next time. The key is this exponential delay and the optimal tuning of gamma.
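As a sketch of this rescheduling idea (illustrative code with my own naming; the real algorithm also splits the privacy and failure-probability budgets across the runs, which I elide here):

```python
import math

def rerun_times(n, gamma, horizon):
    """Times at which to rerun the static black box: t_i = (1 + gamma)^i * n."""
    times, i = [], 1
    while True:
        t = math.ceil((1 + gamma) ** i * n)
        if t > horizon:
            return times
        times.append(t)
        i += 1

def answer_growing(static_blackbox, stream, n, gamma, horizon):
    """Answer queries from a public object y, rebuilding y on the schedule above.

    static_blackbox(db) is any private, accurate static algorithm (e.g. SmallDB)
    that returns a synthetic database or summary y; stream yields tuples
    (t, db_so_far, query), where query is a function we can evaluate on y.
    """
    schedule = rerun_times(n, gamma, horizon)
    y = None
    for t, db, query in stream:
        if y is None or (schedule and t >= schedule[0]):
            y = static_blackbox(db)              # exponentially delayed rerun
            while schedule and t >= schedule[0]:
                schedule.pop(0)
        yield query(y)                            # every query is answered from y
```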
So, doing this allows us to work in the growing database setting, and things are basically
the same as before, but I now have
this one-fifth here instead of a one-third power before. So, this is a concrete example. Who wants to see
the real thing? Cool. No one. Yes. Booo. Okay.
Huh? Lock the doors. Okay. So here's the whole thing. I won't walk you through each step, but the key idea is that I'm going to take a PG black-box. I'm going to rerun it at times that are
exponentially spaced out. I’m going to get
some accuracy guarantee that has this loss, one over two p plus one in power compared to the same problem
in static setting. Because before, I had
this PG black-box, and so it was G times
this stuff to the power, p and I just lost in the
power of my accuracy. That’s it. Okay. So, these
things, what these all mean? So, the gamma is
a base of my exponent, and nodes that it is
independent of the data growth, and so this is just
fixed over time. This is the exponent and
the delay of how often I should rerun by fixed private
accurate static algorithm. The Epsilons, we tune them
at each stage to make sure that our composition
gives us that adds up to our overall
Epsilon guarantee. So, we have some desired
Epsilon and I just tune this to make sure that
if I sum for all times, this is going to converge
to be just Epsilon. The alpha at each time, t_i, is going to be this. This is just
the accuracy guarantee. You have a static algorithm. So, notice if I have
a PG black-box, this is just the alpha that would happen naturally if I run
the algorithm at time, t_i, and I want to do the same thing for
my betas as I’m going to have this independent failure
probability at each round. I just want my Thetas to
converge to some overall data, so that I can take a union bound over their
failure probability. Although there’s a lot
of things happening, it’s actually very simple. I’m taking a fixed
static algorithm and I’m just going to rerun it at times that are appropriately spaced out in terms
of my data growth. Some extensions of this in
the paper are that we give this Epsilon Delta
differentiate private version. So, here again, we have
this improved accuracy at the cost of a slightly
weaker privacy guarantee. We also give
a different algorithm in the same spirit that now has an accuracy guarantee
that improves over time. So, this says that my alpha
is not just some constant, but in fact, my accuracy is going to approach perfect as
my database grows large. So, this is appropriate when you think of problems where your data are sampled i.i.d., and you expect that with an arbitrarily large sample you should be doing arbitrarily well. So, this notion is
more appropriate for those type of problems. This alpha, T, starts out weaker than the fixed
accuracy guarantee, but it becomes stronger once the database roughly
squares in size. We also extend how we apply
this algorithm to the case of empirical risk minimization
where it makes sense that if you
have a larger sample, then you should be doing better. What time am I supposed to end?>>1:30.>>Okay. Okay, So then, I will quickly just
state the results here. I’m not going to give
the ERM background. So, if you know what
this will be cool, if you don’T, then you can
take a nap for 30 seconds.>>So, why is this awards like the high level intuition of the previous result, like is it->>For this one?>>The other one. The theorem is
stated like you’re-.>>This.>>Yeah.>>Yeah.>>For, I use it->>So, the takeaway is that, is that really the loss and
the accuracy for extending a static algorithm to the growing case just
depends on this power. And so this is like
the only loss that you get. It just depends on the original alpha
that you start with. This can also go on forever. So, here as long as
your database continues to grow, then you can continue to
answer queries about it. And you can still
maintain the same->>Because the way I understand your algorithm is
that you are running this black-box that you covered them at certain chosen
intervals, right? Once you fix gamma,
these intervals are also like opposite, like maybe exponentially
growing size.>>Yeah.>>Then are you just going to apply composition
on it start like, apply competition on it or is it no?>>Yeah, yeah, yeah because the static algorithm says
pull up high probability. I’m going to be
approximately accurate. And so as long as you
can take a union bound over those high probability
events forever, then you’ll still have
the same high probability event. They’re going to be
correct for all queries forever as long as you continue
to accumulate more data.>>Sure. Okay.>>Yeah, yeah, and also assumes that we know nothing about how
this algorithm works. They don’T have to know
specifics about it. I’d just say, hand
me something that is accurate and private, and boom, it now works
in a static setting. Yeah.>>What if your database is
also growing and shrinking? If there’s data points that
are now being removed, this is like
a real world constraint that we see sometimes, right? Is that flexible? Like if you gained 10 data
points but you lose three.>>Maybe you can let
her finish her talk.>>Yeah.>>I’m sorry.>>Yeah. But that’s actually a very good question
and in two slides, I will talk about that
to my open questions. Because that’s very, very real and we acknowledge that this is one very specific type of growth, but there’s lots of other types of changes that may happen.>>Sorry to interrupt.>>Yeah. Sure. That’s fine. Anyways, this is my last
like technical slide. So, I’ll just sort of
put things up there. So, for applying this algorithm where the accuracy guarantee
improves over time. Here’s what we get for
our empirical risk. So, here we have to have Epsilon Delta
differential privacy to get meaningful composition. So, if you have D-dimensional samples
with high probability for all times T, you’re going to
output a classifier that has excess
empirical risk of this. And comparing that to the static case for had
Epsilon differential privacy, and this is our black box that we use because we apply this and we just rerun this
every single time. So, our like dependence on
D and beta is the same, if we have some loss
because of our delta. How we have on the bottom here, we have like root T
compared to their T, and that's the only loss. So, this says that if you want to do ERM privately on
a growing database, this works. Pick your favorite
static ERM algorithm and it will work in
this environment too. And of course, like once
your T grows large, then you’re going to
be doing very well. In the paper we also get stronger bounds if you have more assumptions on
the loss function like, if it’s strongly convex then
you can do better as well. Okay. So, the open questions and I think
that’s actually really important because this is one very particular
type of growth, but there’s a lot of other practical types of
growth as you pointed out. Here are various questions
that I think can follow from this that are
basically completely open. How can you do better if you have additional assumptions
on the data? So, here we assumed adversarial data growth
in the worst-case, maybe data points are
well sampled IID. Then, you should be
able to do better. How much better? I don’t know. Maybe they aren’t
quite stable with IID, but they’re sampled from some smoothly changing
distribution. So, things that you have
learned in the past are still like mostly correct and you may want to adapt
them a little bit more to accommodate for
these smooth changes. How should you do
that? Don’t know. You also might want to identify when you have these like dramatic
changes in your data. So, I have a follow-up paper
with myself, Sara Krehbiel, Yajun Mei, Rui Tuo, and Wanrong Zhang, where we do differentially private change-point detection. It's saying, "I
want to identify. Whoa, the samples I’m seeing now are dramatically different from things I saw in the past.” So, this sort of like pairs
nicely with this how it’s saying it’s not
changing smoothly, but it’s changing once
and quite dramatically. You may also want to do
some sort of like AB testing. As your points arrive online, you want to privately say, “I think my A treatment and B treatment are the
same or they’re not.” It’s also like lots
of other types of growth that may happen. For example, he may want to
have sound like changing data points that
can churn or can be deleted or like removed, and that is not all
accounted for in this paper, but that’s a very practical
and very real type of type of change that we see. There’s also this case of
horizontal data growth. I mean, if you view your data as a matrix, where I have a row for each person: maybe I'm not just
accumulating new rows, as in this paper, but I’m accumulating new columns
about the same people. That’s also not
accounted for here, but there’s also like very
real and very practical. So, I think these questions are completely open and
very important in terms of how to do them
privately. That’s all, thank you.>>So, can we have
more questions if you guys want. Also, Rachel is here
until Friday so if you want to call me
for some one on one, shoot me an e-mail and
I’ll set something up. More questions now?>>We know lower boundaries in a growing database has to be harder than the full data
basically, yes?>>It actually doesn’t have to be in this particular case, and that’s how this case
is dramatically different from these two cases. It really boils down to
this sensitivity of each query, of being of size, of being of value, one over the size
of the database. So, as you get a larger
and larger database, each person’s data
point matters less, and so, you can sort
of like naturally answer more queries
on a larger database. So, in some sense
that’s really at the crux of all of
these results is we’re saying, “As long as your data grows faster than your queries arrive, then you’re fine
forever.” That’s->>Can you compare to the data. This is, you don’T go forever.>>Yeah.>>When compared to
a static size database->>Oh, yeah.>>-then you less
than had the all the things in the beginning?>>Yeah. Yeah. Yeah, of course.>>These guarantees
are worse than what you get in that case.>>Yes.>>The question is
do they have to?>>They have to be, yes. Because if you’re going to wait until everything arrives and you’re just going to
answer queries then, then you’re only
going- Oh, no. Yeah.>>I will answer. Right,
I do answer them online. The question is, does that mean->>Compared to.>>Yeah. Compared.>>Yeah and if you wait until the very end until you
have everything arrived, and you just then
answer all the queries, that’s saying that
your privacy budget is going to be much lower
compared to if you had to answer
these queries a lot of times in the middle
and at the very end. There is like a natural
privacy accuracy trade-off. How the more queries you answer and the more privacy
budget you burn through, necessarily then your accuracy has to be worse to compensate. So, you see a lot of these->>In the beginning you’re like the sensitivity of
the queries is also larger. Like if there is only one person, then of course if you
wanted to answer it then what became changes
much higher then. So, initially it rose rapidly
and then it sort of like.>>Yeah. Yes. Well, the first best is just to
wait until you have all the data in the world and you just answer queries then. But of course, that’s sort
of useless as an analyst because you want information
in the meantime. So this is saying, “You’re doing not too much
worse if you are able to continually
answer queries in the meantime in addition
to at the very end.”>>This model is
very similar to like dynamic algorithms,
exactly there.>>Yeah. Yeah. A lot of the things in the privacy space they use the term “Dynamic”, they mean, the queries
arrived online. So, it’s like adaptively arriving queries for a fixed
database and this is more or less
the first work that considers dynamically
changing database.>>Another thing would be nice, just I think the second point, is that only when actually the solution changes
I will sort them? How many times, my action,
my Instagram changes?>>Yes. Yeah.>>So, that we can
quantify in terms of that.>>Yeah. Yeah.>>That’s your second point.>>So, I’ll say
this paper just said how to find out when
a change occurred, but it doesn’T combine
that with these results, which I think that is also
open is like saying, “So, I now know that I have
different data and I’m only then going to rerun
these algorithms.” I think like quantifying how much better that is over this case is sort of tied
into all these questions.
