>>So, hello and welcome everyone. It's my great pleasure to host Rachel Cummings. Rachel is a professor in the Industrial and Systems Engineering Department at Georgia Tech. She works on differential privacy, game theory, and connections between the two. She's a recipient of multiple awards, [inaudible] graduate fellowship award, a doctoral prize award, [inaudible]. So, she's going to tell us about differential privacy for growing databases.>>Thank you. Thanks guys, I'm very excited to be here. I was told this was a very casual environment, so feel free to jump in and ask questions; it's much more important that you understand than that I get through all the slides. I also have a mild speech impediment, so if I pause briefly, I promise I'm not just daydreaming during my talk. So, I'll talk about joint work with some colleagues and collaborators: Sara Krehbiel, a former Georgia Tech grad student, now faculty at the University of Richmond, and Kevin Lai and Tao Tantipongpipat, current grad students at Georgia Tech. I'll be talking today about how we can extend privacy guarantees to the case where our database is going to be growing and changing over time.

What do I mean by that? In the olden days, data science meant that you collected data once, you analyzed it, and then you went home and the world ended. This looked like things like the US census, where I get new data every ten years. Maybe it was a survey that was mailed out, where you had to first build the survey, mail it out, and then wait three months to get the responses back. Or if you were very fancy, maybe there was a call center, and you had to train these people to call and ask very specific questions, and they collected the data, and you analyzed it at the end. Modern data science looks very different. It happens online and it's changing all the time. We're now doing things like collecting everyone's health information and their tweets and their social behavior online, and maps, and so we now have to do a continuous analysis of a growing database, because we're acquiring more data over time. We also can't wait until we accumulate all the tweets in the world and then decide what's trending; in fact, I have to tell you, at all times, the current trends on the current database of tweets. So, we need new analysis tools for the case of growing databases. But in these examples, it's also highly sensitive data: this is your health information, your social data, your GPS data. So we also need some kind of privacy guarantee, because we're now collecting much more sensitive, much more personal

information about people. So, some background,

differential privacy was defined in 2006 by Dwork, McSherry, Nissim, and Smith, and it informally bounds the maximum amount that one person's data can change the output of some computation performed on a very large database. Formally it says this: an algorithm M that maps from an n-tuple of types (think of that as data collected from n different people) to an arbitrary range is epsilon-differentially private if, for all neighboring databases, meaning databases that are the same except for one person's data, it outputs approximately the same thing on the two neighboring databases. By approximately, I mean we compare the probability that it outputs anything in a set S on one database to the probability that it produces S on the other, and these should be multiplicatively close, up to this e to the epsilon factor. In pictures, it looks like this: I have some blue database, and it produces some blue curve over there. I'm then going to change person i's data from t_i to t_i prime, and I'm going to get the red curve instead. These things are close, so I can bound the impact of person i's data. Really I'm saying: whether you provide your data or you don't, whether you're truthful or you lie, I'm going to learn about the same thing anyway, so your portion of my dataset has a very small effect on the thing that I learn

at the end of the day. So, think of this set S as the collection of all bad outcomes. Maybe I'm going to analyze your health data to decide insurance rates, or to decide whether smoking causes lung cancer. Then this is saying your insurance price is going to rise with the same probability whether you share your data or not. So think of this as the collection of all the bad outcomes that may happen as a result of the analysis of this data, and then we're saying those things are going to happen with about the same probability anyway, so I'm really not relying on your data. It's also a very strong worst-case guarantee: it's worst-case over all pairs of neighboring databases and over all collections of possible bad outcomes, and so this is very, very strong. Any questions so far?
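In symbols, this is the standard definition: for all neighboring databases x, x' and all sets of outcomes S in the range of M,

\[ \Pr[M(x) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(x') \in S]. \]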

Okay. There are a couple of nice properties that make differential privacy desirable as a tool for designing algorithms. First, it's robust to post-processing. If I have some differentially private algorithm M and I then apply any other function to it, the post-processed thing is still going to have the same differential privacy guarantee. So really this says no adversary can take the output of my differentially private algorithm, go in a corner, think really hard, run some additional computations on it, and then reverse engineer my data. I'm promising no adversary can break this privacy guarantee. Note also, there are no requirements on the computational power of the adversary; rather, this is an information-theoretic guarantee. It also composes adaptively. This says if I have K different private mechanisms, and they're each epsilon_i-differentially private, then if I were to compose all K of these, I can just add up my epsilons, and that gives me my final privacy guarantee. So really, if I want to run a whole bunch of different epsilon-differentially private computations on the same data, the epsilons just add up. There are more advanced versions of this that I won't include today; for example, this still holds if you choose these M_i's adaptively, where you first run M_1, see the output of that, and then decide which M_2 to run next. There's also a relaxation of this where you have an additional delta term over here, as an additive slack: I have this multiplicative parameter epsilon, and then I have an additive slack delta. I won't talk about it much today, but there will be some brief mentions in passing of this epsilon-delta version, so just keep in mind that's what the epsilon-delta is: there's this additive delta term here. There's also a more advanced composition version where the epsilons don't just add up; instead, it looks like the L_2 norm of the epsilons, so you get much better composition in the epsilons, where what you pay is an additive loss in the delta. So you can do better than this, but this is the most simple version.
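In symbols, basic composition says that if each M_i is epsilon_i-differentially private, then

\[ M = (M_1, \dots, M_k) \text{ is } \Big( \textstyle\sum_{i=1}^{k} \varepsilon_i \Big)\text{-differentially private,} \]

and the advanced version she mentions (the Dwork, Rothblum, and Vadhan bound) composes k mechanisms that are each (epsilon, delta)-differentially private into, for any delta' > 0,

\[ \Big( \varepsilon \sqrt{2k \ln(1/\delta')} + k\varepsilon(e^{\varepsilon} - 1), \;\; k\delta + \delta' \Big)\text{-differential privacy,} \]

which is the square-root-style improvement in the epsilons at an additive cost in the delta.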

Okay. So, back to why differential privacy is great. Because we have this parameterized privacy notion, I can now move smoothly between informational extremes. I no longer have to say that you either give me your data or you don't; I can actually quantify, through my epsilon, how much information I'm leaking by performing this particular computation on my dataset. We saw that it has these post-processing and composition guarantees, which means that if you want to design differentially private algorithms, you only have to reason about very simple subroutines that are private, and then you can just add up the epsilons in your entire larger, more complicated algorithm. It's being used in practice and in theory; it's being used here, thanks guys for inviting me. It's being used in practice by Apple, Google, Microsoft, Uber, and the US Census Bureau. We now have a rich literature of differentially private algorithms that are both private and have some usefulness guarantee. So it seems great. The problem is that nearly everything that's been done operates on a static database. We have great tools for static databases; we have virtually nothing for growing databases. Previous work only allowed us to count bits in a stream, and that's

all we could do privately. So today we're going to do a lot more privately on a growing database. First, we'll see a very specific tool for answering adaptive linear queries, based on the algorithm private multiplicative weights, and we'll see why it can't just extend naturally to the growing database case. We'll show how to improve it to make it work, and then we'll talk about much more general-purpose results: how we can extend existing accuracy and privacy guarantees in the static case to the growing environment. Finally, this is a first paper on this topic, and I think there are lots of other types of growth to be analyzed, so I'll talk more at the end about questions raised by this work. Any more questions before I jump in? Cool. Okay, so the first part is on private multiplicative

weights for growing databases. This is a great algorithm for answering exponentially many adaptively chosen queries, so it's a super powerful tool, but it doesn't naturally extend to growing databases. Just one slide of preliminaries first. For this part, we're going to store our database as a histogram, so think of the database as a vector, where the i-th entry is the fraction of the entries in the database that are of data type i. There are N total types of data, and I'm just going to count how many of each type I've seen. For example, if this is binary data, say I want to count whether you have a disease or not: there are two types, you can either have a plus or a minus, and that's it, so this is a tuple of length two. Maybe I want to count how many people there are of each age, and then this N is maybe 120 if I'm being generous. Things like that. We're going to be thinking in this case about answering linear queries, where each query is also described as a vector, and it assigns a weight to each type of data x_i. We're going to assume these weights are bounded between zero and one. An important thing that's going to follow us through the talk is the sensitivity of a real-valued query, which is the maximum change in that query f if I were to change one person's data. So I'm asking: worst-case over all neighboring databases, how much can I change the value of this function f? For these query classes, we're going to have Delta f be of size one over n for a database of size n.

Finally, we want algorithms that are going to be accurate, and our accuracy notion is this alpha-beta accuracy. We say that an algorithm M is alpha-beta accurate with respect to a query class F (here, think of the class of all linear queries) if, for any database x, highly accurate answers are produced by my algorithm for all queries in that class, with high probability.
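In symbols, alpha-beta accuracy with respect to F requires

\[ \Pr\Big[ \max_{f \in F} \big| f(M(x)) - f(x) \big| \le \alpha \Big] \ge 1 - \beta, \]

using the slight abuse of notation (applying a query to the mechanism's output) that comes up in the next exchange.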

>>Your definition of linear queries looks like you're not normalizing, but in the sensitivity it looks like you are. So you are?>>Yeah, she's normalizing.>>Yes, I am normalizing.>>Yeah, a little, where it's one.>>Yeah, and in fact it's important here that it is normalized. So, yes, thank you. Yeah.>>Okay.>>This is a slight abuse of notation here: this is imagining that I can apply this query to the output produced by my mechanism, so just grant me that for the ease of notation here. Okay. So, here's the PMW algorithm. It's by Hardt and Rothblum in 2010, and it answers exponentially many adaptively chosen linear queries. It works by maintaining

a public histogram Y, which contains all the information that we know so far at each point in the algorithm. New queries arrive online, and the algorithm first decides: is my Y so far pretty good at providing an answer to this query? If the answer is yes, then it says the query is EASY, and it can just output f of Y without having to touch the true database X. If not, it classifies f as HARD, outputs a noisy answer to f, and then updates Y accordingly. The key idea here is that you only incur significant privacy loss on the HARD queries, because for the EASY ones, I really just have to check whether my answer on Y is close to my answer on the true database, so I'm not really using much information about my database. The work is in showing that, as time goes on, this Y converges to X, so I see fewer and fewer HARD queries over time, and therefore I'm not getting very much privacy loss overall. In pictures, here's how this works. I have a public histogram y; say it starts out uniform. I then get a query, say this is my query f_1, and here is the vector corresponding to f_1. Imagine that this query over-predicts, so its value is larger on Y than it is on X. This is going to be classified as a HARD query, and we're going to down-weight y according to this vector. For example, because my values on types one and two are too large, I'm going to decrease the weight placed on data types one and two, and then renormalize, which pushes more weight onto four. The opposite could happen for a HARD query: maybe my f_1 looked like this and I under-predicted, and we'll then do the same multiplicative weights update in the other direction to adjust my public histogram Y. The algorithm works by taking in queries over time, and whenever it sees a HARD query, it performs this update.
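As a rough illustration (a minimal sketch, not the exact mechanism from the paper: real PMW decides EASY versus HARD with a sparse-vector-style noisy threshold, and the budget split and learning rate here are hypothetical choices), the easy/hard loop might look like this in Python:

```python
import numpy as np

def pmw_sketch(x, queries, n, alpha, eps_hard):
    """Illustrative private multiplicative weights loop.

    x        : true database as a normalized histogram over N data types
    queries  : iterable of weight vectors f with entries in [0, 1]
    n        : number of people, so each query has sensitivity 1/n
    alpha    : accuracy target, also used as the EASY/HARD threshold
    eps_hard : privacy budget spent per HARD query (hypothetical split)
    """
    N = len(x)
    y = np.full(N, 1.0 / N)          # public histogram, initialized uniform
    eta = alpha / 2.0                # learning rate (one plausible tuning)
    answers = []
    for f in queries:
        # Noisy gap between the public answer and the true answer;
        # Laplace scale 1/(n * eps_hard) matches the 1/n sensitivity.
        gap = (f @ y) - (f @ x) + np.random.laplace(scale=1.0 / (n * eps_hard))
        if abs(gap) <= alpha:
            answers.append(f @ y)    # EASY: answer from y, no update
        else:
            # HARD: release a noisy true answer and move y toward x.
            answers.append(f @ x + np.random.laplace(scale=1.0 / (n * eps_hard)))
            sign = 1.0 if gap > 0 else -1.0      # over-predicted: down-weight
            y = y * np.exp(-eta * sign * f)      # multiplicative weights step
            y = y / y.sum()                      # renormalize to a histogram
    return answers, y
```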

The guarantee of the static algorithm is as follows. It takes in a database, a query class, a privacy parameter, accuracy parameters, and the size of the database, and it is epsilon-differentially private, and it's also accurate for K queries, for this alpha. The important thing to note is that there's this one-third power, because that's what's going to change in our extension of- oh, actually, now that I think about it. Yeah?>>What's the alpha notation in accurate?>>Alpha is the additive error.>>Yeah, I already did that one. Okay.>>Yeah. It relies on this potential argument

that says, well, eventually my Y is going to converge to my X, so these things are going to be very close. I have to show that this potential starts low, that it goes down fast as I answer more HARD queries, and that it's also bounded below. It's a standard potential argument. For the fixed database case, it looks like this. It starts at most log N, by the way the algorithm initializes things: it initializes Y to the uniform histogram, so it starts at log N. We can also quantify, as a function of the learning rate, how quickly this potential decreases, and it turns out that for every HARD query it goes down by this term, alpha squared over four, if you tune the learning rate optimally. And it's bounded below by zero.
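The potential here is the relative entropy between the true and public histograms, so the static argument is, schematically:

\[ \Phi = \mathrm{KL}(x \,\|\, y), \qquad \Phi_{\text{start}} \le \log N, \qquad \Phi \text{ drops by at least } \frac{\alpha^2}{4} \text{ per HARD query}, \qquad \Phi \ge 0, \]

which bounds the number of HARD queries by 4 log N / alpha^2, matching the next slide.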

So great. What happens if the database is growing? Do these things still hold? Oh, terrific. Okay, good. So in this case, we have at most four log N over alpha squared HARD queries, and then just by composition, I can guarantee that my epsilon loss is going to be small.>>In terms of privacy loss, this theorem is

stating that instead of N epsilon privacy loss, you're only getting log N epsilon privacy loss?>>Yes.>>Yeah.>>This N is the size of the data universe. But if you look at this, there is this alpha squared term here.>>There is a log K there.>>Oh yes, yeah. Yeah, good, yeah.>>So you can actually, okay.>>Yeah.>>The composition theorem would give you K epsilon, but now you are saying log K.>>Yes, exactly. This is why it allows you to answer exponentially many queries: because once alpha hits one, you're not really learning things anymore. So I can now have a K that is exponential in my n before my accuracy loss gets too large, yeah. The real crux of this is that you don't actually answer every query on the true data; you only answer a small number of them.>>In fact, it's a property of the multiplicative weights update algorithm.>>Yeah.

Okay, so this all works for a static database, and it's very powerful; this is an amazing algorithm. But it doesn't work if you allow more data to arrive. So we have this first part of our database, of size n, in this picture. Then let's say new data arrive. Well, my entropy can go back up as I add more entries, and in fact that's going to be the major problem. We can still show that the potential goes down a significant amount for these HARD queries, and it's still bounded below, but the size of this jump back up can be a really big deal, and in fact it can look like this: as your database doubles in size, you can have this log N over alpha growth, which is going to be- yeah?>>Just a question about the previous theorem,

just to check. So at the end, there are almost no HARD queries, because the public database has revealed the private database. So all the secrets are out; there are no more secrets left to reveal.>>That's one way of looking at it. The other way of looking at it is that I have found a database that shares the same statistical properties as my original database, but may or may not be exactly the same. So it enables you to answer all the queries in your query class that you care about, but I'm not guaranteeing that it's the same database.>>You said the static query class is fixed, and the synthetic database may be different from the true one.>>Yeah.>>But if the relative entropy between them is small, then they are very similar.>>Yeah. So, in fact,

at a certain point, once you've had this many HARD queries answered on a static database, then you can say: I don't even need to run this anymore; I know that with high probability, the public histogram Y is going to be good at answering all the queries in my class. So I can then just publish that as synthetic data. That's just post-processing, and because it's just post-processing of the output of some differentially private algorithm, it's still differentially private itself.>>Or actually [inaudible].>>Yeah, correct. So this growth can be really large, and that's a bad thing. It's sort of like, yes?>>So what is the model

of this growth? Like, batches of entries being added, or?>>Yes, good question. So here, the queries are going to arrive online, and the data points can arrive online. Here we're showing the really bad case, where I have used up a bunch of my HARD queries, and then a whole bunch of new data arrives, and it's going to take me a whole bunch more HARD ones to get back to the same place I was before. Because you start with the entropy higher than it was before, even because of this alpha term, this means you can have an unbounded number of HARD queries. So I can no longer make the same claim that, because I have this bounded HARD query budget, at some point I'm going to be done and I can say that I'm accurate. So really think of- yeah, yeah.>>So, because you're

only looking at adding. If I allow some multiplicative slack, then it shouldn't be a problem.>>Yes, depending on-.>>Like, if I only care about a one-plus-epsilon multiplicative factor, plus something like whatever. Because you're saying, at some point, if I insert too many, then the histogram can completely change.>>Yeah.>>You can basically start all over again.>>Yeah, yeah. So what happens is, imagine there's a starting point in the middle, and imagine we project the space of all databases onto a 2D plane. I started in the middle, at the uniform distribution, but the true X is over here, so I'm going to slowly move over here through the process of a lot of HARD queries, and then some adversary can dramatically change my X, way over here. So I'm going to have to do a whole bunch more HARD queries, more than it even took to get

there in the first place. So this can go back and forth basically forever. So we fix it. Really, as soon as I get new data, I'm going to re-weight my y to be closer to the uniform. So in this example, instead of being over here, and then having X move over here, and needing a whole bunch of HARD queries when my database grows that dramatically, I'm going to automatically shift back to center, or close-ish, and then start moving through HARD queries from there. So we're saying: I'm going to take my current y and re-normalize. Imagine that I have n points and I gain one new point; then I'm going to re-normalize to this convex combination between my old y and the uniform distribution. This bounds the potential growth in my entropy. So we're good again.
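Concretely, when the database grows from t points to t+1, the mixing step she describes can be written as something like

\[ y_{t+1} \;=\; \frac{t}{t+1}\, y_t \;+\; \frac{1}{t+1} \cdot \frac{1}{N}\,\mathbf{1}, \]

a convex combination of the old public histogram and the uniform histogram (the exact mixing weights in the paper may differ; this is the natural choice for one arriving point).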

So the main result of doing this is PMWG, where the G is for growing. It's really the exact same thing as PMW, only we do this update as soon as we receive new data entries, and because the update doesn't depend on the content of the data, we don't incur any privacy loss from doing it.>>Sorry. There's

this N. What is that?>>That’s the size of

the data universe.>>What is the n?>>That’s the size of

the starting database. So it says that I can be epsilon-differentially private and alpha-beta accurate, for this alpha, at each time T. Before, I had a fixed query budget: I could only answer K questions. But I can now answer a number of queries that grows exponentially in time. So as my database grows, I can answer exponentially more queries by harnessing the new data that have arrived. There's also asymptotically no loss in accuracy compared to the static case. We have a constant that's too ugly to print, so there is a constant loss, but asymptotically, it's fine. In the paper, we extend this to epsilon-delta differential privacy, and through the stronger composition guarantees, we can in fact improve alpha at some small loss in our privacy.>>So the update you make to the public histogram, that's

completely invariant to what the new data is, because you're only answering HARD queries, or things where the sensitivity is bounded.>>Yes.>>Okay.>>Yeah. It's actually critical that the update doesn't depend on the data, because if you looked at the data and then decided how to update->>Yeah.>>-it would now count towards your privacy budget.>>Got it.>>Yeah.>>Clarifying question. In the algorithm

description, at least, it seemed like the only source of randomness was coming from the query stream, I mean the kinds of queries that are coming to the algorithm. The algorithm is deterministic.>>Yes. I lied a little bit there.>>Okay.>>In fact, I'll point out where. But that's a great catch.>>No, you added noise, I can see that.>>So, this slide says how you add noise here: you provide a noisy answer to the query if it's HARD. The part that I lied about is that you actually have to decide, in a privacy-preserving way, whether a query is going to be EASY or HARD. So you have to do this randomly, and you're going to have some loss, because there are going to be some queries where you say they're EASY and then you publish a wildly wrong answer, but that's a low-probability event. There are also going to be times where you have a query that you're correct on, and you're going to burn through your HARD query budget unnecessarily, but again, that's a low-probability event.>>We didn't lie, we just added noise.>>There you go. It was a differentially private truth. Tell that to our president. Okay. Next slide. All right. So, that was

all that I had for this. Are there any questions

on this before I move on?>>It’s 30 plus

but then more it’s times this and this K

exponential root, and this is what you lose extra?>>That’s what we gain, an extra. Yes, the loss is

really just in it’s constant and you gain the ability to answer

more queries and the trick-.>>Can you go back to the-.>>Yes.>>Okay.>>The original results?>>Yes.>>This K. There’s

was no both of them? This has no constant.

So, going from K to K square root or?>>Exponential is true.>>Yeah.>>I mean there’s definitely

some constant K, right?>>Yeah. I think

the constant is-.>>Small.>>Yeah, it’s like

eight or something very reasonable like that. Yeah.>>What’s your result?>>My result is-.>>Do you have the sight?>>Yeah. All right.>>So, what is T?>>What was T?>>Capital T.>>Yeah, yeah for all times T.>>So, at timestamp, you

get K into the routine.>>Yes. These end here are a

bit misleading because we’re adjusting for this because like this K is

sort of incorporating some other things into

here and so the end is actually going to bounce

out to be the same, but it looks misleading. So, in fact this says, you can now have like

on every timestamp T.>>So, if I want to instantiate this static database

what should I set to T? Just HARD time.>>So, then you would

just set T equals N and then you would run

it once and you’re done. So, here we’re going to start at time N for more

database is of size N.>>I see.>>Yeah.>>This is not epsilon-delta?>>Right. We have

a different version that is epsilon-delta. Actually, the algorithm is the same, but the analysis of it is different, through these different composition theorems.>>In relation to the improvement you get when you relax to epsilon-delta: is there an even better privacy loss?>>I think it's something like this K becomes a square root of K, something like that. You have a weaker privacy guarantee, and then constants.>>Where is it quantified how the database is growing?>>I think that is what she means.>>Yeah. So, imagine that I get one new data point at each time, and so at time T, I have a database of size T.>>At each time, you were saying, yeah, T squared.>>So, we compare T to K to the root T. You're right.>>So, you lose. I

the root 10. You’re right.>>So, you lose. I

mean their result is K like T to the power of T, exponential in T, but you

lose square root of T so it’s likely less than

what they could have.>>I mean they just have

like a fixed constant T.>>All right.>>Yeah. I’m sorry, they

have a fixed constant of-.>>It’s still a constant,

but I’m saying is you lose, like Morris result,

I can think of Ke to the power of T. But you

lose e to the power T squared. Because they can answer

exponential queries.>>At the same database size.>>Yeah.>>Same size, N>>You’re saying, we have

to be aware of that. So, what that says that

with error of one percent, you can answer

exponentially made queries.>>Yeah, that makes sense.>>Yeah. So, let’s take it offline because I have a part

two, let’s talk about it. Cool. So, now let’s talk

about more general things. It was great that we that we can expand this very

powerful private algorithm, but I might want to do

more things than just answer adaptive linear queries. So, in this part, we’re going to talk about

a general framework for being able to reduce in the dynamic setting

to the statics setting. So, really I’m saying I’m going

to take it as a black box some private accurate

static algorithm and they then want to use that as a subroutine in the case of a growing database to

solve the same problem. So, here’s the easy K’s. Maybe I have like N data points arrive and then I have K queries arrive and I want to answer these K queries on this database. Then it’s really easy

to see when I should run my static algorithm, it should be after I

have enough new data and enough new queries and I’m going to plug that in to my favorite algorithm

and rerun it. Great. But that’s not how

data growth actually looks. Maybe your data arrived in this very bursty way and perhaps you’re

like queries due to. It’s like at what point should a rerun my seems

static algorithm as a function of how I

grow my database, and of my queries

arriving online. I’ll say again at time T, think of a database of

size T. I’m going to have like one new point arriving

at each time step. So, my T is going to count both my time and

the size of my database. So, this is the same accuracy

definition as before, but I now need to define

this somewhat strange thing. That’s going to be important

in the in any analysis. As a PG black-box and he really just going

to like I decompose my alpha guarantee into

something that depends on Beta Epsilon and

N and raised to some power p and then there’s just gonna be

some garbage out front. I think of this as some like our

problem-specific parameters. Maybe it’s a like

dimension of my data. Maybe it’s my constants. Maybe it’s some like you

know convexity parameter. But there’s going to be

solid like things out front. Our accuracy guarantees

are going to depend on on this power P and

so that’s why it’s important that we pull that out. So, if you see a PG black box, think of I’m dislike

a partitioning my Alpha into things

that depend on Beta, Epsilon and N and then some other like our

problems specific parameters. To make this more concrete, before I tell you
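Schematically, the decomposition she's describing is something like

\[ \alpha(n, \varepsilon, \beta) \;=\; g \cdot \left( \frac{\log(1/\beta)}{\varepsilon\, n} \right)^{p}, \]

where g collects the problem-specific terms out front (dimension, constants, convexity parameters, and possibly other logarithmic factors); this is a reconstruction from the slide description, and the paper's exact form may differ.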

To make this more concrete, before I tell you the more general result, I'm going to show you a (p, g)-black box and how you can schedule reruns of this black-box algorithm as a function of your data growth. We have the SmallDB algorithm, by Blum, Ligett, and Roth in 2008, which is a differentially private algorithm for creating synthetic data. It takes in a database X of size n, a data universe of size N, and a query class F, and it produces a database Y whose size depends on the log of my query class over my desired alpha squared. It samples Y with a bias towards databases that closely approximate my true database in terms of my query class F. So, think of this as a non-adaptive version of the PMW algorithm, where I have to specify up front the query class that I care about, and I then publish some Y that gives me approximately correct answers on that class with high probability. And here's the SmallDB theorem: it is epsilon-differentially private, and it's alpha-beta accurate for this alpha. This is pretty ugly, but let's decompose it into our (p, g)-black box. So, p is our power: we have log one over beta, all over epsilon n, to the power of one-third. So my power is one-third, and then I have this other stuff; the constant turns out to be 64, up to this cube root. So, I have a power p, and then I have some other stuff that depends on my problem. So, it's a (p, g)-black box for these parameters. So, let's see how we

might rerun this at various times as my database grows. Really, I'm just going to rerun the same static algorithm at different times: at times t_i equals one plus gamma to the power i, times n, for some gamma less than one, and I'm going to produce a database y_i each time. What's happening here? I'm exponentially delaying the runs of my static algorithm, and this parameter gamma tells me the base of that exponent. If I have gamma equals one, then it's like rerunning every time my database doubles; as gamma shrinks toward zero, the reruns get closer together. So this gamma can tune how quickly I want to rerun this algorithm as a function of my growth. I have this produced stream of databases y_i, and I'm just going to use the latest one to respond to all the queries that arrive until I rerun SmallDB the next time. Come on. Okay. The key is this exponential delay and the optimal tuning of gamma.
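A minimal sketch of this rescheduling idea, assuming a hypothetical budget split epsilon_i = epsilon / 2^(i+1) so that the per-run epsilons sum to at most epsilon (the paper tunes the epsilons, alphas, and betas per stage more carefully):

```python
import math

def rerun_times(n, gamma, horizon):
    """Times t_i = (1 + gamma)^i * n at which to rerun the static black box."""
    times, i = [], 0
    while (1 + gamma) ** i * n <= horizon:
        times.append(math.ceil((1 + gamma) ** i * n))
        i += 1
    return times

def run_growing(static_algo, stream, n, gamma, eps_total, horizon):
    """Rerun a static private algorithm at exponentially spaced database sizes.

    static_algo(db, eps) -> object y used to answer queries between reruns
    stream               -> data points arriving one per time step
    """
    schedule = rerun_times(n, gamma, horizon)
    db, y, run = [], None, 0
    for t, point in enumerate(stream, start=1):
        db.append(point)
        if run < len(schedule) and t >= schedule[run]:
            eps_i = eps_total / 2 ** (run + 1)  # hypothetical split, sums below eps_total
            y = static_algo(db, eps_i)          # privacy is only spent at reruns
            run += 1
        yield y  # answer arriving queries from the latest y until the next rerun
```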

So, doing this allows us to work in the growing databases setting, and things are basically the same as before, but I now have this one-fifth power here instead of the one-third power before. So, this is a concrete example. Who wants to see the real thing? Cool. No one. Yes. Booo. Okay. Huh? Lock the doors. Okay. So here's the whole thing. I won't walk you

through each step, but the key idea is that I'm going to take a (p, g)-black box, I'm going to rerun it at times that are exponentially spaced out, and I'm going to get an accuracy guarantee whose power degrades from p to p over two p plus one, compared to the same problem in the static setting. (With p equal to one-third, that's the one-fifth from the last slide.) Because before, I had this (p, g)-black box, so it was g times this stuff to the power p, and I just lose in the power of my accuracy. That's it. Okay. So, these

things, what do they all mean? The gamma is the base of my exponent, and note that it is independent of the data growth, so this is just fixed over time. This is the exponent and the delay of how often I should rerun my fixed private, accurate static algorithm. The epsilons: we tune them at each stage to make sure that our composition adds up to our overall epsilon guarantee. So, we have some desired epsilon, and I just tune these to make sure that if I sum over all times, it converges to exactly epsilon. The alpha at each time t_i is going to be this; this is just the accuracy guarantee of the static algorithm. So notice, if I have a (p, g)-black box, this is just the alpha that would happen naturally if I ran the algorithm at time t_i. And I want to do the same thing for my betas: I'm going to have this independent failure probability at each round, and I just want my betas to converge to some overall beta, so that I can take a union bound over the failure probabilities. Although there are a lot of things happening, it's actually very simple: I'm taking a fixed static algorithm, and I'm just going to rerun it at times that are appropriately spaced out in terms of my data growth. Some extensions of this in

the paper are that we give the epsilon-delta differentially private version; so here again, we have improved accuracy at the cost of a slightly weaker privacy guarantee. We also give a different algorithm in the same spirit that has an accuracy guarantee that improves over time. So this says that my alpha is not just some constant; in fact, my accuracy is going to approach perfect as my database grows large. This is appropriate when you think of problems where your data are sampled i.i.d., and you expect that, as you get an arbitrarily large sample, you should be doing arbitrarily well. So this notion is more appropriate for those types of problems. This alpha_t starts out weaker than the fixed accuracy guarantee, but it becomes stronger once the database roughly squares in size. We also extend this algorithm to the case of empirical risk minimization, where it makes sense that if you have a larger sample, then you should be doing better. What time am I supposed to end?>>1:30.>>Okay. Okay, so then I will quickly just

state the results here. I'm not going to give the ERM background; so if you know it, this will be cool, and if you don't, then you can take a nap for 30 seconds.>>So, why is this- what's the high-level intuition of the previous result, like is it->>For this one?>>The other one. The theorem is

stated like you're-.>>This.>>Yeah.>>Yeah.>>For, I use it->>So, the takeaway is that the loss in accuracy for extending a static algorithm to the growing case just depends on this power, and this is the only loss that you get; it just depends on the original alpha that you start with. This can also go on forever: as long as your database continues to grow, you can continue to answer queries about it, and you can still maintain the same->>Because the way I understand your algorithm is that you are running this black box that you're given at certain chosen intervals, right? Once you fix gamma, these intervals are also fixed, maybe of exponentially growing size.>>Yeah.>>Then are you just going to apply composition on it, or is it not?>>Yeah, yeah, because the static algorithm says, with high probability, I'm going to be approximately accurate. And so, as long as you can take a union bound over those high-probability events forever, then you'll still have the same high-probability event: they're going to be correct for all queries, forever, as long as you continue to accumulate more data.>>Sure. Okay.>>Yeah, yeah. And this also assumes that we know nothing about how this algorithm works. We don't have to know specifics about it. I just say: hand me something that is accurate and private, and boom, it now works in the growing setting. Yeah.>>What if your database is

also growing and shrinking? If there are data points that are being removed; this is a real-world constraint that we see sometimes, right? Is that flexible? Like, if you gain 10 data points but you lose three?>>Maybe you can let her finish her talk.>>Yeah.>>I'm sorry.>>Yeah. But that's actually a very good question, and in two slides, I will talk about that in my open questions. Because that's very, very real, and we acknowledge that this is one very specific type of growth, but there are lots of other types of changes that may happen.>>Sorry to interrupt.>>Yeah, sure. That's fine. Anyway, this is my last

technical slide, so I'll just put things up there. For applying this algorithm where the accuracy guarantee improves over time, here's what we get for our empirical risk. Here, we have to have epsilon-delta differential privacy to get meaningful composition. So, if you have d-dimensional samples, then with high probability, for all times T, you're going to output a classifier that has excess empirical risk of this. Comparing that to the static case: they had epsilon differential privacy, and this is the black box that we use, because we apply this and we just rerun it every single time. So, our dependence on d and beta is the same, and we have some loss because of our delta. On the bottom here, we have root T compared to their T, and that's the only loss. So, this says: if you want to do ERM privately on a growing database, this works. Pick your favorite static ERM algorithm, and it will work in this environment too. And of course, once your T grows large, then you're going to be doing very well. In the paper, we also get stronger bounds if you have more assumptions on the loss function: if it's strongly convex, then you can do better as well. Okay. So, the open questions, and I think

that's actually really important, because this is one very particular type of growth, but there are a lot of other practical types of growth, as you pointed out. Here are various questions that I think can follow from this, that are basically completely open. How can you do better if you have additional assumptions on the data? Here, we assumed adversarial data growth, the worst case; maybe the data points are sampled i.i.d. Then you should be able to do better. How much better? I don't know. Maybe they aren't quite i.i.d., but they're sampled from some smoothly changing distribution, so things that you have learned in the past are still mostly correct, and you may want to adapt them a little bit to accommodate these smooth changes. How should you do that? Don't know. You also might want to identify when you have dramatic changes in your data. I have a follow-up paper with Sara Krehbiel, Yajun Mei, Rui Tuo, and Wanrong Zhang, where we do differentially private change-point detection: saying, I want to identify, whoa, the samples I'm seeing now are dramatically different from the things I saw in the past. So this pairs nicely with this: it's saying it's not changing smoothly, but it's changing once, and quite dramatically. You may also want to do some sort of A/B testing: as your points arrive online, you want to privately say, I think my A treatment and my B treatment are the same, or they're not. There are also lots of other types of growth that may happen. For example, you may want to have changing data points, data that can churn or be deleted or removed, and that is not accounted for in this paper, but it's a very practical and very real type of change that we see. There's also the case of horizontal data growth. If you view data as a matrix, where I have a row for each person: maybe I'm not just accumulating new rows, as in this paper, but I'm accumulating new columns about the same people. That's also not accounted for here, but it's very real and very practical. So, I think these questions are completely open and very important, in terms of how to do them privately. That's all, thank you.>>So, we can have

more questions now if you guys want. Also, Rachel is here until Friday, so if you want to grab her for some one-on-one time, shoot me an e-mail and I'll set something up. More questions now?>>Do we know lower bounds? A growing database has to be harder than the full database, basically, yes?>>It actually doesn't have to be in this particular case, and that's how this case

is dramatically different from those cases. It really boils down to the sensitivity of each query being of value one over the size of the database. So, as you get a larger and larger database, each person's data point matters less, and so you can sort of naturally answer more queries on a larger database. In some sense, that's really the crux of all of these results: we're saying, as long as your data grows faster than your queries arrive, then you're fine forever. That's->>Can you compare to the static case? This is, you don't go on forever.>>Yeah.>>When compared to

a static-size database->>Oh, yeah.>>-then you do less well than if you had all the data in the beginning?>>Yeah. Yeah. Yeah, of course.>>These guarantees are worse than what you get in that case.>>Yes.>>The question is, do they have to be?>>They have to be, yes. Because if you're going to wait until everything arrives and you're just going to answer queries then, then you're only going- oh, no. Yeah.>>I will answer. Right, I do answer them online. The question is, does that mean->>Compared to.>>Yeah. Compared.>>Yeah, and if you wait until the very end, until everything has arrived, and you just then answer all the queries, that means your privacy budget usage is going to be much lower, compared to if you had to answer these queries a lot of times in the middle as well as at the very end. There's a natural privacy-accuracy trade-off: the more queries you answer, and the more privacy budget you burn through, necessarily your accuracy has to be worse to compensate. So, you see a lot of these->>In the beginning, the sensitivity of

budget you burn through, necessarily then your accuracy has to be worse to compensate. So, you see a lot of these->>In the beginning you’re like the sensitivity of

the queries is also larger. Like if there is only one person, then of course if you

wanted to answer it then what became changes

much higher then. So, initially it rose rapidly

and then it sort of like.>>Yeah. Yes. Well, the first best is just to

wait until you have all the data in the world and you just answer queries then. But of course, that’s sort

of useless as an analyst because you want information

in the meantime. So this is saying, “You’re doing not too much

worse if you are able to continually

answer queries in the meantime in addition

to at the very end.”>>This model is

very similar to dynamic algorithms, exactly that.>>Yeah, yeah. A lot of the things in the privacy space, when they use the term dynamic, they mean the queries arrive online. So it's adaptively arriving queries for a fixed database, and this is more or less the first work that considers a dynamically changing database.>>Another thing that would be nice, I think related to the second point, is to only rerun when the solution actually changes. How many times does my histogram actually change?>>Yes. Yeah.>>So that we can

quantify in terms of that.>>Yeah. Yeah.>>That’s your second point.>>So, I’ll say

this paper just said how to find out when a change occurred, but it doesn't combine that with these results, which I think is also open: saying, so I now know that I have different data, and only then am I going to rerun these algorithms. I think quantifying how much better that is than this case is sort of tied into all these questions.
