Protecting Privacy with MATH (Collab with the Census)

Disclaimer: This video was produced in collaboration
with the US Census Bureau and fact-checked by Census Bureau scientists; any opinions
and errors are my own. Every ten years the US Census Bureau surveys
the American population – the ambitious goal is to count every person currently living
in the entire United States of America and collect information about them like age, sex,
race and ethnicity. The whole purpose of doing surveys like the
census (and many other big medical or demographic surveys) is to be able to get an overall,
quantitative picture of a particular population – how many people live in Minnesota? Or Mississippi? What’s their average age? And how do these things differ in different
places, or by sex, or race? The results of the US Census are of particular
political relevance since they’re used to determine the numbers of seats that different
states get in the US House of Representatives as well as the boundaries of legislative districts
from Congress down to city councils, but big surveys are also useful for understanding
lots of other issues, too. The problem, of course, is that the Census (like many other
medical and demographic studies) is supposed to be private. Like, no one outside the Census Bureau is
supposed to be able to look at just the published statistics about the US population demographics
and definitively figure out that there’s a white, married, 31-year-old male with no kids
living in my neighborhood (that’s me). The Census Bureau is supposed to keep my information
confidential. And they’re supposed to keep the information
of every single other person living in the United States confidential, too. Which is a tall order, because how can you
keep everyone’s information entirely confidential while still saying anything at all based on
that information? The short answer is that you can’t. There’s an inherent tradeoff between publishing
something you learn from a survey and maintaining the privacy of the participants. It might seem like you could just remove people’s
names from the spreadsheet, or only publish summaries like averages and totals. But it’s easy to reconnect names to datasets
using powerful computers, and there’s a mathematical theorem that guarantees that
if you do a study, every single piece of accurate information that you release, however small
it seems, will inherently violate the privacy of the participants in that study to some degree. And the more information you publicly release,
the more you violate the individual privacies of the participants. But how do you quantitatively measure something
nebulous like loss of privacy, and then how do you protect it? To understand how to measure privacy, it’s
helpful to start by imagining how somebody would try to use published results (from a study)
and piece together the private information of the people surveyed. They could just try to steal or gain direct
access to the private information itself, which, of course, can’t be protected against mathematically
– it requires good computer security, or physical defenses, so we won’t consider it here! The kind of privacy attack we can defend against
mathematically is an attack that looks at publicly published statistics and then applies
brute force computational power to imagine all possible combinations of answers the participants
could have given to see which ones are the most plausible – that is, which ones fit the
published statistics the best. Imagine checking all possible combinations
of letters and numbers for a password until one of them works, except instead of letters
and numbers it’s checking all possible “combinations-of-the-answers-that-330-million-people-could-give-on-their-census-questionnaires” to see which combinations come closest to
the publicly published figures for average age, racial breakdown, and so on. The more closely a potential combination of
answers matches the published figures, the more promising a candidate it is (from the
attacker’s perspective). The more poorly it matches, the lower their
level of certainty. As a small example, if there are 7 people
living in a particular area and you tell me that four are female, four like ice cream,
four are married adults, three of the ice cream lovers are female, and if you also give
me the mean and median ages for all of these categories, then I can perfectly reconstruct
the exact ages, sex, and ice cream preference of everyone involved. I would start with the 3 ice cream loving
females; even though there are hundreds of thousands of possible combinations of ages
for three people, only a small fraction of those – 36, in fact – are plausible – they’re
in the right combination to give a median age of 36 and a mean age of 36 and two thirds. And the same thing works for the four females
overall – there are almost 10 million possible combinations of ages they could have , but
only 24 age combinations that are consistent with a median of 30, a mean of 33.5, AND with
at least one of the plausible age combinations for the three ice-cream lovers. Continuing on with this kind of deduction
leads to a single plausible (and perfect) reconstruction of all of the ages, sexes,
and ice-cream preferences of the people involved; a 100% violation of privacy. If, however, you didn’t list how many of
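
To make the brute-force idea concrete, here is a minimal sketch (in Python) of the first step of that reconstruction: enumerating every age combination for the three ice-cream-loving females that fits the published median and mean. The allowed age range of 1 to 115 is my assumption for illustration; the video doesn't specify one.

```python
# Brute-force step from the 7-person example: find every combination of
# three ages consistent with a published median of 36 and mean of 36 2/3.
# Assumed (not from the video): ages are whole numbers from 1 to 115.
from itertools import combinations_with_replacement
from statistics import median

AGES = range(1, 116)

candidates = [
    ages for ages in combinations_with_replacement(AGES, 3)
    if median(ages) == 36 and sum(ages) == 110  # mean 36 2/3 means the sum is 110
]

print(len(candidates))   # 36 plausible combinations under this assumed age range
print(candidates[:3])    # (1, 36, 73), (2, 36, 72), (3, 36, 71), ...
```

Repeating the same filtering for the other published statistics, and keeping only the mutually consistent combinations, is how the possibilities get whittled down to a single reconstruction.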
If, however, you didn’t list how many of the ice cream lovers were female, there would instead be two plausible possibilities, so I would be less certain which was the true combination of ages, sexes, and ice cream preferences. And the potential level of certainty of an
attacker is precisely how we measure the loss of privacy from publishing results of a study. If all possible combinations of ages and sexes
and so on are similarly plausible, then an attacker can’t distinguish between them
very well and so privacy is well protected. But if a small number of the possibilities
are significantly more plausible than the rest, they stand out – and precisely because
they stand out on plausibility, they’re also likely to be close to the truth. So to protect privacy, all possibilities need
to seem similarly plausible, or at least there can’t be plausibility peaks that are too
conspicuous. The potential for plausibility peaks is quantified
mathematically by measuring the maximum slope of the graph – if the slope never gets too
steep, then you can’t have any sharp peaks of highly plausible possibilities that stand
out.But how do we publish statistics in a way that limits the maximum slope (and possible
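
The video doesn't name it, but this "bounded maximum slope" idea is the informal version of what is formally called epsilon-differential privacy, which is the framework the 2020 Census adopted. As a reference point (my gloss, not the video's wording): a randomized publishing mechanism M is epsilon-differentially private if, for any two datasets D and D' that differ in one person's answers, and for any set S of possible published outputs,

```latex
% epsilon-differential privacy (standard definition):
% D and D' differ in one person's answers; S is any set of possible outputs
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

A small epsilon means no single person's data can make any published result much more or less likely, which is exactly the "no conspicuous plausibility peaks" condition.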
But how do we publish statistics in a way that limits the maximum slope (and possible peaks) on the plausibility plot? In practice, the best way to limit an attacker’s ability to confidently choose one scenario over another is to randomly change, or “jitter”, the published values. Like, for example, rolling a die and adding that number to the average age reported for ice-cream lovers. Jittering the published results in a mathematically rigorous way puts a limit on the slope of the plausibility graph, and thus makes it harder for any particular possibilities to stand out above the rest.
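
For a sense of what mathematically rigorous jitter can look like, here is a sketch of the Laplace mechanism, a standard differentially private way to release a count. This is an illustration under my own assumptions; the Census Bureau's actual disclosure-avoidance system is far more involved than a single noisy query.

```python
# A standard rigorous jitter: the Laplace mechanism for a counting query.
# Adding or removing one person changes a count by at most 1, so Laplace
# noise with scale 1/epsilon gives epsilon-differential privacy.
import numpy as np

def jittered_count(true_count: int, epsilon: float) -> float:
    """Release a count with privacy-loss parameter epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. publishing "number of ice-cream lovers" (true value 4):
print(jittered_count(4, epsilon=0.1))  # small epsilon: very noisy, very private
print(jittered_count(4, epsilon=2.0))  # larger epsilon: more accurate, less private
```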
Jittering results might also seem like lying, but as long as the size of the adjustment isn’t big enough to significantly change the conclusions people draw from the survey, it’s considered worth it for the privacy protection. For example, imagine I want to give you a
sense of my age while keeping my true age secret. If I just told you my age, obviously there’s
just one plausible possibility – 31! But suppose instead that I secretly pulled
a number between minus 5 and 5 out of a hat and added it to my age before telling you. In this case, all you know is that my true age is somewhere within 5 years of the number
I told you, but you don’t know my age exactly. My privacy has been preserved, though only
to a certain degree because you can be confident I’m not 20 and not 40. To protect my age more, I’d have to pull
a number between, say, -10 and 10 out of a hat and add it to my age – this increases
the number of plausible possibilities – that is, the possible true ages that COULD have
resulted in the number I told you. It also increases your uncertainty about my
actual age – the tradeoff for privacy is inaccuracy. If I wanted you to know my age within a year,
I could only pull a number between -1 and 1 out of the hat.
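
As a tiny code version of the hat-draw example: with uniform noise drawn from -k to k, every whole-number age within k of the reported number is equally plausible, so a wider draw means more plausible possibilities (more privacy) and more uncertainty (less accuracy). The reported age of 34 below is just an arbitrary illustrative value.

```python
# Uniform "hat draw" jitter: the set of true ages consistent with a report.
def plausible_ages(reported_age: int, k: int) -> range:
    """True ages that could have produced reported_age with noise in [-k, k]."""
    return range(reported_age - k, reported_age + k + 1)

for k in (1, 5, 10):
    print(f"noise in ±{k}: {len(plausible_ages(34, k))} plausible true ages")
# ±1 -> 3 possibilities, ±5 -> 11, ±10 -> 21
```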
In general, the idea is this: more privacy means less accuracy, and less privacy allows more accuracy. When
you publish results, hopefully there’s a sweet spot where you can share something useful
while still sufficiently maintaining people’s privacy. And simultaneously maintaining decent privacy
and decent accuracy gets easier and easier with larger datasets. Like how as I add more noise to this image,
you can still get the general picture even once you’ve lost any hope of telling the
true original value of a particular pixel. So, to protect people’s privacy, we can
and should randomly jitter published statistics (which the US Census, for example, has been
doing since the 1970s). However, there’s a subtlety – you can’t
just add any old random noise however frequently you want – if I simply add different random noise to this picture a bunch of different times, then once you take the average of all of the noisy images you basically get back the original clean image – and you don’t want this happening to your data.
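
Here's a quick illustration of why re-jittering the same value over and over backfires: the independent noise averages away. The numbers and the use of Laplace noise are my own illustrative assumptions, not Census data.

```python
# Averaging many independently jittered copies of the same statistic
# recovers the original value -- which is exactly what a careless
# re-publication scheme would let an attacker do.
import numpy as np

true_average_age = 36.7                       # made-up value for illustration
releases = true_average_age + np.random.laplace(0.0, 5.0, size=10_000)

print(np.round(releases[:3], 1))              # single releases: quite noisy
print(round(releases.mean(), 2))              # mean of all releases: ~36.7
```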
So, there’s a whole field of computer science dedicated to figuring out how to add the least possible amount of noise to get both the most privacy and the most accuracy, and to future-proof the publication of data so that when you publish multiple jittered statistics about people, those statistics can’t be combined in a clever way to reconstruct people’s data. But up through the 2010 census, the Census Bureau couldn’t promise this – sure, they were jittering data published in Census Bureau tables and charts, but not in a mathematically rigorous way, and so the Census Bureau couldn’t mathematically promise anything about how much they were protecting our privacy (or say how badly it had been violated). Until now! The US 2020 Census will, for the first time,
be using mathematically rigorous privacy protections. One of the biggest benefits of the mathematically
rigorous definition of privacy is that it reliably compounds over multiple pieces of
information – like, if we have a group of people and publish both their average age
and median age, each with a privacy loss factor of 3, then the privacy loss factor for having
released both pieces of information is at most 6. So you can decide on a total cumulative amount of privacy loss you’re willing to suffer, and then decide whether you want to release, say, 10 pieces of information, each with 1/10th of that total privacy loss (and less accuracy), or 1 piece of information with the full privacy loss and a higher level of accuracy.
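
Here is a toy version of that bookkeeping, using the video's numbers: under sequential composition the privacy-loss factors add, and (assuming Laplace-style noise for a count, which is my illustrative choice rather than the Census's exact parameters) each statistic's noise scale grows as its share of the budget shrinks.

```python
# Splitting a fixed total privacy-loss budget across releases.
# Losses add under sequential composition: e.g. 3 + 3 <= 6.
total_budget = 6.0

for n_releases in (1, 2, 10):
    eps_each = total_budget / n_releases      # budget available per statistic
    noise_scale = 1.0 / eps_each              # Laplace scale for a count query
    print(f"{n_releases:>2} releases: epsilon {eps_each:.2f} each, "
          f"noise scale {noise_scale:.2f}")
```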
But how much privacy we need is a really hard question to answer. First, it involves weighing how much we as a society collectively value the possible benefits of accurately knowing things about the group we’re surveying versus the possible drawbacks of releasing some amount of private information. And second, even though those benefits and
drawbacks can be mathematically measured as “accuracy” and “privacy loss”, we
still have to translate the mathematical ideas of “accuracy” and “privacy loss” into
something that’s understandable and relatable to people in our society. That’s partly a goal of this video, in fact! So let’s give it one more shot at a translation. First
and foremost: it is in principle impossible to publish useful statistics based on private
data without in some way violating the privacy of the individuals in question. And if you want to provide a mathematically
guaranteed limit on the amount of privacy violation, you have to randomly jitter the
statistics to protect the private data. The accuracy of the information after being jittered is generally described probabilistically, by saying something like “if we randomly jittered the true population of this town a bunch of times, 98% of the time our jittered statistic would be within 10 people of the true value.” So accuracy has two components: how close you want your privacy-protected statistic to be to the real answer, and how likely it is to be that close.
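
As a sketch of where a statement like "98% of the time within 10 people" comes from, here is the calculation for Laplace noise (my assumed noise distribution for illustration): the chance the jittered count lands within a given tolerance of the truth is 1 - exp(-tolerance / scale).

```python
# For Laplace(0, scale) jitter, P(|noise| <= t) = 1 - exp(-t / scale).
import math

def prob_within(tolerance: float, scale: float) -> float:
    """Chance a jittered statistic lands within `tolerance` of the truth."""
    return 1.0 - math.exp(-tolerance / scale)

for scale in (1.0, 2.5, 5.0, 10.0):
    print(f"scale {scale:>4}: within 10 people "
          f"{prob_within(10, scale):.1%} of the time")
# scale 2.5 gives ~98%, the kind of guarantee quoted above
```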
The loss of privacy due to the publication of information is described in terms of how confidently an attacker would be able to single out a particular possibility for the true underlying data – that is, in terms of the plausibility of different possible true values for that data. Given the published information, are there
just a few possibilities for the true data? Or are there many, many, plausible possibilities
for what the true data might be? Essentially, loss of privacy is measured by
the prominence of peaks on the plausibility plot. And so the protection of privacy requires
policing the possibility for such peaks. If we individuals are going to willingly participate
in scientific or other studies and surveys or use services where we reveal potentially
sensitive personal information, we should really demand that the researchers or organizations
utilize a mathematically robust way of protecting our privacy. Simply put, if they can’t guarantee there
won’t be a peak in plausibility, then we shouldn’t agree to give them a peek at our
data. SPONSORSHIP MESSAGE
Thanks to the U.S. Census Bureau for supporting this video. The founders of the US understood that an
accurate and complete population count is necessary for the fair implementation of a
representative democracy, so a regular census is enshrined in the US Constitution. The US 2020 Census will be the first anywhere
to use modern, mathematically guaranteed privacy safeguards to protect respondents from today’s
privacy threats. These new safeguards will protect confidentiality
while allowing the Census Bureau to deliver the complete and accurate count of the nation’s
population. They will also give those who rely on census
data increased clarity regarding the impact that statistical safeguards have on their
analyses and decision-making. In short, the Census Bureau views the adoption
of a mathematical guarantee of privacy as a win-win. Here’s how the chief scientist at the Census Bureau thinks about it: there is a real choice that every curator of confidential
survey data has to make. If they want the respondents to trust them
to protect confidentiality, then the curator has to be prepared to give (and implement)
mathematically provable guarantees of privacy. Unfortunately, this means there’s a constraint
on the amount of information you can publish from confidential data. It’s mathematically impossible to provide
perfectly accurate answers for as many questions or statistics as you want while also protecting
the privacy of respondents. So curators need to do two things: understand
the needs and desires of the people who provided data and the people who want to use the data
in order to determine precisely what balance of accuracy vs privacy to choose, and then
not waste that limited privacy budget by publishing accurate answers to unimportant questions.

100 Replies to “Protecting Privacy with MATH (Collab with the Census)”

  1. I think it would be helpful to describe what we know about these predators. Try and figure what the real dangers are from the less private options. What are these predators going to target and what are they going to try and do with their targets?

  2. I have a question. What if you do not get NAME, ADDRESS etc for statistics which doesn't need anything nonstatistical info like these. You won't need to protect privacy if there's nothing "private" (here: connecting to the definite you)

  3. Published data is privacy preserved. But the raw data and people involved are still sources of breach. Raw data can be encrypted and locked away. What about people especially those who know the "jittering factors"? If you have published data, and "jittering factors", can raw data be recovered accurately?

  4. minutephysics, CGP Grey, and Fermilab all upload in one day. Plus, lootboxes get labeled as gambling like.
    Is this a dream?

  5. why would anyone care if their race, sex and age were public? it's nothing you can change really, and it's true, so what would be the problem?

  6. As someone who actually publishes differential privacy research, I just would like to mention that a privacy budget of 30 is absurdly high; there are cases where a privacy budget of 30 would allow you to reconstruct someone's data with over 95% accuracy. On our research team, we would never consider a privacy budget above 5, and the gold standard was .01.

  7. Anybody know what is the reason for having this kind of censuses at all? It seems wild to me that a government doesn't know who are living in a country in the first place. In my country every single citizen has a social security number, so the government knows without asking all the relevant information about each citizen. Why doesn't the US know their own citizens, and has to use huge amount of resources to recount them periodically?

  8. The thing is that in nearly any given community epsilon is soooo small as to be practically unusable. I worry that this video will leave many thinking that we're dealing with epsilons that have useful value beyond being intellectually stimulating. I'm glad we're implementing matrix smoothing, but for everyone else who doesn't understand the math, there's nothing for you to worry about and you should really complete your census. Us nerds are busy battling privacy concerns long long before they should be of concern for people who aren't mathematicians.

  9. Gerrymandering is done based on the race info. And guess who does the gerrymandering? Don't give your race info to the party that keeps you out of power. Racism will remain until you stop believing in races. You're human! Member of the human race!

  10. Does the average person even know the value of his private information or is it just a case of "I don't like something of mine being tampered with just because it is mine"
    Privacy is weird. People want their information to be private yet they happily talk about their private information whenever they are introducing themselves and get to know strangers.

  11. Who the heck wants to spend resources on figuring out how many people this and how many pensioners that etc? What do you do with the data anyway? Are people really that paranoid?

  12. I think it's all a ruse: they probably throw the submitted census forms into the trash upon receiving them, and then remove the stamp from the envelope and sequence the DNA on it. – j q t –

  13. What if data is jittered locally (say the census is done using an app) at the time of gathering it without ever registering the exact data?

  14. ya that is why the census I fill out reads more like a novel … I give the right number in the house hold … every thing else should be considered suspect … no amount of math will never prove any thing on my report

  15. The census is a very important tool for keeping our government fair and functioning, and even though it’s been politicized lately, I’m really glad people like Henry are talking about it. Good policies start with good data.

  16. Couldn't you also occasionally publish the actual data because if you say you adjust the data wouldn't let's say 1 in 50 times, wouldn't that make it harder to crack since you always assume it isn't adjusted?

  17. Wait… couldn’t you buy data from Facebook, Google, and Twitter,
    Compare them all to each other,
    Then compare that data to the census to try and figure out the jitter?

  18. This is all very interesting, however a couple questions comes to mind, what is the point? What damage does this loss of privacy cause?

  19. And if all the data does not show statistical spikes to begin with, seemingly all things being the same, the jittering can determine the new political boundaries and new districts. Privacy vs. Voting power? A dangerous compromise.

  20. The thing is, the actual results of the census are published after 72 years. This means the actual results from the 1950 Census will be released in 2022. So the results of the 2020 Census will be released in 2092.

  21. Cool math and all, but isn't this a bit excessive? The prevalence of social media has told us that most people either don't care about their privacy or at least hasn't experience negative effects in losing their privacy. Is it necessary to prevent some nerd from wasting a ton of compute time to obtain a data set too massive and useless for anything malicious, when people just post this stuff on facebook anyways? This feels more like an attempt to protect some theoretical ideal of privacy than anything practical.

  22. I would like to see you do a video on attempting an actual count, v. using a more accurate sampling survey. Since this is, unfortunately, a political issue, I didn’t expect it to be mentioned in a video sponsored by the Census Bureau.

  23. Perceiving Physics Persons Percolated Preparations Postulating Possible Probablistic Privacy Problems Passified Personal Passion Pertaining Principled Privacy Process.

  24. But isn't my privacy still somewhat protected? The way I understood it, the algorithm only knows how many of what kind of people exist, like 2 female ice cream lovers, etc., and not whose name is actually behind that statistic. Wouldn't information like that be useless?

  25. So does the census bureau take into account all the information you can already find out about most people on the web? With accurate values for many of the questions for most people, it is MUCH easier to find plausible values for the remaining people. Of course this data means that most people have very little privacy to begin with.
    If you don't believe me, look yourself up. There are several sites that have detailed information on most people. Check any one of them and be very afraid! Then notice the completely wrong values scattered in and be even more afraid.

  26. It is lying. And it will affect the results if you change it. What's the sense of a census if you'll taint the results anyway? No wonder statistics are unreliable.
    Afterthought: Why not just don't tell anyone, except close relatives or the people doing the census or other companies requiring it, your age or birthdate? That way, no one will know how old are you and can't have any data be traced back at you. Just stop giving out private information to strangers, coworkers, drinking buddies, "friends" etc.

  27. I'm wondering when this channel stopped creating content corresponding to its name. It's neither physics nor minute (or even close to it :p)

  28. Instead of adding jitter, why not just decrease the precision of the numbers released? Jitter is misleading and can lead to people making false conclusions based on the data, while a decrease in precision would likely have the same if not a more drastic increase in privacy?

  29. Yay! My favourite channel, 12minutemaths has uploaded a new video!

    (I do actually love the video, I’m not hating, I just think it’s funny.)

  30. The summary at the end of this video is absolutely wonderful. The whole video was great but dense, and the summary helps contain it all.

  31. Could you not just round it to a reasonable number of digits? 37 gives less information, but is still truthful, if the value is 36 2/3

  32. It seems silly to me to consider your race or age "private" when both can be estimated purely by appearance and the former is a political category.

  33. Great video! Thanks
    Question: Do you know if these models take into account the noise inherently present in any data collection, especially on this scale?
    I.e., assuming even zero perturbation of the data, the knowledge you get from it is still a proxy of the real information (due to human errors, intentional misinformation, etc.), so taking that into account might give you some leeway in your "privacy budget". Maybe this can be modeled as an increase in budget without harming privacy?
    Just a thought 🙂

  34. Accurate census and then fair voting districts, please. The level of Gerrymandering we have today is unacceptable. I Think there should be some national level agreement on what are the best practices for making fair voting districts.

    I guess that even the term "fair" is open to interpretation if you are trying to hold on to power.

    That just may be a good idea for a video, or has Grey already done that video? If so, surely it was so long ago that you could do one to simply address some more current concerns.

  35. This is one reason the IRS breaks the law, all information on tax payers are to be private. As far as census it is only to count citizens for political reasons.

  36. Interesting ideas, but I can't picture privacy being a problem when the census has MILLIONS of data points. Sure, we have hundreds of variables made publicly available, but what can you possibly reconstruct from millions of averaged data points?!

  37. Explain to me why anybody other than the government would possibly care. Who cares if there are exactly a certain amount of men or women in a given area. Or if they are married or whatever. Nobody in their right mind would even use this information for mischievous acts, for the simple fact that there are a thousand easier ways to steal from people. What in the hell is the point?

  38. But you preserve privacy if all the information is stored on different tables. Instead of having a table with gender age job favourite color. You can have a table with age and gender
    Then a table with age and fav color
    Then a table with gender and fav color

    Basically you're unrelating the information, and you, the company that made the survey, decide which tables are more useful from the point of view of the statistics of others.

  39. In truth the only true way to have privacy is if you stay in your house and use no electric devices that can be scanned or in any other way read from outside and never go outside and have to be somewhere where no one knows you or knew you before could identify you. But this leads to the problems of how to pay bills, how to obtain food or clothing, how to have a water supply that’s not being monitored, how do you light the interior without having an electronic meter outside or having candles delivered seeing as windows allow others to see in. As a single individual you could go to some very remote area build a cabin grow food or hunt for food make your own clothing from what you grow or hunt find a water source be a hermit. The problem is there’s not enough room for everyone to do that. A better way to start protecting peoples privacy would be to update our Social Security cards from 1820s technology start using smart ships that include information like year of birth so that no one could just take your ID without being The right gender the right age group in the right location and have enough knowledge of family history and other places family may live and work history. The idea that somehow no one can see you know who you are know what your gender is what you like to buy what style clothing you wear and somehow be protected is foolishness the second you step outside of your house you give up your privacy. You are now in public you can be seen you can be heard you can be observed. Our founding fathers had a better understanding of this but some foolish people have decided that no one that I don’t want to know anything about me shouldn’t ever be able to know anything about me and that’s impossible. A good example I’m your brother it’s your birthday you don’t want anyone to know I walk in and say happy birthday bro everyone now knows who is in this area and could hear me or who talks to anyone of those people. We need to start understanding the difference between private and public life again.

  40. Next thing you know, the supreme court will rule this unconstitutional because "using math to count things bad" (see this: https://www.law.cornell.edu/supremecourt/text/525/326)
