[Background music] In today’s highly connected society, we are constantly being asked to provide personal
information to retailers, voter surveys, medical professionals, and other data collection efforts.
For the purposes of research, we disclose our name, age, sex, address, Social Security number, medical information, purchasing history, political affiliation, and more. All of these variables are contained in large
databases and form enormous collections of microdata.
This microdata is used by statistical agencies to facilitate research in fields such as public
health, economics, and sociology. In order to use this information to its fullest,
the organizations that collected this data often share microdata with other agencies
for the sole purpose of statistical analysis. In the U.S., privacy of microdata is protected
by the Confidential Information Protection and Statistical Efficiency Act of 2002.
This means that before an agency disseminates our microdata, it must be altered in some way to ensure that an individual cannot be identified. Simply removing key identifiers such as names or addresses is often not enough to protect our identities.
For example, in a small-town community, everyone in the community may know that the only American
Indian who lives there is named John Doe and he is 50 years old. So stating that a particular
health record came from a 50-year-old American Indian male lets everyone know that it is
John Doe, even if his name is not on the health record.
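The re-identification risk in this example can be sketched as a simple uniqueness check on the remaining quasi-identifiers. This is a minimal illustration with hypothetical records, not a method described in the video: any record whose combination of age, race, and sex is unique in the data set stands out, name or no name.

```python
from collections import Counter

# Hypothetical de-identified records: names removed, but
# quasi-identifiers (age, race, sex) remain.
records = [
    {"age": 50, "race": "American Indian", "sex": "M"},
    {"age": 34, "race": "White", "sex": "F"},
    {"age": 34, "race": "White", "sex": "F"},
    {"age": 29, "race": "Black", "sex": "M"},
]

# Count how many records share each quasi-identifier combination.
counts = Counter((r["age"], r["race"], r["sex"]) for r in records)

# A record is at risk if its combination is unique in the data set.
unique = [combo for combo, n in counts.items() if n == 1]
print(unique)  # the 50-year-old American Indian male stands out
```

In a small community where everyone knows who matches that unique combination, the "anonymous" record is effectively named.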
Therefore, agencies may also need to alter values of sensitive attributes to ensure confidentiality
and public trust. All public use data that is released undergoes
some type of Statistical Disclosure Limitation method.
For example, data can be grouped into aggregated categories, such as age or location categories.
More extreme alterations include swapping attribute values, such as sex, between records and adding numerical noise to the data set. The degree to which the data is altered varies
from one case to another. The more the data is altered, the lower the
risk of disclosure. There is of course a tradeoff. As the data
is increasingly altered, the inferences that can be made from this data become less accurate.
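Two of the alterations just mentioned, aggregation into categories and noise addition, can be sketched in a few lines. This is a toy illustration with made-up values, not an agency's actual procedure:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical microdata: exact ages and incomes.
ages = [23, 37, 50, 61, 44]
incomes = [41000, 52000, 75000, 68000, 59000]

# Aggregation: collapse exact ages into ten-year categories.
def age_band(age):
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

banded = [age_band(a) for a in ages]

# Noise addition: perturb each income with zero-mean Gaussian noise.
noisy = [round(x + random.gauss(0, 2000)) for x in incomes]

print(banded)  # ['20-29', '30-39', '50-59', '60-69', '40-49']
print(noisy)   # close to, but not equal to, the original incomes
```

Wider age bands or larger noise lower the disclosure risk, but also blur whatever pattern an analyst is trying to estimate, which is exactly the risk-utility tradeoff.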
And so there is a delicate balance between disclosure risk and data utility, or data quality. So how do these agencies decide which methods to use to alter the data without compromising interpretations while still maintaining confidentiality?
Although there are few hard principles to guide this decision, there are a number of
complex modeling strategies that can be used to predict the level of confidentiality maintained,
depending on the nature of the data. There are also a number of important points
to consider when applying these strategies. Are the data randomized?
Or are these observational studies that compare data from existing databases?
Is there missing data? Are there many possible outcomes, or just two?
Is this categorical data? Is the aim to determine causality? The answer to each of these questions will
affect the alteration strategy to be applied. An area of current research and development
aimed at protecting data confidentiality is the use of synthetic data.
This strategy actually uses simulated data to represent any sensitive information.
The use of synthetic data has been applied in the Census Bureau's Survey of Income and Program Participation to gauge the effectiveness of public-assistance programs.
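The core idea behind synthetic data, fit a statistical model to the real values and release simulated draws from that model instead, can be sketched as follows. This is a deliberately simple parametric example with hypothetical incomes; real synthetic-data systems use far richer models than a single normal distribution:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sensitive attribute: household incomes from real microdata.
real_incomes = [41000, 52000, 75000, 68000, 59000, 47000, 83000, 55000]

# Fit a simple parametric model (a normal distribution) to the real data...
mu = statistics.mean(real_incomes)
sigma = statistics.stdev(real_incomes)

# ...then release simulated draws in place of the real values.
synthetic = [round(random.gauss(mu, sigma)) for _ in real_incomes]

print(synthetic)  # plausible incomes, but no value maps back to a person
```

Because the released numbers are drawn from the model rather than copied from individuals, analysts can still study the overall distribution while no record corresponds to a real person.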
And the use of synthetic data is catching on.
As new strategies for synthetic data are developed and tested, our microdata and our identities are becoming increasingly secure. [Background music]