Jan 29, 2014

Privacy in the Age of Open Data


Nothing is really private anymore. Corporations like Facebook and Google have been collecting our information for some time, and selling it in aggregate to the highest bidder. People have been raising concerns over these invasions of privacy, but generally only technically-savvy, highly motivated people can really be successful at remaining anonymous in this new digital world.

For a variety of incredibly important reasons, we are moving towards open research data as a scientific norm – that is, micro datasets and statistical syntax openly available to anyone who wants them. However, some people are uncomfortable with open research data, because they have concerns about privacy and confidentiality violations. Some of these violations are even making the news: a high-profile case about people being identified from their publicly shared genetic information comes to mind.

With open data comes increased responsibility. As researchers, we need to take particular care to balance the advantages of data-sharing with the need to protect research participants from harm. I’m particularly primed for this issue because my own research often intersects with clinical psychology. I ask questions about things like depression, anxiety, eating disorders, substance use and conflict with romantic partners. The data collected in many of my studies have the potential to seriously harm the reputation – and potentially the mental health – of participants if linked to their identity by a malicious person. That said, I believe in the value of open data sharing. In this post, I’m going to discuss a few core issues as they pertain to de-identification – that is, ensuring the anonymity of participants in an openly shared dataset. Violations of privacy will always be a risk; however, some relatively simple steps on the part of the researcher can make re-identification of individual participants much more challenging.

Who are we protecting the data from?

Throughout the process, it’s helpful to imagine yourself as a person trying to get dirt on a potential participant. Of course, this is ignoring the fact that very few people are likely to use data for malicious purposes … but for now, let’s just consider the rare cases where this might happen. It only takes one high-profile incident to be a public relations and ethics nightmare for your research! There are two possibilities for malicious users that I can think of:

  1. Identity thieves who don’t know the participant directly, but are looking for enough personal information to duplicate someone’s identity for criminal activities, such as credit card fraud. These users are unlikely to know anything about participants ahead of time, so they have a much more challenging job because they have to be able to identify people exclusively using publicly available information.

  2. People who know the participant in real life and want to find out private information about them for some unpleasant purpose (e.g., stalkers, jealous romantic partners, a fired employee, etc.). In this case, the party likely knows (a) that the person of interest is in your dataset, and (b) basic demographic information on the person such as sex, age, occupation, and the city they live in. Whether or not this user is successful in identifying individuals in an open dataset depends on what exactly the researcher has shared. For fine-grained data, it could be very easy; however, for properly de-identified data, it should be virtually impossible.

Key Identifiers to Consider when De-Identifying Data

The primary way to safeguard privacy in publicly shared data is to avoid identifiers; that is, pieces of information that can be used directly or indirectly to determine a person’s identity. A useful starting point for this is the list of 18 identifiers that the Health Insurance Portability and Accountability Act (HIPAA) requires to be removed before Protected Health Information can be treated as de-identified. A full list of these identifiers can be found here. Many of these identifiers are obvious (e.g., no names, phone numbers, Social Insurance Numbers, etc.), but some identifiers are worth discussing more specifically in the context of psychological research that shares data openly.

Demographic variables. Most of the variables that psychologists are interested in are not going to be very informative for identifying individuals. For example, reaction time data (even if unique to an individual) are very unlikely to identify participants – and in any event, most people are unlikely to care if other people know that they respond 50ms faster to certain types of visual stimuli. The data that are generally problematic are what I’ll call “demographic variables”: things like sex, ethnicity, age, occupation, university major, etc. These data are sometimes used in analyses, but most often are just used to characterize the sample in the participants section of manuscripts. Most of the time, demographic variables can’t be used in isolation to identify people; instead, combinations of variables are used (e.g., a 27-year-old Mexican woman who works as a nurse may be the only person with that combination of traits in the data, leaving her vulnerable to loss of privacy). Because the combination of several demographic characteristics can potentially produce identifiable profiles, a common rule of thumb I picked up when working with Statistics Canada is to require a minimum of 5 participants per cell. In other words, if a particular combination of demographic features yields fewer than 5 individuals, the group should be collapsed into a larger, more anonymous, aggregate group. The most common example of this would be using age ranges (e.g., ages 18-25) instead of exact ages; similar logic could apply to most demographic variables. This rule can get restrictive fast (but also demonstrates how little data can be required to identify individual people!) so ideally, share only the demographic information that is theoretically and empirically important to your research area.
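To make the minimum-cell-size rule concrete, here is a minimal Python sketch of how one might check it before sharing a file. The function names (small_cells, age_band), the 10-year band width, and the sample rows are all hypothetical and chosen only for illustration; this is not an official Statistics Canada procedure.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # the rule-of-thumb threshold described above


def small_cells(rows, keys, min_size=MIN_CELL_SIZE):
    """Return every combination of `keys` shared by fewer than `min_size` rows."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_size}


def age_band(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical sample: five students and one nurse.
rows = [
    {"age": age, "sex": "F", "occupation": "student"}
    for age in (20, 21, 22, 23, 24)
] + [{"age": 27, "sex": "F", "occupation": "nurse"}]

keys = ["age", "sex", "occupation"]

# With exact ages, every participant sits alone in her own cell.
flagged_exact = small_cells(rows, keys)

# Banding age merges the students into one cell of five,
# but the nurse is still unique and needs further collapsing.
banded = [{**row, "age": age_band(row["age"])} for row in rows]
flagged_banded = small_cells(banded, keys)
```

In this toy sample, banding age is enough for the students but not for the nurse, which is exactly the situation where you would either collapse occupation into a broader category or drop the variable from the shared file.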

Outliers and rare values. Another major issue is outliers and other rare values. Outliers are variably defined depending on the statistical text you read, but generally refer to extreme values on variables measured on a continuous, interval, or ordinal scale (e.g., someone has an IQ of 150 in your sample, and the next highest person is 120). Rare values refer to categorical data that very few people endorse (e.g., the only physics professor in a sample). There are lots of different ways you can deal with outliers, and there’s not necessarily a lot of agreement on which is the best – indeed, it’s one of those researcher degrees of freedom you might have heard about. Though this may depend on the sensitivity of the data in question, outliers often have the potential to be a privacy risk. From a privacy standpoint, it may be best for the researcher to deal with outliers by deleting or transforming them before sharing the data. For rare values, you can collapse response options together until there are no more unique values (e.g., perhaps classify the physics professor as a “teaching professional” if there are other teachers in the sample). In the worst case scenario, you may need to report the value as missing data (e.g., a single intersex person in your sample who doesn’t identify as male or female). Whatever you decide, you should disclose to readers what your strategy was for dealing with outliers and rare values in the accompanying documentation so it is clear for everyone using the data.
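The two strategies above can be sketched in a few lines of Python. This is only one possible approach under stated assumptions: collapse_rare and top_code are hypothetical helper names, the threshold of 5 is borrowed from the cell-size rule of thumb, and top-coding (capping extreme values at a chosen limit) is just one of several transformations you might pick for outliers.

```python
from collections import Counter


def collapse_rare(values, broader, min_count=5):
    """Recode rare categories into broader ones; suppress what is still rare.

    `broader` maps a specific category to a more general label. Categories
    that remain below `min_count` after recoding are set to None (missing).
    """
    counts = Counter(values)
    recoded = [
        broader.get(v, v) if counts[v] < min_count else v for v in values
    ]
    recounts = Counter(recoded)
    return [v if recounts[v] >= min_count else None for v in recoded]


def top_code(values, cap):
    """Cap extreme high values at `cap` so no single score stands out."""
    return [min(v, cap) for v in values]


# Hypothetical occupation data with two rare categories.
occupations = (
    ["teacher"] * 4 + ["physics professor"] + ["nurse"] * 5 + ["astronaut"]
)
broader = {
    "teacher": "teaching professional",
    "physics professor": "teaching professional",
}
shared_occupations = collapse_rare(occupations, broader)

# Hypothetical IQ scores: the 150 is capped at the next highest value.
shared_iq = top_code([95, 100, 120, 150], cap=120)
```

Here the physics professor is absorbed into a group of five teaching professionals, while the lone astronaut has no sensible broader group and is reported as missing. Either way, the recoding rules and the cap belong in the documentation that accompanies the dataset.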

Dates. Though it might not be immediately obvious, any exact dates in the dataset place participants at risk for re-identification. For example, if someone knew what day the participant took part in a study (e.g., they mention it to a friend; they’re seen in a participant waiting area) then their data would be easily identifiable by this date. To minimize privacy risks, no exact dates should be included in the shared dataset. If dates are necessary for certain analyses, transforming the data into some less identifiable format that is still useful for analyses is preferable (e.g., have variables for “day of week” or “number of days in between measurement occasions” if these are important).
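As a rough illustration of that transformation, the following Python sketch turns one participant's exact session dates into a day of the week and the number of days since the previous session. The function name relative_dates and the example dates are made up for this sketch.

```python
from datetime import date

DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def relative_dates(session_dates):
    """Convert one participant's exact dates into less identifying variables."""
    shared = []
    previous = None
    for current in sorted(session_dates):
        shared.append({
            # date.weekday() returns 0 for Monday through 6 for Sunday.
            "day_of_week": DAY_NAMES[current.weekday()],
            # Gap in days between measurement occasions (None for the first).
            "days_since_previous": (
                None if previous is None else (current - previous).days
            ),
        })
        previous = current
    return shared


# Hypothetical participant with two measurement occasions.
sessions = relative_dates([date(2014, 1, 6), date(2014, 1, 22)])
```

Only the derived variables would go into the shared file; the exact dates stay in the researcher's private records.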

Geographic Locations. The rule of having “no geographic subdivisions smaller than a state” from the HIPAA guidelines is immediately problematic for many studies. Most researchers collect data from their surrounding community. Thus, it will be impossible to blind the geographic location in many circumstances (e.g., if I recruit psychology students for my study, it will be easy for others to infer that I did so from my place of employment at Dalhousie University). So at a minimum, people will know that participants are probably living relatively close to my place of employment. This is going to be unavoidable in many circumstances, but in most cases it should not be enough to identify participants. However, you will need to consider if this geographical information can be combined with other demographic information to potentially identify people, since it will not be possible to suppress this information in many cases. Aside from that, you’ll just have to do your best to avoid more finely grained geographical information. For example, in Canada, a reverse lookup of postal codes can identify some locations with a surprising degree of accuracy, sometimes down to a particular street!
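If some geographic detail has to be shared, one way to coarsen Canadian postal codes is to keep only the first three characters (the forward sortation area), which covers a much larger area than the full six-character code. The Python sketch below assumes codes arrive as free-text strings; coarsen_postal_code is a hypothetical helper, and "A1A 1A1" is a placeholder rather than a real participant's code. Even the three-character prefix can be too specific in sparsely populated areas, so it is still worth checking against the minimum-cell-size rule.

```python
def coarsen_postal_code(code):
    """Keep only the first three characters of a Canadian postal code."""
    cleaned = code.replace(" ", "").upper()
    if len(cleaned) != 6:
        # Treat malformed codes as missing rather than guess at them.
        return None
    return cleaned[:3]


# Placeholder postal code, not a real participant's.
coarse = coarsen_postal_code("a1a 1a1")
```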

Participant ID numbers. Almost every dataset will (and should) have a unique identification number for each participant. If this is just a randomly selected number, there are no major issues. However, most researchers I know generate ID numbers in non-random ways. For example, in my own research on romantic couples we assign ID numbers chronologically, with a suffix number of “1” indicating men and “2” indicating women. So ID 003-2 would be the third couple that participated, and the woman within that couple. In this kind of research, the most likely person to snoop would probably be the other romantic partner. If I were to leave the ID numbers as originally entered, a participant could easily find their partner’s data (assuming a heterosexual relationship and that participants remember their own ID number). There are many other algorithms researchers might use to create ID numbers, many of which do not provide helpful information to other researchers, but could be used to identify people. Before freely sharing data, you might consider scrambling the unique ID numbers so that they cannot be a privacy risk (you can, of course, keep a record of the original ID numbers in your own files if needed for administrative purposes).
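Scrambling can be as simple as assigning each original ID a new number in random order. The Python sketch below is one way to do it, assuming the IDs are strings; scramble_ids is a hypothetical name, and the returned lookup table is the record you would keep in your own private files and never share.

```python
import random


def scramble_ids(original_ids):
    """Map each original ID to a new number assigned in random order."""
    unique_ids = sorted(set(original_ids))
    new_ids = list(range(1, len(unique_ids) + 1))
    # SystemRandom draws from the operating system's entropy source,
    # so the order cannot be reproduced from a guessable seed.
    random.SystemRandom().shuffle(new_ids)
    return dict(zip(unique_ids, new_ids))


# Hypothetical chronological couple IDs.
original = ["001-1", "001-2", "002-1", "002-2", "003-1", "003-2"]
lookup = scramble_ids(original)  # keep this table private
shared_ids = [lookup[participant] for participant in original]
```

Note that this particular sketch deliberately discards the couple and partner structure encoded in the original IDs. If other researchers need to know which two participants belong to the same couple, that link would have to be provided through a separate, equally scrambled couple variable.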

Some Final Thoughts

Risk of re-identification is never zero. Especially when data are shared openly online, there will always be a risk for participants. Making sure participants are fully informed about the risks involved during the consent process is essential. Careless sharing of data could result in a breach of privacy, which could have extremely negative consequences both for the participants and for your own research program. However, with proper safeguards, the risk of re-identification is low, in part due to some naturally occurring features of research. The slow, plodding pace of scientific research inadvertently protects the privacy of participants: Databases are likely to be 1-3 years old by the time they are posted, and people can change considerably within that time, making them harder to identify. Naturally occurring noise (e.g., missing data, imputation, errors by participants) also impedes the ability to identify people, and the variables psychologists are usually most interested in are often not likely candidates to re-identify someone.

As a community of scientists devoted to making science more transparent and open, we also carry the responsibility of protecting the privacy and rights of participants as much as is possible. I don’t think we have all the answers yet, and there’s a lot more to consider when moving forward. Ethical principles are not static; there are no single “right” answers that will be appropriate for all research, and standards will change as technology and social mores change with each generation. Still, by moving forward with an open mind, and a strong ethical conscience to protect the privacy of participants, I believe that data can really be both open and private.
