As governments, the private sector, NGOs, and others mobilize to fight the COVID-19 pandemic, we’ve seen calls to use location information (typically drawn from GPS and cell tower data) to inform public health efforts. Among the proposed uses of location data, one of the most widely discussed is analyzing aggregated data about which locations people are visiting, whether they are traveling less, and other collective measurements of individuals’ movement. This analysis might be used to inform judgments about the effectiveness of shelter-in-place orders and other social distancing measures. Projects making use of aggregated location data have graded residents of each state on their social distancing and visualized the travel patterns of people returning from spring break. Most recently, Google announced that it would publish ongoing “COVID-19 Community Mobility Reports,” which draw on the company’s store of location data to report on changes at a community level in people’s travel to various locations such as grocery stores, parks, and mass transit stations.
Compared to using individualized location data for contact tracing—as many governments around the world are already doing—deriving public health insights from aggregated location data poses far fewer privacy and other civil liberties risks such as restrictions on freedom of expression and association. However, even “aggregated” location data comes with potential pitfalls. This post discusses those pitfalls and describes some high-level best practices for those who seek to use aggregated location data in the fight against COVID-19.
What Does “Aggregated” Mean?
At the most basic level, there’s a difference between “aggregated” location data and “anonymized” or “deidentified” location data. Practically speaking, there is no way to deidentify individual location data. Information about where a person is and has been is, by itself, usually enough to reidentify them. Someone who travels frequently between a given office building and a single-family home is probably unique in those habits, and therefore identifiable when that pattern is cross-referenced with other readily available sources. One widely cited study from 2013 even found that researchers could uniquely characterize 50% of people using only two randomly chosen time and location data points.
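To make the reidentification risk concrete, here is a minimal sketch of the kind of uniqueness test that idea implies: pick a few points from one person's location history and check whether anyone else's history contains the same points. The function names, the `(cell_id, hour)` point format, and the toy data are all assumptions made for illustration; this is not the method or the data of the study mentioned above.

```python
import random

def is_unique(traces, user, points):
    """Return True if `user` is the only person whose trace contains all `points`."""
    matches = [u for u, trace in traces.items() if points <= trace]
    return matches == [user]

def unicity(traces, num_points=2, seed=0):
    """Fraction of people singled out by `num_points` random points from their own trace."""
    rng = random.Random(seed)
    unique = 0
    for user, trace in traces.items():
        # Sort first so the sample is reproducible for a given seed.
        k = min(num_points, len(trace))
        sample = set(rng.sample(sorted(trace), k))
        if is_unique(traces, user, sample):
            unique += 1
    return unique / len(traces)

# Hypothetical toy data: each point is (cell_id, hour_of_day).
toy_traces = {
    "alice": {("office_A", 9), ("home_1", 20), ("gym", 18)},
    "bob": {("office_A", 9), ("home_2", 20), ("gym", 18)},
    "carol": {("office_B", 9), ("home_3", 21), ("cafe", 12)},
}

if __name__ == "__main__":
    print(unicity(toy_traces, num_points=2))
```

In the toy data, alice and bob share an office and a gym, so those two points alone do not distinguish them, but adding either person's home does. No names are attached to the points being matched, which is why stripping identifiers from individual traces offers little protection.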
Aggregation to preserve individual privacy, on the other hand, can potentially be useful. Aggregating location data involves producing counts of behaviors instead of detailed timelines of individual location history. For instance, an aggregation might tell you how many people’s phones reported their location as being in a certain city within the last month. Or it might tell you, for a given area in a city, how many people traveled to that area during each hour in the last month. Whether or not a given scheme for aggregating location data works to improve privacy depends deeply on the details: On what timescale is the data aggregated? How large an area does each count cover? When is a count considered too low and dropped from the data set?
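The three questions above map directly onto three parameters of an aggregation routine. The following is a minimal sketch, assuming a hypothetical input format of `(user_id, lat, lon, unix_timestamp)` pings; the parameter names and default values are illustrative only and are not recommendations for any real data set.

```python
import math
from collections import defaultdict

def aggregate_visits(pings, cell_deg=0.1, bucket_hours=1, min_count=10):
    """Count distinct users per (grid cell, time bucket).

    cell_deg     -- how large an area each count covers (degrees per grid cell)
    bucket_hours -- the timescale over which pings are pooled
    min_count    -- counts below this are dropped from the output entirely
    """
    users_seen = defaultdict(set)
    bucket_seconds = bucket_hours * 3600
    for user_id, lat, lon, ts in pings:
        # Snap the coordinates to a coarse grid cell.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        bucket = int(ts // bucket_seconds)
        # A set means each person counts once per cell and bucket.
        users_seen[(cell, bucket)].add(user_id)
    return {
        key: len(ids)
        for key, ids in users_seen.items()
        if len(ids) >= min_count
    }

if __name__ == "__main__":
    sample = [
        ("u1", 37.77, -122.41, 1000),
        ("u2", 37.78, -122.42, 1200),
        ("u3", 37.771, -122.415, 3000),
        ("u1", 37.772, -122.411, 3500),  # repeat ping, same person
        ("u4", 40.71, -74.05, 1000),     # alone in its cell
    ]
    print(aggregate_visits(sample, min_count=3))
```

Larger cells, longer buckets, and a higher `min_count` each make the output safer and less detailed. Suppressing small counts matters because a cell reporting a count of one is effectively an individual record.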
For example, Facebook uses differential privacy techniques such as injecting statistical noise into the dataset as part of the methodology of its “Data for Good” project. This project aggregates Facebook users’ location data and shares it with various NGOs, academics, and governments engaged in responding to natural disasters and fighting the spread of disease, including COVID-19.
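As a rough illustration of what “injecting statistical noise” can look like, here is a minimal sketch of the Laplace mechanism, a standard differential privacy technique for counts. It is not Facebook's actual methodology, which is not described here; the function name and the `epsilon` and `sensitivity` parameters are assumptions made for the example.

```python
import random

def noisy_count(true_count, epsilon=1.0, sensitivity=1.0, rng=random):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon.

    sensitivity -- the most one person can change the count (1 if each
                   person is counted at most once)
    epsilon     -- privacy parameter; smaller values mean more noise
    """
    rate = epsilon / sensitivity
    # The difference of two independent exponential draws with the same
    # rate follows a Laplace distribution centered on zero.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

if __name__ == "__main__":
    rng = random.Random(0)
    for eps in (0.1, 1.0):
        print(eps, [round(noisy_count(100, epsilon=eps, rng=rng)) for _ in range(5)])
```

The noise averages out over many counts, so broad trends survive while any single published number reveals less about whether one particular person was present. Choosing `epsilon` is exactly the kind of tradeoff between usefulness and privacy that the details of an aggregation scheme have to settle.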
There is no single magic formula for aggregating individual location data such that it provides insights that might be useful for some decisions and yet still cannot be reidentified. Instead, it’s a question of tradeoffs. As a matter of public policy, it is critical that user privacy not be sacrificed when creating aggregated location datasets to inform decisions about COVID-19 or anything else.
How Do We Evaluate the Use of Aggregated Location Data to Fight COVID-19?
Because aggregation reduces the risk of revealing intimate information about individuals’ lives, we are less concerned about this use of location data to fight COVID-19 than we are about individualized tracking. Of course, the aggregation parameters generally need to be chosen by domain experts. As in the Facebook and Google examples above, these experts will often be working within private companies with proprietary access to the data. Even if they make all the right choices, the public needs to be able to review these choices because the companies are sharing the public’s data. The experts doing the aggregation often face pressure to weaken the privacy properties in order to generate an aggregate data set that a particular decision-maker claims must be more granular to be meaningful to them. Ideally, companies would also consult outside experts before moving forward with plans to aggregate and share location data. Getting public input on whether a given data-sharing scheme sufficiently preserves privacy can help reduce the bias that such pressure creates.
As a result, companies like Google that produce reports based on aggregated location data from users should release their full methodology as well as information about who these reports are shared with and for what purpose. To the extent they only share certain data with selected “partners,” these groups should agree not to use the data for other purposes or attempt to re-identify individuals whose data is included in the aggregation. And, as Google has already done, companies should pledge to end the use of this data when the need to fight COVID-19 subsides.
For any data sharing plan, consent is critical: Did each person consent to the method of data collection, and did they consent to the use? Consent must be specific, informed, opt-in, and voluntary. Ordinarily, users should have the choice of whether to opt in to every new use of their data, but we recognize that obtaining consent to aggregate previously acquired location data to fight COVID-19 may be difficult to do quickly enough to address the public health need. That’s why it’s especially important that users be able to review and delete their data at any time. The same should be true for anyone who truly consents to the collection of this information. Many entities that hold location information, like data brokers that collect location from ads and hidden tracking in apps, can’t meet these consent standards. Yet many of the uses of aggregated location data that we’ve seen in response to COVID-19 draw from these tainted sources. At the very least, data brokers should not profit from public health insights derived from their stores of location data, including through free advertising. Nor should they be allowed to “COVID wash” their business practices: the existence of these data stores is unethical, and should be addressed with new consumer data privacy laws.
Finally, we should remember that location data collected from smartphones has limitations and biases. Smartphone ownership remains a proxy for relative wealth, even in regions like the United States where 80% of adults have a smartphone. People without smartphones tend to already be marginalized, so making public policy based on aggregate location data can wind up disregarding the needs of those who simply don’t show up in the data, and who may need services the most. Even among people with smartphones, the seeming authoritativeness and comprehensiveness of large-scale data can lead decision-makers to erroneous conclusions that overlook the needs of people with fewer resources. For example, data showing that people in one region are traveling more than people in another region might not mean, as it first appears, that these people are failing to take social distancing seriously. It might mean, instead, that they live in an underserved area and must therefore travel longer distances for essential services like groceries and pharmacies.
In general, our advice to organizations that are considering sharing aggregate location data is this: Get consent from the users who supply the data. Be cautious about the details. Aggregate at the highest level of generality that will be useful. Share your plans with the public before you release the data. And avoid sharing “deidentified” or “anonymized” location data that is not aggregated, because it doesn’t work.