“De-Identified” Data under the CCPA – Some Words of Caution
The CCPA (as recently amended) takes effect on January 1, 2020. As is well known, the law imposes a wide range of disclosure and other operational obligations on businesses that collect, use, or sell consumers’ “personal information.” Countless businesses are spending large amounts of time and money figuring out what they have to do to comply with the law.
Some of those businesses may have been cheered by one of the amendments recently approved by the California legislature and signed into law by Governor Newsom. As amended, the law flatly states that “personal information” – the type of information collected and used by businesses that triggers so many burdensome and expensive obligations – “does not include information that is de-identified or aggregated.” Cal. Civ. Code § 1798.140(o)(3) (as amended).
Is De-Identifying the Answer to CCPA Compliance?
Those businesses may ask: can we escape some of the burdens of the law by “de-identifying” the information we have about our customers? (“Aggregated” information is, essentially, an average of information about a group of customers – something else entirely. See Cal. Civ. Code § 1798.140(a).)
Unfortunately, there are reasons to be skeptical that de-identification will be useful in significantly mitigating CCPA obligations. One key reason is that the law defines “de-identified” in an extremely stringent way. For a business to count information it has collected about consumers as de-identified, the following criteria must be met:
- The information “cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer”; and
- The business must have implemented technical safeguards and business processes that prohibit re-identification; and
- The business must have implemented business processes to prevent inadvertent release even of the de-identified data; and
- The business must not make any attempt to re-identify the information.
Cal. Civ. Code § 1798.140(h).
The first requirement will be particularly hard to meet. It’s hard to see how a business could make the required judgment of “reasonableness,” because the law doesn’t define any metrics for how difficult re-identification must be. Meanwhile, data scientists are getting increasingly clever at finding ways to re-identify supposedly anonymous data.
Third Parties Could Re-Identify Customers With Outside Data Sources
One key problem is that re-identification can be highly accurate in cases where a supposedly de-identified dataset is analyzed using outside sources of information that are not, themselves, de-identified. These outside data sources can be used as, in effect, a key to unlock the identities of particular people in the supposedly de-identified dataset.
- Here’s how it works: Suppose a business strips away all identifying information from a list of transactions – no names, no addresses, no phone numbers, nothing that could seemingly tie the transactions to any particular person. The dataset would simply indicate that one (unidentified) person bought various items at various prices on various days, that another (also unidentified) person bought other items, etc.
- Suppose that even the business has no way to connect any particular customer on its transaction list with any actual person – all it knows is that somebody made each set of purchases. In that case – even where the business itself has lost the ability to make the connection – a third party with a relatively small set of a specific person’s specific transaction data could reliably determine which customer on the list that person really was.
This kind of re-identification attack has been shown to work with, among other things, credit card purchases and mobile phone location information.
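The matching step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the pseudonymous IDs, the transactions, and the helper name `candidates` are all hypothetical, and real attacks work on far larger datasets with fuzzier matching. The point it shows is that each additional outside fact shrinks the pool of records that could belong to the target.

```python
from typing import Dict, List, Set, Tuple

# A transaction is (day, shop, price). All values are invented for illustration.
Transaction = Tuple[str, str, float]


def candidates(dataset: Dict[str, Set[Transaction]],
               known: List[Transaction]) -> List[str]:
    """Return the pseudonymous IDs whose records contain every known transaction."""
    known_set = set(known)
    return sorted(pid for pid, txns in dataset.items() if known_set <= txns)


# Hypothetical "de-identified" dataset: no names, only arbitrary record IDs.
deidentified = {
    "user_a": {("2019-03-01", "bakery", 4.50),
               ("2019-03-02", "pharmacy", 12.00),
               ("2019-03-05", "bookstore", 18.99)},
    "user_b": {("2019-03-01", "bakery", 4.50),
               ("2019-03-03", "cinema", 11.00),
               ("2019-03-05", "grocer", 42.10)},
    "user_c": {("2019-03-02", "pharmacy", 12.00),
               ("2019-03-03", "cinema", 11.00),
               ("2019-03-04", "cafe", 3.75)},
}

# Outside knowledge a third party holds about one real, named person
# (for example, from receipts or social media posts).
outside_knowledge = [("2019-03-01", "bakery", 4.50),
                     ("2019-03-03", "cinema", 11.00)]

one_fact = candidates(deidentified, outside_knowledge[:1])
two_facts = candidates(deidentified, outside_knowledge)

print(one_fact)   # ['user_a', 'user_b']
print(two_facts)  # ['user_b']
```

With one known purchase, two records remain plausible; with two, only one does. The business never stored a name, yet the third party now knows every other purchase in that record belongs to the person it was looking for.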
These kinds of attacks on supposedly de-identified data have been well-publicized, which may explain why the CCPA requires a business to have procedures to prevent inadvertent release of de-identified data. Indeed, looking at the CCPA’s definition of “de-identified,” it appears that the law assumes that re-identification is inherently possible – if re-identification weren’t possible, there’d be no reason to require technical safeguards and business processes to prohibit it, and no reason to worry about inadvertent release.
Unfortunately, the assumption of the law is probably right. Even if the business itself can’t figure out how to re-identify the information, that doesn’t mean that someone else couldn’t do so if the information were released, or if the business hired someone more clever to take on the job of re-identification.
It’s About What Everyone Else Already Knows
This concern parallels Kerckhoffs’ Principle in cryptography (other than a user’s secret key, everything about how an encryption system works should be public), and Joy’s Law in management (“No matter who you are, most of the smartest people work for someone else”). The potential for re-identification doesn’t depend on what the business knows or does at any one time – the business itself may be entirely unable to re-identify the data.
Instead, the potential for re-identification depends on what everyone else knows and can do with the dataset.
This doesn’t mean that de-identification is an impossible standard to meet. As data scientists and others continue to turn their attention to technologies and techniques to protect privacy, new methods of de-identification may be developed that are robust against presently-understood methods of re-identification using outside datasets.
For example, it was recently reported that it is possible to use so-called “adversarial examples” – fictitious data designed to confound artificial intelligence-based data mining – to protect the privacy of people who post some information online, but want other data kept secret. Perhaps these or other methods could be adapted to the problem of de-identification of consumer information – which, on the surface at least, seems analogous, in that the business wants to keep using some information about its customers, without revealing – to itself or others – who those customers actually are.
But if de-identification isn’t impossible, it is surely much, much harder than businesses might hope for.
Expertise, Not Common Sense, Should Guide Compliance
Businesses subject to the CCPA should realize that common-sense intuitions about what amounts to adequate de-identification are very likely to be wrong. So, if a business wants to treat some of its information as adequately de-identified for purposes of the CCPA, it would be a good idea to have an expert data scientist provide both advice as to what constitutes adequate de-identification and documentation confirming the expert’s view that re-identification of a dataset is not “reasonably” possible.
Health privacy experts will see that this draws on the option of expert certification of adequate de-identification under the Health Insurance Portability and Accountability Act regulations, 45 C.F.R. § 164.514(b)(1). De-identification of consumer information under the CCPA, however, is quite different from de-identification of health data under HIPAA. Among other things, unlike under HIPAA, having a documented expert opinion that the de-identification was reasonable is no guarantee under the CCPA that the business won’t be subject to enforcement action if its practices become known.
But such documentation would at least be a first line of defense.
Perhaps, at some point, the California Attorney General will issue regulations clarifying how to determine if particular methods of de-identification are sufficiently “reasonable” to pass muster under the CCPA. But as of now, businesses should probably look at de-identification, not as a potential way to mitigate obligations under the law, but, instead, as something of a minefield all its own.