By Tobias Manolo
In big data deployments, protecting Personally Identifiable Information (PII) remains a major concern because the technology designed to protect this information cannot guarantee its safety. Some businesses use PII for targeted ads, products and services, but exposing PII leaves people vulnerable to discrimination, profiling, unwarranted scrutiny and exclusion based on demographic data.
To protect people from these issues and the potential misuse of their personal information, organisations use de-identification, which detaches the information that identifies a person from the data associated with them. De-identification methods include encryption, anonymisation, key coding, pseudonymisation and data sharding. However, these methods are now being countered by re-identification techniques. Keith Carter, adjunct professor at the National University of Singapore, explained that holding even one type of data means it can be pieced back together in a number of ways. For example, organisations can work out a person’s identity by pinpointing the address that person “regularly come[s] from at seven or eight in the morning,” and then determining whether they go on to a school or an office.
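As a rough illustration of the point above, the following Python sketch pseudonymises a handful of location records with a keyed hash and then shows how the morning pattern Carter describes can undo that protection. Everything in it is an assumption made for illustration: the sample records, the secret key, the `pseudonymise` and `likely_home` helpers and the `address_directory` lookup are all hypothetical, and a real attack or defence would be considerably more involved.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret key held by the data owner (illustration only).
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymise(user_id: str) -> str:
    """Replace an identifier with a keyed hash (one form of pseudonymisation)."""
    digest = hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Made-up location records: (user identifier, hour of day, location).
raw_pings = [
    ("alice@example.com", 7, "12 Elm Street"),
    ("alice@example.com", 8, "12 Elm Street"),
    ("alice@example.com", 7, "12 Elm Street"),
    ("alice@example.com", 9, "City High School"),
    ("bob@example.com", 7, "4 Oak Avenue"),
    ("bob@example.com", 8, "4 Oak Avenue"),
    ("bob@example.com", 10, "Harbour Office Park"),
]

# The "de-identified" dataset: identifiers are hidden, behaviour is not.
deidentified = [(pseudonymise(u), hour, loc) for u, hour, loc in raw_pings]


def likely_home(pings, pseudonym, morning_hours=(7, 8)):
    """Guess a home address as the most frequent early-morning location."""
    counts = Counter(
        loc for p, hour, loc in pings if p == pseudonym and hour in morning_hours
    )
    return counts.most_common(1)[0][0] if counts else None


# Infer a probable home address for every pseudonym in the dataset.
pseudonyms = {p for p, _, _ in deidentified}
inferred_homes = {p: likely_home(deidentified, p) for p in pseudonyms}

# Hypothetical public directory that an attacker could join against.
address_directory = {"12 Elm Street": "Alice", "4 Oak Avenue": "Bob"}
reidentified = {p: address_directory.get(home) for p, home in inferred_homes.items()}

for pseudonym, name in sorted(reidentified.items()):
    print(pseudonym, "->", inferred_homes[pseudonym], "->", name)
```

The keyed hash hides the original identifier, but the behavioural pattern survives, so a single join against outside data is enough to attach a name to each pseudonym in this toy dataset.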
Brian Christian, CTO at Zettaset, a big data management platform, explains: “Typically an ETL procedure loads big data from a traditional RDBMS data warehouse onto a Hadoop cluster. Since most of that data is unstructured, the system runs a job in order to structure the data. Then the system hands it off to a relational database to serve it up, to a BI analyst or another data warehouse running Hadoop, for storage, reference and retrieval. Any big data hand-off or moves cross vulnerable junctions.”
The issue arises because those who created big data solutions did not intend, or expect, them to be used the way they are today. Vendors and organisations are now bolting security onto distributed computing architectures that were never designed for firewalls and intrusion detection systems (IDS), which creates a major problem.
A Stanford Law Review article suggests that de-identification is a key part of today’s business models, particularly in healthcare, cloud computing and online behavioural advertising. Enterprises that rely heavily on de-identification as their privacy solution may be unable to fund, let alone find, an alternative, thereby opening the gateway to further re-identification abuse.
Ultimately, and this is where it becomes difficult, there might not be a suitable solution to the privacy concerns around big data, only solutions that protect the enterprise from liability. Until this is resolved, abuse will continue, albeit at lower levels as technology drives the delivery of better solutions.