Original URL: https://www.theregister.com/2013/08/19/big_data_security_considerations/

Building big data? Are you building a security headache too?

I didn't mean to isn't good enough

By Danny Bradbury

Posted in On-Prem, 19th August 2013 10:39 GMT

The world and its dog has been shocked by the Prism news story. Early in June, we found out that the US National Security Agency (NSA) had developed a secret data-gathering mechanism to steal all our data and store it in a large data warehouse.

We are outraged that it is being mined, searched and otherwise prodded. But do we really think that big data security problems stop at Google, Facebook, Microsoft and Fort Meade?

The private sector has been collecting data on all of us for ages. It is stored in massive data sets, often spread between multiple sources. What makes us think this is any more secure? At least the NSA is well trained in keeping it all under lock and key.

Social trend

What does “big data” mean, anyway? Some describe it – wrongly – as simply a lot of data in a relational database. But if that were the case, then the security challenges would be the same as for conventional databases. And they aren’t.

Others view it as data sets so large that they cannot be handled by traditional relational tools. But we have had that kind of thing for years, in the form of data warehouses.

One difference is that modern large data sets often consist of far more varied data, including unstructured stuff such as tweets. Big data is inherently social, meaning that much of it is personal.

Big data is also supposed to perform better. Really large data sets can be tuned to look for “weak signals” – emerging trends that a traditional data warehouse-based business intelligence system may not have spotted.

The goal is also to have them work quickly, so that they can help companies predict and react to market trends efficiently. No more three-day turnarounds for specific reports here.

So big data is more complex, more flexible and faster. It is powerful stuff, but with power also comes risk. Big data carries unforeseen security consequences, warns Tony Lock, programme director at analyst firm Freeform Dynamics.

“Customers give you data to use for certain purposes, but they may not have allowed you to start crunching it to answer all kinds of questions,” he says.

Many companies have not considered those issues, he adds.

In practice, says PA Consulting IT specialist James Mucklow, this means you must have a clear policy, explaining what you are going to use customers’ data for.

Big data can provide deep-dive profiles of individuals by using sources that we are not always aware of. Take loyalty cards, for example.

The credit-card industry spends millions developing and enforcing data security and privacy guidelines for the storage of personal financial information. Anyone dealing with currency transactions of any sort is heavily regulated. But loyalty points are not currency and don’t face the same kinds of rules.

We know where you live

Yet loyalty card customers provide mounds of personal information, both directly and indirectly. They may hand over names and addresses, gender, phone number, birthday, and email addresses. Sometimes, they even reveal their income.

Even basic postal code information can enable companies to infer more information about you, based on the demographic data for your area.

Every purchasing decision can be tracked and sucked into a wider data set. Suddenly, data has got much bigger – and much more personal.

“You have to be sure that you are seen to be using the data in a responsible way,” says Mucklow. He cites the story of US retailer Target, which worked out that a teenage girl was pregnant before she had told her parents, and let the cat out of the bag by sending her advertising leaflets for baby products.

One of the biggest challenges for companies holding big data sets is that they are like the pan-dimensional, hyper-intelligent beings that built Douglas Adams’s computer, Deep Thought.

They asked the computer for the answer to Life, the Universe and Everything. After 7.5 million years, it told them that the answer was 42; it transpired that they had never really understood the question.

Mystery questions

Big data sets are massive pools of data, designed to answer questions that people don’t even know they want the answer to. It is tricky defining privacy policies that provide enough flexibility to make proper use of the data and enough privacy to ensure that customers are happy.

Ideally, all of this data would be rendered anonymous, but this can provide a false sense of security, warns Jamal Elmellas, technical director for Auriga, a security consulting firm.

“The mechanism you use to anonymise that data must be sufficiently robust to not breach the Data Protection Act but also leave the data in a state that is useful for what you want to achieve,” he says. “It is a very fine line.”

Unfortunately, companies often get it wrong. Data ends up being merely “pseudo-anonymised”, he warns, making it relatively easy to reassemble into information that can help to identify individuals.

We have seen this before. Researchers used re-identification techniques to find user identities in an anonymised data set published by Netflix in 2006. They matched that data against IMDb, a third-party source of movie reviews written by individuals.

This shows how big data’s biggest strength – its ability to derive data from different sources – is also its biggest security weakness.
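The Netflix-style attack can be sketched in a few lines. This is a hypothetical illustration, not the researchers’ actual method: all names and records below are invented, and the shared quasi-identifier here is simply a (film, rating, date) tuple appearing in both sources.

```python
# Hypothetical sketch of re-identification by joining an "anonymised" data
# set with a public one on shared quasi-identifiers. All data is invented.

anonymised = [  # published with names stripped and user IDs randomised
    {"user": "u_4821", "film": "Brazil", "rating": 5, "date": "2006-03-14"},
    {"user": "u_4821", "film": "Pi", "rating": 4, "date": "2006-04-02"},
]

public_reviews = [  # e.g. reviews posted under a real name on another site
    {"name": "alice", "film": "Brazil", "rating": 5, "date": "2006-03-14"},
    {"name": "alice", "film": "Pi", "rating": 4, "date": "2006-04-02"},
]

def reidentify(anon, public):
    """Map anonymised user IDs to candidate real names wherever the
    (film, rating, date) tuple matches across the two sources."""
    matches = {}
    for a in anon:
        for p in public:
            if (a["film"], a["rating"], a["date"]) == \
               (p["film"], p["rating"], p["date"]):
                matches.setdefault(a["user"], set()).add(p["name"])
    return matches

print(reidentify(anonymised, public_reviews))  # {'u_4821': {'alice'}}
```

With only two overlapping records, the randomised ID collapses back to a single real name – which is why stripping the name column alone is not anonymisation.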


“Different consumer organisations collect information with one-dimensional views of a consumer,” says Hunter Albright, CEO of Beyond Analysis, a consulting firm that specialises in big data.

“It has limited value and risk because of that. Increased exposure happens for an individual when multiple data sources are brought together.”

Companies have long understood how to classify certain data sets as sensitive or non-sensitive for privacy purposes. Information tagging, for example, has been a well-understood technique here. But they don’t always understand that seemingly non-sensitive data sets can become sensitive when combined.

“The way to combat this is to understand and define the desired business outcome before you collect or process the data,” says Elmellas.

“Know the lifecycle. By doing this, you can use the same traditional classification techniques but vary them as required throughout the project.”

Understanding the context in which the information is used is crucial in extracting the information you need from the huge piles of data you have collected.

Companies know this, just as the spooks up in Maryland and at GCHQ do. Context is also important for private enterprises wanting to manipulate big data in a secure way.

These challenges are difficult enough when dealing with big data inside your own domain. What about when you are shipping it out to third parties?

Don’t think you won’t. Logistics chains are a prime example, says Clive Longbottom, founder of analyst firm Quocirca.

Information may move from the retailer to the OEM manufacturer and the fulfilment company, for example. This data enables these stakeholders to deliver a product efficiently and also lets the customer track progress through a self-service portal. But companies must make sure that the information is being used sensibly at all stages.

“The information can (and should) be hashed with an identifier, rather than being stored with the personally identifiable data [PID] as it moves along the chain,” says Longbottom.

Weak links in the chain

Any PID is stored in the company’s database in hashed, encrypted form, he says, and the reference is then matched with a certificate to create a public token. That token is used if any stakeholder needs to see the PID related to the customer order – and the customer has to agree first.

“This also makes the sending of data outside a legal jurisdiction easier,” says Longbottom.

"India can work against the data to their hearts’ content, but they do not see anything that has PID in it. It is only the work packages that get returned and then, through use of the hashed security, add value to the data stored.”

So, you understand the subtle interplay of data classification, business process and risk? Good for you. But sometimes, a mere software bug can send things awry.

One of the biggest data sets of all is Facebook’s. It recently fell foul of privacy rules after it accidentally divulged the personal information of six million users.

When a user invokes Facebook's Download Your Information feature, the social network spits out all of that user’s data, including the phone numbers and email addresses of any contacts they have uploaded to its address book.

The bug added any new address book information for those contacts uploaded by other users, enabling, say, abusive ex-partners to access a person’s new telephone number and email address.

Researchers found that uploading one public email address for an individual could harvest a dozen extra pieces of data about that person. The individual doesn’t even have to be a Facebook user.

Clearly, big data security leaks can come from many places. Misclassification of data is one, as is the ability to combine information from multiple sources. Simple software bugs are a third.

Companies collecting vast buckets of data about individuals may not intend to use it maliciously – but it doesn’t mean that others won’t.

By all means, build massive data sets and use them to find answers to questions you don’t even understand yet. But make sure you get your legal and computational ducks in a row before you start down this road. ®