
Data Anonymisation to prevent Data Leakage

With data leaks constantly in the news, I thought I would write a quick blog post about data anonymisation. The problem seems to be that people think it's perfectly acceptable to walk around with sensitive information on mobile devices and removable media. The solution, according to common thought, is to encrypt those devices. Encryption should be adopted, but only after the more fundamental problem has been addressed: it should be neither possible nor necessary to store raw sensitive data on mobile devices or removable media!

Assuming that you need the data for business intelligence purposes and that the IT department can't or won't (for some good reason) allow this to be done online through a secure connection, then you must anonymise the data first and then encrypt it. Why do you need to know the names, addresses and credit card numbers of your customers when on the road (TK Maxx)? Why do you need the names, addresses, dates of birth, national insurance numbers, salaries and bank details of your employees when away from the office (UPS)? I'm afraid that the only reason I can think of for having the non-anonymised data is for fraudulent purposes (please send me a comment if you can think of a legitimate reason).

Drawing from Pierangela Samarati's session at IPICS, I'll give a very brief overview of data anonymisation. There are two basic techniques to anonymise data: generalisation and suppression. With generalisation, we use a more general value in place of the specific value, e.g. birth year rather than birth date, postal district rather than full postcode (KT1 rather than KT1 2EE), credit card issuer rather than full credit card number (1234 56** **** **** rather than 1234 5678 9012 3456), etc. Alternatively, we can suppress the sensitive information by removing it totally.
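The two techniques can be sketched in a few lines of Python. This is only an illustration: the function names are my own, dates are assumed to be dd/mm/yy strings, postcodes are assumed to have a space between district and the rest, and the first six digits of the card number are kept to stand for the issuer, as in the example above.

```python
def generalise_dob(dob: str) -> str:
    """Generalise a dd/mm/yy date to its year only, e.g. 02/02/64 -> **/**/64."""
    _day, _month, year = dob.split("/")
    return f"**/**/{year}"


def generalise_postcode(postcode: str) -> str:
    """Keep only the postal district, e.g. KT1 2EE -> KT1."""
    return postcode.split()[0]


def generalise_card(card_number: str) -> str:
    """Keep the first six digits and mask the remaining ones."""
    digits = card_number.replace(" ", "")
    masked = digits[:6] + "*" * (len(digits) - 6)
    # Regroup into blocks of four so the result reads like a card number.
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))


def suppress(_value: str) -> str:
    """Suppression removes the value entirely; '-' marks the gap."""
    return "-"


print(generalise_dob("02/02/64"))              # **/**/64
print(generalise_postcode("KT1 2EE"))          # KT1
print(generalise_card("1234 5678 9012 3456"))  # 1234 56** **** ****
print(suppress("Alice"))                       # -
```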

Now there is a whole academic discipline surrounding data anonymity and how to achieve k-anonymity that I won't go into here. I'll just look at what the above means for the kind of data a typical company might want to use for business intelligence, rather than for surveys and data gathering. In this sense, we are trying to protect the privacy of our customers, employees, etc., above all else, rather than settle for the minimum anonymity possible for the data set. I will use the following table to illustrate the anonymisation.

| Name | DoB | Postcode | CC No. |
|---|---|---|---|
| Alice | 02/02/64 | KT1 1AB | 1234 5678 9012 3456 |
| Bob | 16/02/64 | KT1 1BC | 1234 5678 9012 3467 |
| Charlie | 08/04/64 | KT1 1CD | 1234 6778 9012 3478 |
| David | 02/04/66 | KT1 1DE | 1234 6778 9012 3489 |
| Edgar | 04/04/66 | KT1 2AB | 1234 6778 9012 3490 |

There are many schemes for anonymising this data, but I'm going to concentrate on Attribute Generalisation combined with Attribute Suppression. This basically means that we will first generalise each value at the attribute level (i.e. the same level of generalisation will be applied to all values of an attribute). Secondly, we will suppress any attribute, such as the name, or any remaining value that uniquely identifies someone. Using minimal generalisation we would get the following table ('-' denotes suppressed data and '*' denotes generalised data).

| Name | DoB | Postcode | CC No. |
|---|---|---|---|
| - | **/02/64 | KT1 1** | 1234 5678 9012 34** |
| - | **/02/64 | KT1 1** | 1234 5678 9012 34** |
| - | - | KT1 1** | 1234 6778 9012 34** |
| - | **/04/66 | KT1 1** | 1234 6778 9012 34** |
| - | **/04/66 | - | 1234 6778 9012 34** |
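As a rough sketch, the table above can be reproduced by masking each column to the same level and then suppressing any masked value that occurs only once in its column. The helper names and the tuple layout (name, DoB, postcode, card number) are assumptions made for illustration.

```python
from collections import Counter

records = [
    ("Alice", "02/02/64", "KT1 1AB", "1234 5678 9012 3456"),
    ("Bob", "16/02/64", "KT1 1BC", "1234 5678 9012 3467"),
    ("Charlie", "08/04/64", "KT1 1CD", "1234 6778 9012 3478"),
    ("David", "02/04/66", "KT1 1DE", "1234 6778 9012 3489"),
    ("Edgar", "04/04/66", "KT1 2AB", "1234 6778 9012 3490"),
]


def mask_day(dob):
    """Hide the day of a dd/mm/yy date: 02/02/64 -> **/02/64."""
    return "**" + dob[2:]


def mask_last_two(value):
    """Hide the last two characters: KT1 1AB -> KT1 1**."""
    return value[:-2] + "**"


def suppress_unique(values):
    """Replace any value that occurs only once in the column with '-'."""
    counts = Counter(values)
    return [v if counts[v] > 1 else "-" for v in values]


def minimal_generalisation(rows):
    names = ["-"] * len(rows)  # names always identify, so suppress them all
    dobs = suppress_unique([mask_day(r[1]) for r in rows])
    postcodes = suppress_unique([mask_last_two(r[2]) for r in rows])
    cards = suppress_unique([mask_last_two(r[3]) for r in rows])
    return list(zip(names, dobs, postcodes, cards))


table = minimal_generalisation(records)
for row in table:
    print(row)
```

The third and fifth rows come out with a suppressed date and a suppressed postcode respectively, matching the table.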

We have had to suppress Charlie's birthday, because she was the only one born in April 1964. Similarly, Edgar is the only one who lives in KT1 2**. However, we haven't achieved anonymisation here. If we know Charlie was born in April 1964, then this date doesn't appear in the table and only one date is suppressed, so we know her tuple in the table. Similarly, if we know Edgar lives at KT1 2AB, then we know that his is the last tuple. The credit card details should be generalised more than this as well, as others may store the last four digits of a credit card number, so it may be possible to cross reference. Also, why do we need their credit card number for business intelligence? Surely issuer is good enough? So, we can do the following.

| Name | DoB | Postcode | CC No. |
|---|---|---|---|
| - | **/**/64 | KT1 *** | 1234 56** **** **** |
| - | **/**/64 | KT1 *** | 1234 56** **** **** |
| - | **/**/64 | KT1 *** | 1234 67** **** **** |
| - | **/**/66 | KT1 *** | 1234 67** **** **** |
| - | **/**/66 | KT1 *** | 1234 67** **** **** |

This gives us a full count of customers, their geographic locations, age and credit card issuer. I suggest that this is enough information to cover most queries that you may wish to run for business intelligence purposes and, therefore, the maximum that should ever be stored on a mobile device or removable media. This data should also be encrypted.
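To illustrate that the anonymised table still supports this kind of aggregate query, here is a small sketch. The tuple layout and variable names are assumptions; the rows are copied from the final table above.

```python
from collections import Counter

# Rows of the fully generalised table: (name, DoB, postcode, card number).
anonymised = [
    ("-", "**/**/64", "KT1 ***", "1234 56** **** ****"),
    ("-", "**/**/64", "KT1 ***", "1234 56** **** ****"),
    ("-", "**/**/64", "KT1 ***", "1234 67** **** ****"),
    ("-", "**/**/66", "KT1 ***", "1234 67** **** ****"),
    ("-", "**/**/66", "KT1 ***", "1234 67** **** ****"),
]

total_customers = len(anonymised)
# Birth year is the last two characters of the generalised date.
by_birth_year = Counter(dob[-2:] for _, dob, _, _ in anonymised)
# Postal district is the part of the postcode before the space.
by_district = Counter(pc.split()[0] for _, _, pc, _ in anonymised)
# The unmasked card prefix stands for the issuer.
by_issuer = Counter(cc[:7] for _, _, _, cc in anonymised)

print(total_customers)      # 5
print(dict(by_birth_year))  # {'64': 3, '66': 2}
print(dict(by_district))    # {'KT1': 5}
print(dict(by_issuer))      # {'1234 56': 2, '1234 67': 3}
```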
Of course, this doesn't solve all problems. What if you know Edgar was born in 1966? You now know his credit card issuer, which enables you to launch a directed phishing attack on him. Data anonymisation can fail in the face of attack, particularly when the attacker has external knowledge, which you have no control over. The moral is: don't store sensitive data on mobile devices or removable media. If this really can't be avoided, then you must anonymise the data first and then encrypt it.
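The Edgar example can be sketched as a simple lookup against the anonymised table. The function name and tuple layout are assumptions for illustration. The point is that every row for 1966 carries the same issuer prefix, so an attacker who knows only the birth year learns the issuer without needing to know which row is Edgar's.

```python
# Rows of the fully generalised table: (name, DoB, postcode, card number).
anonymised = [
    ("-", "**/**/64", "KT1 ***", "1234 56** **** ****"),
    ("-", "**/**/64", "KT1 ***", "1234 56** **** ****"),
    ("-", "**/**/64", "KT1 ***", "1234 67** **** ****"),
    ("-", "**/**/66", "KT1 ***", "1234 67** **** ****"),
    ("-", "**/**/66", "KT1 ***", "1234 67** **** ****"),
]


def issuers_for_birth_year(rows, year):
    """Return the set of issuer prefixes among rows with the given birth year."""
    return {cc[:7] for _, dob, _, cc in rows if dob.endswith("/" + year)}


# External knowledge: Edgar was born in 1966.
print(issuers_for_birth_year(anonymised, "66"))  # {'1234 67'}

# For 1964 two issuers remain possible, so the same knowledge reveals less.
print(len(issuers_for_birth_year(anonymised, "64")))  # 2
```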
