Skip to main content

Data Anonymisation to prevent Data Leakage

With data leaks constantly in the news, I thought I would write a quick blog post about data anonymisation. The problem seems to be that people think it's perfectly acceptable to walk around with sensitive information on mobile devices and removable media. The solution, according to common thought, is to encrypt those devices. This is a solution that should be adopted, but after the more fundamental problem has been addressed. It should not be possible or necessary to store raw sensitive data on mobile devices or removable media!

Assuming that you need the data for business intelligence purposes and that the IT department can't or won't (for some good reason) allow this to be done online through a secure connection, then you must anonymise the data first and then encrypt it. Why do you need to know the names, addresses and credit card numbers of your customers when on the road TK Maxx? Why do you need the names, addresses, dates of birth, national insurance numbers, salaries and bank details of your employees when away from the office UPS? I'm afraid, that the only reason I can think of to have the non-anonymised data is for fraudulent purposes (please send me a comment if you can think of a legitimate reason).

Drawing from Pierangela Samarati's session at IPICS, I'll give a very brief overview of data anonymisation. There are two basic techniques to anonymise data: generalisation and suppression. With generalisation, we use a more general value in place of the specific value, e.g. birth year rather than birth date, postal district rather than full postcode (KT1 rather than KT1 2EE), credit card issuer rather than full credit card number (1234 56** **** **** rather than 1234 5678 9012 3456), etc. Alternatively, we can suppress the sensitive information by removing it totally.

Now there is a whole academic discipline surrounding data anonymity and how to achieve k-anonymity that I won't go into here. I'll just look at what the above means to data such as a normal company might want to use for business intelligence reasons, rather than surveys and data gathering purposes. In this sense, we are trying to protect the privacy of our customers, employees, etc., above all else, rather than have the minimum anonymity possible for the data set. I will use the following table to illustrate the anonymisation.

NameDoBPostcodeCC No.
Alice02/02/64KT1 1AB1234 5678 9012 3456
Bob16/02/64KT1 1BC1234 5678 9012 3467
Charlie08/04/64KT1 1CD1234 6778 9012 3478
David02/04/66KT1 1DE1234 6778 9012 3489
Edgar04/04/66KT1 2AB1234 6778 9012 3490

There are many schemes for anonymising this data, but I'm going to concentrate on Attribute Generalisation combined with Attribute Suppression. This basically means that we will generalise each value at the attribute level (i.e. the same level of generalisation will be applied to all values). Secondly, we will suppress any attribute that uniquely identifies someone. Using minimal generalisation we would get the following table. ('-' denotes suppressed data and '*' denotes generalised data)

NameDoBPostcodeCC No.
-**/02/64KT1 1**1234 5678 9012 34**
-**/02/64KT1 1**1234 5678 9012 34**
--KT1 1**1234 6778 9012 34**
-**/04/66KT1 1**1234 6778 9012 34**
-**/04/66-1234 6778 9012 34**

We have had to suppress Charlie's birthday, because she was the only one born in April 1964. Similarly, Edgar is the only one who lives in KT1 2**. However, we haven't achieved anonymisation here. If we know Charlie was born in April 1964, then this date doesn't appear in the table and only one date is suppressed, so we know her tuple in the table. Similarly, if we know Edgar lives at KT1 2AB, then we know that his is the last tuple. The credit card details should be generalised more than this as well, as others may store the last four digits of a credit card number, so it may be possible to cross reference. Also, why do we need their credit card number for business intelligence? Surely issuer is good enough? So, we can do the following.

NameDoBPostcodeCC No.
-**/**/64KT1 ***1234 56** **** ****
-**/**/64KT1 ***1234 56** **** ****
-**/**/64KT1 ***1234 67** **** ****
-**/**/66KT1 ***1234 67** **** ****
-**/**/66KT1 ***1234 67** **** ****

This gives us a full count of customers, their geographic locations, age and credit card issuer. I suggest that this is enough information to cover most queries that you may wish to run for business intelligence purposes and, therefore, the maximum that should ever be stored on a mobile device or removable media. This data should also be encrypted.
Of course, this doesn't solve all problems. What if you know Edgar was born in 1966? You now know his credit card issuer, which enables you to launch a directed phishing attack on him. Data Anonymisation can fail in the face of attack, particularly when there is external knowledge, which you have no control over. The moral is, don't store sensitive data on mobile devices or removable media. If this really isn't possible to avoid, then you must anonymise it first and encrypt it.


Popular Posts

Coventry Building Society Grid Card

Coventry Building Society have recently introduced the Grid Card as a simple form of 2-factor authentication. It replaces memorable words in the login process. Now the idea is that you require something you know (i.e. your password) and something you have (i.e. the Grid Card) to log in - 2 things = 2 factors. For more about authentication see this post . How does it work? Very simply is the answer. During the log in process, you will be asked to enter the digits at 3 co-ordinates. For example: c3, d2 and j5 would mean that you enter 5, 6 and 3 (this is the example Coventry give). Is this better than a secret word? Yes, is the short answer. How many people will choose a memorable word that someone close to them could guess? Remember, that this isn't a password as such, it is expected to be a word and a word that means something to the user. The problem is that users cannot remember lots of passwords, so remembering two would be difficult. Also, having two passwords isn't real

Trusteer or no trust 'ere...

...that is the question. Well, I've had more of a look into Trusteer's Rapport, and it seems that my fears were justified. There are many security professionals out there who are claiming that this is 'snake oil' - marketing hype for something that isn't possible. Trusteer's Rapport gives security 'guaranteed' even if your machine is infected with malware according to their marketing department. Now any security professional worth his salt will tell you that this is rubbish and you should run a mile from claims like this. Anyway, I will try to address a few questions I raised in my last post about this. Firstly, I was correct in my assumption that Rapport requires a list of the servers that you wish to communicate with; it contacts a secure DNS server, which has a list already in it. This is how it switches from a phishing site to the legitimate site silently in the background. I have yet to fully investigate the security of this DNS, however, as most

How Reliable is RAID?

We all know that when we want a highly available and reliable server we install a RAID solution, but how reliable actually is that? Well, obviously, you can work it out quite simply as we will see below, but before you do, you have to know what sort of RAID are you talking about, as some can be less reliable than a single disk. The most common types are RAID 0, 1 and 5. We will look at the reliability of each using real disks for the calculations, but before we do, let's recap on what the most common RAID types are. Common Types of RAID RAID 0 is the Stripe set, which consists of 2 or more disks with data written in equal sized blocks to each of the disks. This is a fast way of reading and writing data to disk, but it gives you no redundancy at all. In fact, RAID 0 is actually less reliable than a single disk, as all the disks are in series from a reliability point of view. If you lose one disk in the array, you've lost the whole thing. RAID 0 is used purely to speed up dis