
Data Anonymisation to prevent Data Leakage

With data leaks constantly in the news, I thought I would write a quick blog post about data anonymisation. The problem seems to be that people think it's perfectly acceptable to walk around with sensitive information on mobile devices and removable media. The solution, according to common thought, is to encrypt those devices. Encryption should be adopted, but only after the more fundamental problem has been addressed: it should be neither possible nor necessary to store raw sensitive data on mobile devices or removable media!

Assuming that you need the data for business intelligence purposes and that the IT department can't or won't (for some good reason) allow this to be done online through a secure connection, then you must anonymise the data first and then encrypt it. Why do you need to know the names, addresses and credit card numbers of your customers when on the road (TK Maxx)? Why do you need the names, addresses, dates of birth, national insurance numbers, salaries and bank details of your employees when away from the office (UPS)? I'm afraid that the only reason I can think of to have the non-anonymised data is for fraudulent purposes (please send me a comment if you can think of a legitimate reason).

Drawing from Pierangela Samarati's session at IPICS, I'll give a very brief overview of data anonymisation. There are two basic techniques to anonymise data: generalisation and suppression. With generalisation, we use a more general value in place of the specific value, e.g. birth year rather than birth date, postal district rather than full postcode (KT1 rather than KT1 2EE), credit card issuer rather than full credit card number (1234 56** **** **** rather than 1234 5678 9012 3456), etc. Alternatively, we can suppress the sensitive information by removing it totally.
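
As a rough illustration of the two techniques, here is a minimal Python sketch. The function names are my own, the date format is assumed to be dd/mm/yy, and keeping the first six card digits simply follows the example above; '-' is used as the marker for a suppressed value.

```python
def generalise_dob(dob: str) -> str:
    """Reduce a dd/mm/yy date to its year, e.g. '02/02/64' -> '**/**/64'."""
    _day, _month, year = dob.split("/")
    return f"**/**/{year}"


def generalise_postcode(postcode: str) -> str:
    """Keep only the postal district, e.g. 'KT1 2EE' -> 'KT1'."""
    return postcode.split()[0]


def generalise_card(number: str, keep: int = 6) -> str:
    """Keep the first `keep` digits and mask every later digit with '*'."""
    masked = []
    digits_seen = 0
    for ch in number:
        if ch.isdigit():
            digits_seen += 1
            masked.append(ch if digits_seen <= keep else "*")
        else:
            masked.append(ch)  # preserve the spacing between digit groups
    return "".join(masked)


def suppress(_value: str) -> str:
    """Remove the value entirely; '-' marks a suppressed field."""
    return "-"


print(generalise_dob("02/02/64"))              # **/**/64
print(generalise_postcode("KT1 2EE"))          # KT1
print(generalise_card("1234 5678 9012 3456"))  # 1234 56** **** ****
print(suppress("Alice"))                       # -
```

Generalisation keeps some analytical value (year, district, issuer), whereas suppression keeps none, which is why it is reserved for values that identify a person on their own.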

There is a whole academic discipline surrounding data anonymity and how to achieve k-anonymity, which I won't go into here. I'll just look at what the above means for the kind of data a normal company might want to use for business intelligence, rather than for surveys and data gathering. In this sense, we are trying to protect the privacy of our customers, employees, etc., above all else, rather than settle for the minimum anonymity possible for the data set. I will use the following table to illustrate the anonymisation.

Name     DoB       Postcode  CC No.
Alice    02/02/64  KT1 1AB   1234 5678 9012 3456
Bob      16/02/64  KT1 1BC   1234 5678 9012 3467
Charlie  08/04/64  KT1 1CD   1234 6778 9012 3478
David    02/04/66  KT1 1DE   1234 6778 9012 3489
Edgar    04/04/66  KT1 2AB   1234 6778 9012 3490

There are many schemes for anonymising this data, but I'm going to concentrate on Attribute Generalisation combined with Attribute Suppression. This basically means that we will generalise each value at the attribute level (i.e. the same level of generalisation will be applied to every value of a given attribute). Secondly, we will suppress any value that uniquely identifies someone. Using minimal generalisation we would get the following table ('-' denotes suppressed data and '*' denotes generalised data).

Name  DoB       Postcode  CC No.
-     **/02/64  KT1 1**   1234 5678 9012 34**
-     **/02/64  KT1 1**   1234 5678 9012 34**
-     -         KT1 1**   1234 6778 9012 34**
-     **/04/66  KT1 1**   1234 6778 9012 34**
-     **/04/66  -         1234 6778 9012 34**
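
The table above can be reproduced with a short Python sketch. This is only one way of expressing the procedure, under my own assumptions: each column is masked to the same level for every row, and any resulting value that occurs exactly once in its column is replaced with '-'. The helper names are hypothetical.

```python
from collections import Counter

# The example customer table from this post: name, DoB, postcode, card number.
customers = [
    ("Alice",   "02/02/64", "KT1 1AB", "1234 5678 9012 3456"),
    ("Bob",     "16/02/64", "KT1 1BC", "1234 5678 9012 3467"),
    ("Charlie", "08/04/64", "KT1 1CD", "1234 6778 9012 3478"),
    ("David",   "02/04/66", "KT1 1DE", "1234 6778 9012 3489"),
    ("Edgar",   "04/04/66", "KT1 2AB", "1234 6778 9012 3490"),
]


def mask_day(dob):
    """Generalise dd/mm/yy to **/mm/yy."""
    return "**" + dob[2:]


def mask_last_two(value):
    """Replace the last two characters with '**'."""
    return value[:-2] + "**"


def suppress_unique(values):
    """Replace any value that occurs only once in the column with '-'."""
    counts = Counter(values)
    return [v if counts[v] > 1 else "-" for v in values]


def anonymise_minimal(rows):
    names, dobs, postcodes, cards = zip(*rows)
    name_col = suppress_unique(list(names))  # every name is unique
    dob_col = suppress_unique([mask_day(d) for d in dobs])
    pc_col = suppress_unique([mask_last_two(p) for p in postcodes])
    cc_col = suppress_unique([mask_last_two(c) for c in cards])
    return list(zip(name_col, dob_col, pc_col, cc_col))


minimal_table = anonymise_minimal(customers)
for row in minimal_table:
    print(row)
```

Note that the names need no special handling: every name occurs once, so the uniqueness rule suppresses the whole column.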

We have had to suppress Charlie's birthday, because she was the only one born in April 1964. Similarly, we have had to suppress Edgar's postcode, because he is the only one who lives in KT1 2**. However, we haven't achieved anonymisation here. If we know Charlie was born in April 1964, we can see that this date doesn't appear in the table and that only one date is suppressed, so the row with the suppressed date must be hers. Similarly, if we know Edgar lives at KT1 2AB, then we know that his is the last tuple. The credit card details should be generalised more than this as well, as others may store the last four digits of a credit card number, so it may be possible to cross-reference. Also, why do we need their credit card number for business intelligence? Surely the issuer is good enough? So, we can do the following.

Name  DoB       Postcode  CC No.
-     **/**/64  KT1 ***   1234 56** **** ****
-     **/**/64  KT1 ***   1234 56** **** ****
-     **/**/64  KT1 ***   1234 67** **** ****
-     **/**/66  KT1 ***   1234 67** **** ****
-     **/**/66  KT1 ***   1234 67** **** ****
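
One rough way to compare this table with the previous one is to count how many rows share each combination of DoB and postcode. This is the intuition behind the k-anonymity mentioned earlier, not a full treatment of it. The sketch below assumes that DoB and postcode are the attributes an outsider might already know, which is my own choice for illustration.

```python
from collections import Counter

# (DoB, Postcode) pairs copied from the two anonymised tables in this post.
minimal = [
    ("**/02/64", "KT1 1**"),
    ("**/02/64", "KT1 1**"),
    ("-",        "KT1 1**"),
    ("**/04/66", "KT1 1**"),
    ("**/04/66", "-"),
]
generalised = [
    ("**/**/64", "KT1 ***"),
    ("**/**/64", "KT1 ***"),
    ("**/**/64", "KT1 ***"),
    ("**/**/66", "KT1 ***"),
    ("**/**/66", "KT1 ***"),
]


def smallest_group(rows):
    """Size of the smallest set of rows sharing identical values."""
    return min(Counter(rows).values())


print(smallest_group(minimal))      # 1
print(smallest_group(generalised))  # 2
```

A smallest group of 1 means at least one row stands alone and can be singled out; in the more heavily generalised table every row is indistinguishable from at least one other.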

This gives us a full count of customers, their geographic locations, age and credit card issuer. I suggest that this is enough information to cover most queries that you may wish to run for business intelligence purposes and, therefore, the maximum that should ever be stored on a mobile device or removable media. This data should also be encrypted.

Of course, this doesn't solve all problems. What if you know Edgar was born in 1966? Both 1966 records show the same issuer prefix, so you now know his credit card issuer, which enables you to launch a directed phishing attack on him. Data anonymisation can fail in the face of attack, particularly when the attacker has external knowledge, over which you have no control. The moral is: don't store sensitive data on mobile devices or removable media. If this really can't be avoided, then you must anonymise it first and encrypt it.
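
The Edgar example can be checked mechanically. The following Python sketch (helper names are hypothetical) groups the final table by generalised DoB and reports any group in which every row has the same card prefix; anyone known to belong to such a group has their issuer exposed.

```python
from collections import defaultdict

# (DoB, CC No.) pairs copied from the final anonymised table in this post.
final_table = [
    ("**/**/64", "1234 56** **** ****"),
    ("**/**/64", "1234 56** **** ****"),
    ("**/**/64", "1234 67** **** ****"),
    ("**/**/66", "1234 67** **** ****"),
    ("**/**/66", "1234 67** **** ****"),
]


def issuers_by_birth_year(rows):
    """Map each generalised DoB to the set of card prefixes seen with it."""
    groups = defaultdict(set)
    for dob, card in rows:
        groups[dob].add(card)
    return dict(groups)


def exposed_groups(rows):
    """Return the DoB groups whose rows all share a single card prefix."""
    return {
        dob: next(iter(cards))
        for dob, cards in issuers_by_birth_year(rows).items()
        if len(cards) == 1
    }


print(exposed_groups(final_table))  # {'**/**/66': '1234 67** **** ****'}
```

The 1964 group contains two different prefixes, so knowing someone's birth year there narrows the issuer down but does not reveal it outright.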
