Overview of Data Masking Methods

“Community College Student finds his Social Security Number through Google search.” No, that isn’t a headline from The Onion; it actually happened. Community college staff members were testing a new online application, and the test used files of unaltered sensitive data on an unsecured server.

While this case was especially egregious, the use of unmasked production data in test and development environments is common because developers and testers need realistic datasets in order to work effectively. To avoid exposing Social Security numbers, payroll information, personal addresses, and other sensitive data to the wrong people, developers should perform data masking.

“Data masking” means altering data from its original state to protect it. There are a variety of methods that are commonly used. Let’s look at some examples.

Lookup Substitution Method

Problem: Let’s say we have a production database with employee Social Security Numbers. This is information you would not want sitting in a test environment.

Solution: Add a lookup table in the production environment that maps each real value to an alias. Only the aliases are copied to the test environment; the lookup table itself stays in production.

Result: The testing environment does not have sensitive employee information but still has realistic data.
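As a rough illustration of the lookup substitution described above, here is a minimal Python sketch. The function names, the row layout, and the sample Social Security Numbers are all hypothetical. The "000" prefix on each alias is a choice made for this sketch (the Social Security Administration does not issue area number 000), so an alias keeps the familiar format without matching a real number.

```python
import secrets


def build_lookup(real_values):
    """Map each distinct real value to a unique, random alias in SSN format."""
    lookup = {}
    used = set()
    for value in real_values:
        if value in lookup:
            continue  # the same real value always gets the same alias
        alias = None
        while alias is None or alias in used:
            # Assumption for this sketch: aliases start with the unissued area number 000.
            alias = f"000-{secrets.randbelow(100):02d}-{secrets.randbelow(10000):04d}"
        used.add(alias)
        lookup[value] = alias
    return lookup


def mask_rows(rows, lookup):
    """Return copies of the rows with the real SSN swapped for its alias."""
    return [{**row, "ssn": lookup[row["ssn"]]} for row in rows]


# Fictitious sample data, not taken from the article.
production_rows = [
    {"name": "Employee A", "ssn": "123-45-6789"},
    {"name": "Employee B", "ssn": "987-65-4321"},
]

lookup = build_lookup(row["ssn"] for row in production_rows)  # stays in production
test_rows = mask_rows(production_rows, lookup)                # goes to the test environment
```

Because the lookup table can translate aliases back to real values, it needs the same protection as the production data itself.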

Encryption

Problem: We have the same issue as before, but this time we don’t want to rely on a lookup table, which might get compromised.

Solution: The data is encrypted. Only individuals who need to see the data will be given the password. Encryption uses complex algorithms to change information into an unreadable state, and the data stays nonsensical until it is decrypted back to its original state.

Result: Individuals who need to see the data will be able to see it while others will not be given access. Note: If encryption is used without any other data masking technique, any sensitive information will be viewable once decrypted.

Redaction Method

Problem: In this example, customer credit card numbers are listed. This information is sensitive and is not needed in the testing and development environment.

Solution/result: Replace the sensitive data with a generic value. Use this method only when the real value is not needed for development or QA purposes.
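A minimal Python sketch of redaction, assuming the generic value is simply an "X" in place of every digit. The function name and the fill character are illustrative choices, and the sample value is a widely used test card number rather than real customer data.

```python
def redact_digits(value, fill="X"):
    """Replace every digit with a fill character, keeping separators as they are."""
    return "".join(fill if ch.isdigit() else ch for ch in value)


print(redact_digits("4111-1111-1111-1111"))  # XXXX-XXXX-XXXX-XXXX
```

Keeping the separators preserves the field's length and layout, which can matter if the application under test validates the format.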

Averaging

Problem: In some instances you may have sensitive data that you don’t want to show for each individual but that must still be correct in total and on average. For example, the QA team may need to verify that the table of money allocated for salaries matches the total in the annual summary.

Consider a table of employee salaries. This information is highly confidential and should be shared only on a need-to-know basis.

Solution: Rather than letting employees view their colleagues’ salaries, change every Annual Salary value to the average salary (64,750). This preserves the total for the column (259,000) while protecting the individual values.

Result: Employees are not able to see the salaries of their fellow colleagues. All they see is the average salary.
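The arithmetic behind averaging can be sketched in a few lines of Python. The four individual salaries below are invented for illustration; they were chosen only so that they add up to the 259,000 total and 64,750 average quoted above.

```python
# Hypothetical salaries; only the total and the average come from the article.
salaries = {
    "Employee A": 52_000,
    "Employee B": 61_000,
    "Employee C": 68_000,
    "Employee D": 78_000,
}

average = sum(salaries.values()) / len(salaries)

# Every employee now shows the same value: the column average.
masked = {name: average for name in salaries}

# The column total that QA needs to reconcile is unchanged.
assert sum(masked.values()) == sum(salaries.values())
```

If the average is not a round number, the masked total can drift by a few cents after rounding, so a real implementation would need a rule for handling the remainder.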

Shuffling

Problem: Similar to the above, but with a twist. In some instances you may have sensitive data that needs protecting, but you still need unique values.  For example, the QA team may need to verify that the table which has money allocated for salaries matches the total of the annual summary amount.

Solution: Shuffle the Annual Salary values among the rows so that each value is separated from its original employee. (Note: This method is more effective with a larger dataset.)

Result: Employees are able to see the salaries table with unique salary values but are still not able to tell which salary belongs to which employee.
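Here is a minimal Python sketch of shuffling one column, using the standard library's random module. The function name, the column name, and the sample rows are hypothetical.

```python
import random


def shuffle_column(rows, column, rng=random):
    """Return copies of the rows with one column's values randomly reassigned."""
    values = [row[column] for row in rows]
    shuffled = rng.sample(values, k=len(values))  # a random permutation of the values
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


# Hypothetical sample data.
rows = [
    {"name": "Employee A", "annual_salary": 52_000},
    {"name": "Employee B", "annual_salary": 61_000},
    {"name": "Employee C", "annual_salary": 68_000},
    {"name": "Employee D", "annual_salary": 78_000},
]

masked_rows = shuffle_column(rows, "annual_salary")

# The set of values, and therefore the column total, is preserved.
assert sorted(r["annual_salary"] for r in masked_rows) == sorted(r["annual_salary"] for r in rows)
```

A plain random permutation can leave some values on their original rows, and with only a handful of rows the true pairing is easy to guess, which is one reason the method works better on larger datasets.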

Date Aging

Problem: You have contract active dates you don’t want known publicly.

Solution: A policy can be set individually for each date field. For example, contract active dates can be set back by 500 days. The aged dates must stay within an acceptable range so that there are no negative ramifications for individuals using these masked dates.

Result: The actual contract active dates are not visible in the QA and Development environment. Note: One of the weaknesses of this methodology is that once one record is compromised, all records are compromised. If it is found out that the actual date for ABC Partners is 1/5/2008 then it is easy to deduce that the policy calls for a date aging of 500 days.
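The 500-day policy, and the weakness noted above, can be sketched in Python with the standard datetime module. The sketch reads 1/5/2008 as January 5, 2008; the second contract and all function and variable names are hypothetical.

```python
from datetime import date, timedelta

AGING_OFFSET = timedelta(days=500)  # the policy for this date field


def age_date(actual, offset=AGING_OFFSET):
    """Set a date back by a fixed offset."""
    return actual - offset


contracts = {
    "ABC Partners": date(2008, 1, 5),  # from the article, read as month/day/year
    "XYZ Corp": date(2009, 3, 17),     # hypothetical second record
}
masked = {name: age_date(actual) for name, actual in contracts.items()}

# The weakness: one known (actual, masked) pair reveals the offset...
leaked_offset = contracts["ABC Partners"] - masked["ABC Partners"]

# ...and the same offset then unmasks every other record.
recovered = masked["XYZ Corp"] + leaked_offset
assert recovered == contracts["XYZ Corp"]
```

Varying the offset per record would blunt this attack, but it would also change the intervals between dates, so whether that trade-off is acceptable depends on what the test environment needs.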

These are a few of the methods used when masking data and should be part of any organization’s best practices.

