Destructive Data Masking

Mar 05, 2024

Oh boy, I’m so sorry if you are desperate enough to have to do what I spent last weekend doing and I sure hope some AI does not unknowingly feed you this, but here we go.

Microsoft SQL Server has this wonderful feature called Dynamic data masking . It allows you to mask sensitive data from certain SQL users, while still retaining the original values.

Data masking is a useful way to mark certain data as sensitive. However, for some reason or another you or someone close to you might come to the conclusion that you need to anonymize the database permanently. If your data is already masked, we can use that information to apply the masking destructively.

Here’s the script

First we need to create some functions

The first masking function will emulate the ’email()’ masking method

CREATE FUNCTION dbo.MaskEmailString(@OriginalEmail VARCHAR(100)) RETURNS VARCHAR(100) AS 
BEGIN 
    RETURN LEFT(@OriginalEmail, 1) + 'xxx@xxxx.com'
END; 

The Second will emulate the different ‘partial()’ masking methods

CREATE FUNCTION dbo.MaskPartialString (@input NVARCHAR(MAX),@prefix int,@custom VARCHAR(64),@postfix int) RETURNS NVARCHAR(MAX) AS
BEGIN
    RETURN LEFT(@input, @prefix) + @custom + RIGHT(@input, @postfix)
END;

Make sure to tweak these if they don’t completely fit your schema. SQL Server will tell you if something doesn’t add up.

Generating the update statements


SELECT
	tbl.[name] as table_name,
	c.[name] as column_name,
	c.masking_function as mask,
	co.DATA_TYPE,
	co.CHARACTER_MAXIMUM_LENGTH as max_length INTO #Masks
FROM
	sys.masked_columns AS c
	JOIN sys.tables AS tbl ON c.[object_id] = tbl.[object_id]
	JOIN INFORMATION_SCHEMA.COLUMNS AS co ON co.COLUMN_NAME = c.name
	AND tbl.name = co.TABLE_NAME
WHERE c.is_masked = 1
ORDER BY c.name, tbl.name;

--Finally We generate a list of UPDATE statements for each masked column in the database
SELECT
	CASE
		--Text default = xxxx
		WHEN mask = 'default()'	AND DATA_TYPE in ('char','nchar','varchar','nvarchar','text','ntext'
		) THEN FORMATMESSAGE('UPDATE %s SET %s = ''%s'';',table_name,column_name,column_name,REPLICATE('x', IIF(max_length>4, 4, max_length)))
		--Numeric defaults = 0
		WHEN mask = 'default()'	AND DATA_TYPE in ('bigint','bit','decimal','int','money','numeric','smallint','smallmoney','tinyint','float','real'
		) THEN FORMATMESSAGE('UPDATE %s SET %s = 0;',table_name,column_name)
		--Date defaults = 1900-01-01
		WHEN mask = 'default()'
		AND DATA_TYPE in ('date','datetime2','datetime','datetimeoffset','smalldatetime','time'
		) THEN FORMATMESSAGE('UPDATE %s SET %s = ''1900-01-01 00:00:00.0000000'';', table_name, column_name)
		--Email uses the function declared above
		WHEN mask = 'email()' THEN FORMATMESSAGE('UPDATE %s SET %s = dbo.MaskEmailString(%s);',table_name,column_name,column_name)
        --Random will set all values in a column to the same random number using the column name as a seed
		WHEN mask LIKE 'random(%' THEN FORMATMESSAGE('UPDATE %s SET %s = RAND(CAST(CAST(%s as varbinary)as int));',table_name,column_name,column_name)
        --Partial will try to parse the existing mask and reconstruct it using the function declared above
		WHEN mask LIKE 'partial(%' THEN FORMATMESSAGE('UPDATE %s SET %s = dbo.MaskPartialString(%s,%s,''%s'',%s);',
            table_name,
            column_name,
            column_name,
            --Prefix argument
			SUBSTRING(mask,CHARINDEX('(', mask) + 1,CHARINDEX(',', mask) - CHARINDEX('(', mask) -1),
            --Custom string argument
			SUBSTRING(mask,CHARINDEX('"', mask) + 1,CHARINDEX('"', mask, CHARINDEX('"', mask) + 1) - CHARINDEX('"', mask) -1), 
            --Postfix argument
			SUBSTRING(mask,CHARINDEX(',', mask, CHARINDEX(',', mask) + 1) + 1,CHARINDEX(')', mask) - CHARINDEX(',', mask, CHARINDEX(',', mask) + 1) -1)
		)
		ELSE null
	END
FROM
	#Masks;

NOTES

  • I’m not going to write a cursor loop that executes this automatically. Run function to generate the individual UPDATE statements and check them for issues.

  • Some esoteric masking strategies are funky and react strangely with compound datatypes.

  • FINAL REMINDER: These will destructively and permanently alter your data!

  • random() is not really supported as I couldn’t find a way to randomize a value for every row. The Random case here will set all values in a column to the same random number using the column name as a seed

sql
Creative

Yasen Dinkov

Automating Azure with Azure Automation (Accounts)