Data Obfuscation through Random String replacement in SQL Server

As a SQL Server developer I often need test databases that resemble my client databases as much as possible. But with the increasingly and correct, in my opinion, concern over data privacy you often need to obfuscate data. This is also called scrubbing the data. I have a common table that is used as a repository for a lot of client data. I need to write a script that would make the data obscured but keep the weight of the data there so I could continue testing. I use the following script to randomly pick from a 1000 Unicode characters to build strings to replace those customers had added. Here is the heart of it (in a loop with the length of the target string as its limit):

declare @somevalueid int;
declare @intstringlength int;
declare @intcounter int;
declare @mystring nvarchar(50);
declare @mynewstring nvarchar(50);

select @mystring=mystring from tblguitar where guitarid_pk=@somevalueid
select @intstringlength= len(@mystring)

while @intcounter > 0
begin
set @mynewstring=@mynewstring+nchar((cast(rand()*1000 as int))+65)
set @intcounter = @intcounter-1
end
update tblguitar set mystring=@mynewstring where guitarid_pk=@somevalueid

It should prove easy to adapt this into your code and build random strings of any length.
I could, too, restrict which Unicode or regular ASCII characters I want simply by figuring out the character range and code numbers involved. I used Unicode because it’s so big and I like the weird characters I come up with. With a large database it’s unlikely I’ll have 2 entries the same using 1000 characters; even so you may need to check if the entry already exists if you, like I, require unique entries. For those with less worry about that, here’s two examples limiting you to the normal alphabetic range:

select nchar((cast(rand()*1000 as int)%26)+97) –just a-z
select nchar((cast(rand()*1000 as int)%26)+65) –just A-Z

In the end, it’s the shape of the data that is important to me, not the actual meaning. If I can retain that shape and work on without worry that I’m compromising my client’s privacy, so much the better.

About these ads

9 thoughts on “Data Obfuscation through Random String replacement in SQL Server

  1. Be careful by the way. If you copy my a-z or A-Z code above, the comment characters seem to be changed to an em dash (a long dash) so what was:

    ‘-’ then ‘-’ then ‘just a-z’

    became

    - then ‘just a-z’

    and it isn’t a comment any longer

  2. Hey..

    There are several tools in the market today that accomplish most of the tasks – Integrity constraints, mapping rules, etc., and they can also quickly create a clone of your production DB with artificially generated data. The new DB created will look exactly like your prod DB but with non-existent customer names and SSNs – although they look real. Here are some of the well known tools.
    Optim
    Datavantage
    Camouflage
    …..

  3. I agree those are out there but do you trust them enough to stake your job on? I will always prefer simple code that I have control over when it comes to my customer’s data.

  4. you can actuallu use the same data and obscure it in a random manner that is irreversible. A simple code wont do though.

  5. I would like to do something like this that I can fire at a column in a table and work through row by row till the end substituting each character in the text field string with a random character whilst preserving spaces. I need to preserve spaces because of text wrapping in a report. I will create a translated table to hold the results. I want to create a Stored Procedure to do this where I want to pass a table name and column name. Can you provide a link to some guidance for me?

  6. Stephen,

    To preserve spaces you would need to create a loop that parses through each character and just passes through spaces. You’d also have to restrict the pool of random characters so it doesn’t include a space.

    Hope that helps.

  7. Paul,

    I did basically the same thing before I read far enough down to see your post. Sadly your link doesn’t seem to work anymore. I went a step further in differentiating vowels and consonants. I leave “y” alone as well as spaces, etc. This preserves the shape of the data pretty well.

    Below was thrown together for internal use and not robustly tested or necessarily optimized.

    Karl

    CREATE VIEW NewIDForFunction AS
    SELECT NEWID() idNew
    GO

    CREATE FUNCTION RandomLetter() RETURNS CHAR(1) BEGIN
    DECLARE @cReturn CHAR
    SELECT @cReturn=CHAR(ABS(CHECKSUM(idNew))%26+65) FROM NewIDForFunction
    RETURN @cReturn
    END
    GO

    CREATE FUNCTION RandomVowel_NotY() RETURNS CHAR(1) BEGIN
    DECLARE @cReturn CHAR
    SELECT @cReturn=SUBSTRING(‘AEIOU’,ABS(CHECKSUM(idNew))%5+1,1)
    FROM NewIDForFunction
    RETURN @cReturn
    END
    GO

    CREATE FUNCTION RandomConsonate_NotY() RETURNS CHAR(1) BEGIN
    DECLARE @cReturn CHAR
    SELECT @cReturn=SUBSTRING(‘BCDFGHJKLMNPQRSTVWXZ’,ABS(CHECKSUM(idNew))%20+1,1)
    FROM NewIDForFunction
    RETURN @cReturn
    END
    GO

    CREATE FUNCTION RandomDigit() RETURNS CHAR(1) BEGIN
    DECLARE @cReturn CHAR
    SELECT @cReturn=CAST(ABS(CHECKSUM(idNew))%10 AS CHAR(1)) FROM NewIDForFunction
    RETURN @cReturn
    END
    GO

    CREATE FUNCTION RandomReplace_Constrained(@sIn VARCHAR(255))
    RETURNS VARCHAR(255) AS
    BEGIN

    DECLARE @cbIn TINYINT
    DECLARE @sResult VARCHAR(50)
    DECLARE @cbProcessed TINYINT
    DECLARE @sNextChar CHAR(1)
    DECLARE @nNextChar TINYINT

    SET @cbProcessed = 1
    SET @sResult = ”
    SELECT @cbIn=LEN(@sIn)

    WHILE @cbProcessed <= @cbIn BEGIN
    SET @sNextChar = SUBSTRING(@sIn,@cbProcessed,1)
    SET @nNextChar = ASCII(@sNextChar)
    SELECT @sResult = @sResult +
    CASE
    WHEN @nNextChar BETWEEN 48 AND 57 THEN dbo.RandomDigit()
    WHEN @nNextChar IN (65,69,73,79,85) THEN dbo.RandomVowel_NotY()
    WHEN @nNextChar IN (89,121) THEN @sNextChar
    WHEN @nNextChar BETWEEN 65 AND 90 THEN dbo.RandomConsonate_NotY()
    WHEN @nNextChar IN (97,101,105,111,117) THEN LOWER(dbo.RandomVowel_NotY())
    WHEN @nNextChar BETWEEN 97 AND 122 THEN LOWER(dbo.RandomConsonate_NotY())
    ELSE @sNextChar
    END
    SET @cbProcessed = @cbProcessed + 1
    END

    RETURN @sResult

    END

  8. Pingback: Werte mit SQL unkenntlich machen – die perfekte Demo Umgebung - Willfried Färber - Blogs - triBLOG

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s