As a SQL Server developer I often need test databases that resemble my client databases as much as possible. But with the increasingly and correct, in my opinion, concern over data privacy you often need to obfuscate data. This is also called scrubbing the data. I have a common table that is used as a repository for a lot of client data. I need to write a script that would make the data obscured but keep the weight of the data there so I could continue testing. I use the following script to randomly pick from a 1000 Unicode characters to build strings to replace those customers had added. Here is the heart of it (in a loop with the length of the target string as its limit):
declare @somevalueid int;
declare @intstringlength int;
declare @intcounter int;
declare @mystring nvarchar(50);
declare @mynewstring nvarchar(50);select @mystring=mystring from tblguitar where guitarid_pk=@somevalueid
select @intstringlength= len(@mystring)while @intcounter > 0
begin
set @mynewstring=@mynewstring+nchar((cast(rand()*1000 as int))+65)
set @intcounter = @intcounter-1
end
update tblguitar set mystring=@mynewstring where guitarid_pk=@somevalueid
It should prove easy to adapt this into your code and build random strings of any length.
I could, too, restrict which Unicode or regular ASCII characters I want simply by figuring out the character range and code numbers involved. I used Unicode because it’s so big and I like the weird characters I come up with. With a large database it’s unlikely I’ll have 2 entries the same using 1000 characters; even so you may need to check if the entry already exists if you, like I, require unique entries. For those with less worry about that, here’s two examples limiting you to the normal alphabetic range:
select nchar((cast(rand()*1000 as int)%26)+97) –just a-z
select nchar((cast(rand()*1000 as int)%26)+65) –just A-Z
In the end, it’s the shape of the data that is important to me, not the actual meaning. If I can retain that shape and work on without worry that I’m compromising my client’s privacy, so much the better.
2007-10-18 at 1:47 pm |
Be careful by the way. If you copy my a-z or A-Z code above, the comment characters seem to be changed to an em dash (a long dash) so what was:
‘-’ then ‘-’ then ‘just a-z’
became
- then ‘just a-z’
and it isn’t a comment any longer
2008-01-13 at 5:51 pm |
Hey..
There are several tools in the market today that accomplish most of the tasks – Integrity constraints, mapping rules, etc., and they can also quickly create a clone of your production DB with artificially generated data. The new DB created will look exactly like your prod DB but with non-existent customer names and SSNs – although they look real. Here are some of the well known tools.
Optim
Datavantage
Camouflage
…..
2008-01-14 at 10:02 pm |
I agree those are out there but do you trust them enough to stake your job on? I will always prefer simple code that I have control over when it comes to my customer’s data.
2008-04-14 at 12:58 am |
you can actuallu use the same data and obscure it in a random manner that is irreversible. A simple code wont do though.
2009-01-26 at 3:46 pm |
I would like to do something like this that I can fire at a column in a table and work through row by row till the end substituting each character in the text field string with a random character whilst preserving spaces. I need to preserve spaces because of text wrapping in a report. I will create a translated table to hold the results. I want to create a Stored Procedure to do this where I want to pass a table name and column name. Can you provide a link to some guidance for me?
2009-01-28 at 12:05 pm |
Stephen,
To preserve spaces you would need to create a loop that parses through each character and just passes through spaces. You’d also have to restrict the pool of random characters so it doesn’t include a space.
Hope that helps.
2009-04-1 at 9:08 pm |
Many thanks!
I went a little further to randomize existing data, and keep some formatting like;
- randomizing a same case character,
- randomizing a number only, and
- retaining spaces and other characters (for phone number obfuscating).
Check out http://www.selectsystems.ca/blog/post/Randomize-Data-in-a-SQL-Table-Column.aspx.
Oh ya, it accepts a table and the field to obfuscate as inputs.
Thanks again for the post!