Replace real surnames and forenames with random selections

  • I need to be able to replace existing surnames and forenames in a data table with names selected at random from forename and surname tables I have created from list of surnames and forenames made available by the US Census Bureau. I have a large volume of data, so I would like to be able to create an update query that does it for me in one go if possible. This is so I can give test data to an offshore developer without contravening data protection legislation.

    The scenario is as follows:

    Main data table: Subscriber

    ID - int, Identity(1,1)

    Surname Varchar(15)

    Forename Varchar(15)

    plus many other fields that are not relevant

    500,000+ records

    Surname Table: Surname

    ID int, Identity (1,1)

    Surname Varchar(15)

    89,000+ records

    Forename Table: Forename

    ID int Identity (1,1)

    Forename Varchar(15)

    2,500+ Records

    I want to replace each Surname in the Subscriber table with a randomly selected record from the Surname table, and replace each forename with a randomly selected record from the Forename table, ideally without resorting to a cursor. e.g John Smith might become Francisco Gotti, Mary Williams becomes Albert Jones etc - I do not have to match name genders

    All suggestions gratefully received.

  • i believe this will do what you want...obviously test this first!

    --random first land last names,

    ;With MyRandomNamePairs AS

    ( SELECT

    ROW_NUMBER() OVER(ORDER BY Surname.ID) AS N,

    Forename.Forename, --column name same as tablename? yuck!

    Surname.Surname --column name same as tablename? yuck!

    FROM Forename

    CROSS JOIN Surname )

    ,CTERANDOMIZEDDATA AS

    (

    SELECT Subscriber.ID,

    MyRandomNamePairs.Forename,

    MyRandomNamePairs,Surname,

    ROWNUM = ROW_NUMBER() OVER (PARTITION BY Subscriber.ID ORDER BY NEWID())

    FROM Subscriber --THE TABLE WITH THE KEY

    CROSS JOIN MyRandomNamePairs --THE TABLE WITH THE RANDOM VALUE

    )

    UPDATE Subscriber --THE JOINING TABLE

    SET Subscriber.Forename = CTERANDOMIZEDDATA.Forename, --THE RANDOM VALUE

    Subscriber.Surname = CTERANDOMIZEDDATA.Surname --THE RANDOM VALUE

    FROM CTERANDOMIZEDDATA

    WHERE Subscriber.ID=CTERANDOMIZEDDATA.ID --FOR EACH KEY

    AND ROWNUM = 1 --LIMITS TO ONE ARBITRARY VALUE DUE TO THE ORDER BY NEWID

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell - thanks for your response

    With the exception of a small typo (a comma in MyRandomNamePairs,Surname instead of a period), it does appear to work, but the performance is not good - I restricted it to updating the first 100 records in the subscription table by adding where id < 101 to the selection criteria in the update statement, and it was still executing after about 6 minutes. I suspect this might be down to the two cross joins.

    I tried a different tack, selecting random surnames and forenames independently using a couple of functions and a view:

    CREATE VIEW [dbo].[GetNewID]

    AS

    SELECT NewId() AS [NewID]

    Create FUNCTION [dbo].[GetRandomForename]()

    RETURNS varchar(15)

    AS

    BEGIN

    RETURN (

    SELECT TOP 1 Forename

    FROM dbo.Forenames --The table and column now have different names!

    ORDER BY (SELECT [NewId] FROM GetNewID)

    )

    END

    CREATE FUNCTION [dbo].[GetRandomSurname]()

    RETURNS nvarchar(15)

    AS

    BEGIN

    RETURN (

    SELECT TOP 1 Surname --The table and column now have different names!

    FROM dbo.Surnames

    ORDER BY (SELECT [NewId] FROM GetNewID)

    )

    END

    I then ran the update query below - this updated 100 records in 21 secs, and 500 in 1m 56s

    UPDATE dbo.subscription

    SET dbo.subscription.Forename = dbo.GetRandomForename(),

    dbo.subscription.Surname = dbo.GetRandomSurname()

    where id < 501

  • a lot depends on your data; that cross join can have a cost.

    if your data is anything like my copies of the US Census to 90% firstnames/lastnames, a cross join makes a massive 108 Million+ rows;

    Census_dist_Male--1219 first names

    Census_dist_Last --88799 last names

    88799 * 1219 = 108,245,981

    myself, i use a smaller subset; top 20 firstnames, and 20 last names, and i don't mind if the names get repeated every 400 records; in that way the performance is a bit better.

    seperate updates of firstname, then lastname, without a cross join can perform better than trying to generate firstname + lastname randomly together.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell

    I don't want too many repeats, so I'm using the full census tables.

    Using the two seperate functions, it updated 2500 rows in 12 minutes - for 500,000 records I will probably leave it running overnight

    Many thanks for your suggestions

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply