deduping data in a flat file

  • I've been given a data file to load into my main database which has quite a few duplicates in it.  The problem is that it's got no unique identifiers in it just address lines 1-6 and a postcode field.

    I'm planning on doing a dedupe on address line 1 and postcode.  The only problem now is I don't know how to write the dedupe.  I've only ever deduped using an int field before.  Could anyone please help me?

  • First, load into a staging table.   Then you can use code like this:

    -- Delete duplicates from mytable

    alter table mytable

    add id [int] IDENTITY (1, 1) NOT NULL

    go

    if exists (select * from dbo.sysobjects where id = object_id(N'[deleted_from_mytable]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)

    drop table [deleted_from_mytable]

    GO

    --save deleteds into temp table just in case

    select c.* into deleted_from_mytable

    From mytable c (nolock)

    JOIN mytable c1 (nolock)

    on c.[col1]+c.[col4] = c1.[col1]+c1.[col4]

    where c.[id] > c1.[id]

    go

    --now delete the duplicates

    delete c

    From mytable c (nolock)

    JOIN mytable c1 (nolock)

    on c.[col1]+c.[col4] = c1.[col1]+c1.[col4]

    where c.[id] > c1.[id]

    go

    ALTER TABLE mytable DROP COLUMN [id]

     

    Bill

  • I'd agree with the above. Load into SQL and then dedup.

  • Thanks this is brilliant.

     

    All my previous dedupes I've done have involved doing loops and therefore took a while, I knew a self join might be a way forward but never got it to work.

  • If you have room for two copies of the data, I think this might run faster than the self-join and DELETE method.

    First import the data into a staging table.

    Then create another staging table with the same structure, and put a unique constraint on the dedupe fields using WITH IGNORE_DUP_KEY.

      SELECT * INTO StageUnique FROM Staging WHERE 1=0

      CREATE UNIQUE INDEX IX_StageUnique ON StageUnique (Addr1, PostCode)

        WITH IGNORE_DUP_KEY

    Now copy the data into the second table.  All duplicates are rejected, which is faster than locating and deleting them later.

      INSERT INTO StageUnique  SELECT * FROM Staging ORDER BY Addr1, PostCode

    You can add to the ORDER BY clause if you have a preference for which record among a set of duplicates is retained.  For instance, you may want to keep records where Addr2 is defined if there are other records where it is missing.

      INSERT INTO StageUnique  SELECT * FROM Staging

      ORDER BY Addr1, PostCode, CASE WHEN ISNULL(Addr2,'') = '' THEN 1 ELSE 0 END

     

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply