Stopping Duplicates with INSTEAD OF INSERT Problem

  • We have a table which gets thousands of new records per hour from a web app. This app needs to generate GUIDs and then if the GUID doesn't exist in the table, it needs to insert it. The problem is that we were seeing large volumes of duplicates in the table, so we created a trigger which does an INSTEAD OF INSERT on the table and checks that the key doesn't already exist. However we still get a few duplicates a day and I can't tell why, this should be rock solid..most all of the duplicates are a few milliseconds after the originals so I'm wondering if there is some timing thing that we could do to fix it. I'd prefer to use a unique constraint on the table, but we are partitioning the table and we partition on the created time field, and since the partition key id needs to be in every index (for our archiving purposes) it doesn't work since the time is different between the two records.

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TRIGGER [dbo].[TriggerName]

    ON [dbo].[TableName]

    INSTEAD OF INSERT

    AS

    BEGIN

    -- SET NOCOUNT ON added to prevent extra result sets from

    -- interfering with SELECT statements.

    SET NOCOUNT ON;

    INSERT INTO TableName

    SELECT *

    FROM inserted

    WHERE NOT EXISTS (

    SELECT TableID

    FROM TableName

    WHERE TableID = inserted.TableID

    )

    END

  • Eric, in the absence of sample data, I'm going to hazard an opinion. You have a join at some point that is causing multiple rows with the same GUID in the result set that is being inserted as a single insert. Testing for your NOT EXISTS is not done on a row by row basis after each insert, it is only done once before the rows are inserted. If there are duplicate GUIDs in the INSERTED table, your code is not going to prevent the extra rows from being inserted.

    To solve your problem, you should figure out where and how the extra rows are being generated. Think about this: GUIDS are intended to be extremely unique within your system. Even with thousands being generated an hour, what are the odds that duplicates just happen to be occurring? Extremely low. Yet this is an apparent and recurring problem for you. An unexpected generation of multiple rows is far more likely.

    I am deliberately avoiding suggesting SELECT DISTINCT as a way to solve this in your trigger. SELECT DISTINCT will only work if the rows are exact duplicates. It is entirely possible to generate multiple rows with the same GUID but with other differences such as dates.

    If you want, I can generate a code sample that shows you how this can happen. But you are going to have to study your code that does the original insert to find the heart of your problem. Good hunting, and let us know if you have further questions.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I took a look at the insert proc, but it is simply doing a standard insert into (...) values (...) with the values being passed in as variables

    Perhaps I was simplifying my original post too much, the way our system works is that an application is downloaded from our website and the application generates the GUID (not the way I want it to work, but thats the way it's setup and I have to deal with it), so the system isn't generating the GUID, the App is. When the Application calls home the first time it sends the GUID to us, we then insert the GUID and some additional info into the table in question if the GUID doesn't already exist in the table. The Trigger we created should catch any duplicates.

    I'll look more into the process that the Application uses when it calls home to see why it might send multiple calls to the insert proc, but regardless of the reason why it might, the insert is a straight insert and the trigger should catch the second record and prevent it from being inserted.

    Thanks for your comment, any other thoughts you have are also appreciated.

  • yeah the trigger should be doing a left outer join to test if the row exists, like this instead:

    INSERT INTO TableName

    SELECT *

    FROM inserted

    LEFT OUTER JOIN TableName

    ON inserted.TableID = TableName.TableID

    WHERE TableName.TableID IS NULL

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell,

    Thanks for the reply, I'd like to better understand why that solution works and the NOT EXISTS doesn't, it seems to me that doing an EXISTS check on the table should work just as well.

    Thanks.

  • There is nothing wrong with the NOT EXISTS syntax that the LEFT JOIN is going to cure. That's just another technique for doing the same non-existence test. The problem remains that if duplicate GUIDS exist in the INSERTED table, only one test will be done to see if that GUID is already in the target table. I will code you an example.

    Edited to add:

    Gail Shaw has an excellent write-up on LEFT JOIN/NOT EXISTS on her website right now.

    http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/[/url]

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • OKAY... once more from the top. In the example below, there are either one or two rows in the source table, but each have the same GUID for [TableID]. Run it first with the UNION ALL and second SELECT commented out. This demonstrates that the trigger is firing and taking control of the insertion.

    Next, run it with the UNION ALL / SELECT lines uncommented. The following things happen:

    1. The INSTEAD OF trigger fires and takes control of the process.

    2. The NOT EXISTS test is run to see if the TABLEID already exists in the (empty) target table.

    3. No match is found and an insert OF BOTH source rows is attempted as a single transaction.

    4. The insert fails due to a primary key violation, and NEITHER row is inserted in the target table.

    Please test this and let me know if you have any questions about 2 and 3 above. This is really a very important concept to understand.

    if object_id(N'dbo.testSource') is not null DROP TABLE dbo.testSource

    if object_id(N'dbo.testTarget') is not null DROP TABLE dbo.testTarget

    DECLARE @ID UNIQUEIDENTIFIER

    CREATE TABLE dbo.testSOURCE (TABLEID UNIQUEIDENTIFIER,

    CUSTNO INT,

    AMOUNT NUMERIC(9,2),

    TransDate DATETIME,

    Comments VARCHAR(MAX))

    CREATE TABLE dbo.testTARGET (TABLEID UNIQUEIDENTIFIER PRIMARY KEY,

    CUSTNO INT,

    AMOUNT NUMERIC(9,2),

    TransDate DATETIME,

    Comments VARCHAR(MAX))

    SET @ID = NEWID()

    INSERT INTO dbo.testSOURCE

    SELECT @ID, 120, 100.00, '1/1/2010','No Comment'

    --uncomment the next two lines to test effectiveness of trigger against duplicate keys in source data

    --union all

    --SELECT @ID, 120, 200.00, '1/1/2010','No Comment'

    -- note that the two rows are NOT identical and would pass a SELECT DISTINCT

    GO

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TRIGGER [dbo].[testTrigger]

    ON [dbo].[testTarget]

    INSTEAD OF INSERT

    AS

    BEGIN

    -- SET NOCOUNT ON added to prevent extra result sets from

    -- interfering with SELECT statements.

    SET NOCOUNT ON;

    INSERT INTO dbo.testTarget (TABLEID,CUSTNO,AMOUNT,TransDate,Comments)

    SELECT TABLEID,CUSTNO,AMOUNT,TransDate,'Trigger Worked'

    FROM inserted

    WHERE NOT EXISTS (

    SELECT TableID

    FROM dbo.TestTarget

    WHERE TableID = inserted.TableID

    )

    END

    GO

    insert into dbo.TestTarget(TABLEID,CUSTNO,AMOUNT,TransDate,Comments)

    select TABLEID,CUSTNO,AMOUNT,TransDate,Comments from dbo.TestSource

    GO

    select 'Source' as [Source],* from dbo.TestSource

    select 'Target' as [Target],* from dbo.TestTarget

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/24/2010)


    OKAY... once more from the top. In the example below, there are either one or two rows in the source table, but each have the same GUID for [TableID]. Run it first with the UNION ALL and second SELECT commented out. This demonstrates that the trigger is firing and taking control of the insertion.

    Next, run it with the UNION ALL / SELECT lines uncommented. The following things happen:

    1. The INSTEAD OF trigger fires and takes control of the process.

    2. The NOT EXISTS test is run to see if the TABLEID already exists in the (empty) target table.

    3. No match is found and an insert OF BOTH source rows is attempted as a single transaction.

    4. The insert fails due to a primary key violation, and NEITHER row is inserted in the target table.

    Please test this and let me know if you have any questions about 2 and 3 above. This is really a very important concept to understand.

    I understand what you are saying, and I verified that your code does what it says it does. But this is doing something different than what my system is doing (per my previous statement).

    The insert proc is the only thing inserting rows into the table and the proc does this:

    BEGIN TRAN

    INSERT INTO

    dbo.TableName

    (

    TableID --GUID

    , CreatedTime

    , OsId

    , DefaultBrowserVersionId

    , IEVersionId

    , [LangId]

    )

    VALUES

    (

    @TableID

    ,GETDATE()

    ,@OsId

    ,@DefaultBrowserVersionId

    ,@IEVersionId

    ,@LangId

    )

    COMMIT TRAN

    So there isn't a chance that this is causing two rows to be inserted, unless the proc is called twice. I'm checking with the web app developers to see why the proc might be called twice, but regardless, I believe the transaction around it coupled with the trigger should eliminate all duplicates...unless the transaction is causing a problem I am not thinking of???

    But it obviously is allowing duplicates, so any further help you (or anyone) could provide would be much appreciated.

    Thank you.

  • Okay... I obviously misunderstood your issue. I agree that a single insert with a VALUES clause like that shouldn't create two rows. (This is where seeing source data and actual vs expected results is most helpful.) Let's start over with a few questions for clarification.

    1. Does the target table for the insertions have a primary key, or a unique clustered index on it?

    2. Is the [TableID] column (the GUID) supposed to have a different value each time it is sent?

    (The name "TableID" suggests something that designates an object, and could be repeated.)

    3. Do other triggers exist on the target table other than the one you showed in your first post?

    Dont' worry. Other volunteers won't hesitate to join in.

    EDITED TO ADD:

    4. Why aren't you doing the NOT EXISTS test in the procedure itself, rather than creating this trigger?

    (You would have to switch to a SELECT syntax rather than use VALUES, but that's simple enough. Alternatively, you could use a TRY/CATCH block to handle PK violations... assuming you have a primary key defined on the target.)

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I just used your procedure's BEGIN/INSERT/VALUES/COMMIT logic to insert multiple rows with the same GUID.

    The trigger worked just as you intended it should.

    We are missing some information. Please let us know what you find out.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/24/2010)


    Okay... I obviously misunderstood your issue. I agree that a single insert with a VALUES clause like that shouldn't create two rows. (This is where seeing source data and actual vs expected results is most helpful.) Let's start over with a few questions for clarification.

    1. Does the target table for the insertions have a primary key, or a unique clustered index on it?

    2. Is the [TableID] column (the GUID) supposed to have a different value each time it is sent?

    (The name "TableID" suggests something that designates an object, and could be repeated.)

    3. Do other triggers exist on the target table other than the one you showed in your first post?

    Dont' worry. Other volunteers won't hesitate to join in.

    EDITED TO ADD:

    4. Why aren't you doing the NOT EXISTS test in the procedure itself, rather than creating this trigger?

    (You would have to switch to a SELECT syntax rather than use VALUES, but that's simple enough. Alternatively, you could use a TRY/CATCH block to handle PK violations... assuming you have a primary key defined on the target.)

    1) I did a script > Create as:

    CREATE TABLE [dbo].[TableName](

    [TableID] [uniqueidentifier] NOT NULL,

    [CreatedTime] [datetime] NOT NULL,

    [OsId] [int] NULL,

    [DefaultBrowserVersionId] [int] NULL,

    [IEVersionId] [int] NULL,

    [defaultbrowserid] [int] NULL,

    [LangId] [int] NULL,

    CONSTRAINT [PK_TableName] PRIMARY KEY NONCLUSTERED

    (

    [TableID] ASC,

    [CreatedTime] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [MonthlyPartition_SC01]([CreatedTime])

    ) ON [MonthlyPartition_SC01]([CreatedTime])

    GO

    CREATE CLUSTERED INDEX [TableName_I01] ON [dbo].[TableName]

    (

    [CreatedTime] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [MonthlyPartition_SC01]([CreatedTime])

    GO

    CREATE TRIGGER [dbo].[TableName_TI01]

    ON [dbo].[TableName]

    INSTEAD OF INSERT

    AS

    BEGIN

    -- SET NOCOUNT ON added to prevent extra result sets from

    -- interfering with SELECT statements.

    SET NOCOUNT ON;

    INSERT INTO TableName

    SELECT *

    FROM inserted

    WHERE NOT EXISTS (

    SELECT TableID

    FROM TableName

    WHERE TableID = inserted.TableID

    )

    END

    *Note that for our Partitioning to work for monthly archiving of old data, we need to have the CreatedTime column be in every index (for the create new table swap partition and drop new table process), so that means that pretty much every row should be unique as the time should change for each row.

    2) The GUID column should have a different GUID for each download of the application, but I suppose it is possible that if a consumer re-installed it it would generate a new GUID if it doesn't find one in the Registry.

    3) No, just the one above

    4) I asked that very question to the developers in an email this evening, but havne't heard back yet...not sure if there is a valid reason, but I'd rather catch the duplicate in the insert proc than hit the table in an insert and catch it in the trigger...

  • The duplicates you mention: Are they simply duplicates of the [TableID], or is the [createdTime] also identical. Since it's a datetime field it would have to be identical down to the millisecond. However, that still wouldn't explain why a NOT EXISTS test on [TableID] only isn't working.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/24/2010)


    The duplicates you mention: Are they simply duplicates of the [TableID], or is the [createdTime] also identical. Since it's a datetime field it would have to be identical down to the millisecond. If it isn't, that's the problem right there, because the VALUES clause uses getdate() to populate that column, and that will be different every time the procedure is executed.

    I ran this:

    SELECT *

    FROM dbo.tablename WITH (NOLOCK)

    WHERE tableid = '7277D2DB-2D50-4FE4-888E-8E7C9261B256'

    and got this:

    TableId CreatedTime OsId DefaultBrowserVersionId IEVersionId defaultbrowserid LangId

    7277D2DB-2D50-4FE4-888E-8E7C9261B256 2010-03-24 02:40:12.860 NULL 4 NULL NULL NULL

    7277D2DB-2D50-4FE4-888E-8E7C9261B256 2010-03-24 02:40:12.863 NULL 4 NULL NULL NULL

    And in all but one of the duplicates I looked over the past two day the difference was .003 seconds apart

    So since they are all different times, we can assume the proc is firing more than once, but if thats the case, I don't know why the tirgger doesn't catch it.

  • Next question: If the objective is to prevent two rows from having the same TableID, why isn't the primary key on [TableID] only? That would put a hard stop to the problem.

    Something is still amiss here. I'd love to see the proc in it's entirety, but I have to get some sleep. I'll check back in the morning. Maybe someone else can offer some insight in the meanwhile.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/24/2010)


    Next question: If the objective is to prevent two rows from having the same TableID, why isn't the primary key on [TableID] only? That would put a hard stop to the problem.

    Something is still amiss here, but I have to get some sleep. I'll check back in the morning. Maybe someone else can offer some insight in the meanwhile.

    Because of the partition setup, we need CreatedTime in the partition because that is the field we have the partition setup on. Each month we run a script which creates a new table with identical indexes, swaps the partition between the two tables, BCP's out the data, then drops the new staging table. We are required to have the Partition key (CreatedTime) in every index (at least that is my understanding of the way partitioning works).

Viewing 15 posts - 1 through 15 (of 37 total)

You must be logged in to reply to this topic. Login to reply