COALESCE Vertically??

  • I am trying to find a set based solution to retrieve the first not null value for each column in a time ordered set. Cursors and UDF are excluded for performance reasons. I considered a UDA in C# but the IsInvariantToOrder attribut is not yet supported by the query processor and UDA do not support varchar(MAX) and XML types yet.

    I have audit/history tables structured like this:

    Create

    table LogTable( PK int primary key clustered,

    UpdatedTS datetime NOT NULL,

    UpdatedBy varchar(30) NOT NULL,

    Col1 varchar(30) NULL,

    Col2 int NULL,

    ColN int NULL)

    a) Inserts are not logged

    b) only updated columns are logged

    c) a deleted row is fully logged

    This enables to have no impact on inserts and few impact on updates. The UpdatedTS and UpdatedBy columns are also in the primary table. This minimizes IO. The usual log structure (pk, date, user, old valu, new value) has too much IO overhead and must log everything as strings.

    When I am slicing the table to retrieve a row version at a specific time, for each NULL column I need to retrieve the most recent not null value (if they are all null, the value comes from the primary table since that column was never updated)

    I have been scratching my head to find a good set based solution for this. Anybody has a suggestion?

  • Something like :

    Select ColName, NewValue from dbo.LogTable LT inner join

    (Select Max(PK), ColName FROM dbo.LogTable group by ColName) dtLastVersion

    on LT.pk = dtLastVersion.PK

     

    Once you have that, have the application or calling proc figure out what missing data there is and send it out to the user.  Also you must take into consideration that a null value can be a valid VALUE and not only missing data in the history.

     

    That's why I always opt for a copy of the original table when I do a log.  That way you can retrieve the whole row in a single step.

  • Thanks for replying. I am doing a delta instead of a full copy because copying the full row would kill the server. We get thousands of tran per minutes. Some tables are wide and have LOBs in them.

    Your solution does not work (the way I interpret it). The PK of the row doesn't change, just the timestamp. The MAX(PK) makes event less sense to me. I am not trying to output an OldValue/NewValue but a complete row for each point in time.

    I've tried several approaches, but nothing truly elegant yet. My current experiment is to use a recursive CTE to fully materialize each time slice, but I am still struggling with it. A bit of challenge to brigthen the day

     

     

  • ok here's a modified version :

     

    Select ColName, Value from dbo.LogTable LT inner join

    (Select Max(UpdatedTS), ColName FROM dbo.LogTable where Pk = @pk group by ColName) dtLastVersion

    on LT.UpdatedTS = dtLastVersion.UpdatedTS and LT.PK = @pk

     

    This will give you something like :

    LName = John

    FName = Smith

    ...

     

    You'll have to pivot that into a single row to have the latest version.  But then again you might have missing columns in there because they might not have been updated at all since the insert.

  • ... and if you're trying to rebuild the whole history of the row then I have no elegant solution to offer.  You might be able to pivot the changes on the date to know all the columns that were updated in a single shot.  But atfter that, without a starting point you'll have to go back in time to rebuild the history backwards from the current data.  I can't think of a simple solution for this one.

  • I don't know how you guys can work through these things without examples. I find it impossible .

    RGR - I don't get what you're getting at - could you post it with some example data. I've invented some below if that helps.

    lakusha - Interesting problem! Is the example data I've invented sufficient to illustrate your problem, and is the target I've given correct? If not, would you please provide an example and the target results.

    I've come up with 2 ways. The first is highly inelegant and inefficient, I'd say, and the second is semi-elegant . i.e. neither is ideal!

    Take a look and see what you think...

    --data

    declare @LogTable table ( PK int primary key clustered, UpdatedTS datetime NOT NULL,

        UpdatedBy varchar(30) NOT NULL, Col1 varchar(30) NULL, Col2 int NULL, ColN int NULL)

    insert @LogTable

              select 1, '20060101', 'A', null, 1, null

    union all select 2, '20060102', 'A', null, 2, null

    union all select 3, '20060103', 'A', 'C', null, 4

    union all select 4, '20060104', 'A', 'B', null, null

    --target: 'B', 2, 4

    --calculation (one idea)

    select

      (select Col1 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col1 is not null)),

      (select Col2 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col2 is not null)),

      (select ColN from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where ColN is not null))

    --calculation (another idea)

    declare @Col1 varchar(30), @Col2 int, @ColN int

    select

      @Col1 = case when @Col1 is null and Col1 is not null then Col1 else @Col1 end,

      @Col2 = case when @Col2 is null and Col2 is not null then Col2 else @Col2 end,

      @ColN = case when @ColN is null and ColN is not null then ColN else @ColN end

    from @LogTable

    order by UpdatedTS desc

    select @Col1, @Col2, @ColN

     

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Sorry, emergency time here... no more time on ssc this week...

  • Sorry about the lack of examples Ryan. I posted the table structure but I had so many alternatives I had tried it would have been confusing. Some of them worked but were, urg, inellegant

    I now have something that works and seems **much** faster. I am using COALESCE in a recursive CTE. My schema is way to bug to post as an example, so I will build a small example and post it if you are still interrested.

    Sometimes SQL is just plain fun

  • Of course I'm interested!

    Not that I have access to sql 2005 yet, but still, I am interested

    Did the example I gave illustrate your problem?

    And were the techniques of any interest? Well the 2nd one anyway...

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  •  

    Ryan,

    In did not forget the example. While coding it I discovered a flaw in my logic (I can end up with a data anomaly with the "first" row, i.e. the one returned by the anchor of the CTE). I tried to fix it and even tried doing the recursivity backward in time instead of forward before realizing it was foolish.

    I spent Friday at 35000 thousand feets and had food poisonning from a sushi restaurant at the Detroit airport

    I will look at the problem again Sunday.

    L

  • Here it goes:

    /*

    For performance reasons, I am using a RECVER (record version) column instead of the update timestamp.

    That column exists to support optimistic concurrency across n database vendors

    It is always updated by the application and this way I don't have to use a datetime in the PK of the log table (which is a bad practice)

    We can get rid of RECVER and use UpdatedTime instead by using a boundary query but I choose to stick to RECVER instead

    */

     

    use

    master

    if

    exists(Select * from sys.databases where [name] = N'LoggingExample')

    Drop database LoggingExample

    GO

    Create

    database LoggingExample

    GO

    Use

    LoggingExample

    GO

    Create

    table ProductionTable(

    PK

    int NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    UpdatedBy

    varchar(20),

    UpdatedTime

    datetime,

    Col1

    varchar(30) NOT NULL,

    Col2

    varchar(30) NOT NULL,

    Col3

    int NOT NULL,

    RECVER

    int NOT NULL)

    GO

    Create

    table LogTable(

    PK

    int NOT NULL,

    UpdatedBy

    varchar(20) NOT NULL,

    UpdatedTime

    datetime NOT NULL,

    Col1

    varchar(30) NULL,

    Col2

    varchar(30) NULL,

    Col3

    int NULL,

    RECVER

    int,

    CONSTRAINT [PK_LogTable] PRIMARY KEY CLUSTERED

    (

    [PK]

    ASC,

    [RECVER]

    ASC

    )

    )

    GO

    -- this one is used for debugging only

    Create

    table ProofTable(

    PK

    int NOT NULL,

    UpdatedBy

    varchar(20) NOT NULL,

    UpdatedTime

    datetime NOT NULL,

    Col1

    varchar(30) NULL,

    Col2

    varchar(30) NULL,

    Col3

    int NULL,

    RECVER

    int,

    CONSTRAINT [PK_ProofTable] PRIMARY KEY CLUSTERED

    (

    [PK]

    ASC,

    [RECVER]

    ASC

    )

    )

    GO

    /*

    Two triggers to audit the table:

    only updates and deletes are tracked

    we log only the delta (columns that have change) in the log table

    we log everything in the profftable to help debug

    */

    Create

    Trigger TRINS_ProductionTable on [ProductionTable]

    For

    Update

    as

    SET

    NOCOUNT ON

    Insert

    into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)

    Select

    I.PK,

    D

    .UpdatedBy,

    D

    .UpdatedTime,

    NULLIF(D.Col1, I.Col1),

    NULLIF(D.Col2, I.Col2),

    NULLIF(D.Col3, I.Col3),

    D

    .RECVER

    From

    Inserted as

    inner

    join Deleted as [D] on I.PK = D.PK;

    -- Now log a second time to help testing

    Insert

    into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;

    GO

    Create

    Trigger TRIDEL_ProductionTable on [ProductionTable]

    For

    Delete

    as

    SET

    NOCOUNT ON

    Insert

    into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;

    -- Now log a second time to help testing

    Insert

    into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;

    GO

    --TRUNCATE TABLE [ProductionTable]

    --TRUNCATE TABLE [LogTable]

    --TRUNCATE TABLE [ProofTable]

    /*

    Insert a row and modify it a few times

    real test is done with a few million rows

    */

    Declare @pk int

    Insert into ProductionTable (UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)

    Select 'Bob',getdate(),'First Value','Some thing...',77, 1

    Select @pk = SCOPE_IDENTITY()

    Update ProductionTable

    Set UpdatedBy = 'John',

    UpdatedTime

    = dateadd(mi,10,getdate()),

    Col1

    = 'First Value',

    Col2

    = 'Some thing...',

    Col3

    = 78, -- CHANGE

    RECVER

    = RECVER + 1

    Where PK = @pk

    --

    Update ProductionTable

    Set UpdatedBy = 'Billy2',

    UpdatedTime

    = dateadd(mi,20,getdate()),

    Col1

    = 'Second Value', -- CHANGE

    Col2

    = 'Something nice...', -- CHANGE

    Col3

    = 78,

    RECVER

    = RECVER + 1

    Where PK = @pk

    Update ProductionTable

    Set UpdatedBy = 'Billy3',

    UpdatedTime

    = dateadd(mi,30,getdate()),

    Col1

    = 'Second Value',

    Col2

    = 'Something nice...',

    Col3

    = 78,

    RECVER

    = RECVER + 1

    Where PK = @pk

    -- CHANGE

    Update ProductionTable

    Set UpdatedBy = 'Billy4',

    UpdatedTime

    = dateadd(mi,40,getdate()),

    Col1

    = 'Second Value',

    Col2

    = 'Something nice...',

    Col3

    = 79, -- CHANGE

    RECVER

    = RECVER + 1

    Where PK = @pk

    GO

    ----------------------physical state of the log------------------------

    --- Notice the NULLs we want to fill

    -- you can visualize it by playing Tetris with the resultset:

    -- we must push the values down in the NULL cells

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From

    ProductionTable

    UNION

    ALL

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From

    LogTable

    Order

    By RECVER desc

    ---------------------Log we want to expose--------------------

    Select

    PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From

    ProductionTable

    UNION

    ALL

    sELECT

    * FROM [ProofTable]

    Order

    By RECVER desc

    -------------------- Backward in time ------------------------

    -- COALESCE Vertically by pushing values down the stack

    -- CTEs can be changed into VIEWS if necessary

    With

    TimeLog as

    (

    Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From ProductionTable

    UNION ALL

    Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From LogTable

    ),

    TimeSlice

    as

    (

    Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER

    From TimeLog as [t1]

    Where RECVER = (Select MAX(RECVER) From TimeLog t2 where t2.PK = t1.PK)

    UNION ALL

    Select TimeSlice.PK,

    L

    .UpdatedBy, L.UpdatedTime,

    COALESCE(L.Col1, TimeSlice.Col1),

    COALESCE(L.Col2, TimeSlice.Col2),

    COALESCE(L.Col3, TimeSlice.Col3),

    L

    .RECVER

    From TimeLog as [L]

    join TimeSlice on L.PK = TimeSlice.PK and L.RECVER = (TimeSlice.RECVER-1)

    )

    Select

    TS.PK,

    TS

    .UpdatedBy, TS.UpdatedTime,

    COALESCE(TS.Col1, TS2.Col1) as [Col1],

    COALESCE(TS.Col2, TS2.Col2) as [Col2],

    COALESCE(TS.Col3, TS2.Col3) as [Col3],

    TS

    .RECVER

    from

    TimeSlice TS

    left

    join TimeSlice TS2 on TS.PK = TS2.PK and TS.RECVER = (TS2.RECVER-1)

    and TS2.RECVER=2 and TS.RECVER=1

    Order

    By RECVER desc;

  • Thanks lakusha. I'll have to wait till we get SQL 2005 to be able to work through this, but it makes interesting reading in the meantime! 

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Care to expand on this idea???

     

    "...I don't have to use a datetime in the PK of the log table (which is a bad practice)"

  •  

    Datetime is an imprecise datatype. You always have +-3ms.

    Try this:

    Select cast('2006-05-15 23:59:59.999' as datetime)

    And it happens more often then you think. The last digit is always rounded by SQL to 0, 3 or 7. That means that the data you read can differ from the one you inserted.

    It also mean that two row inserted sequencially can end up with the exact same value. In a PK that means an access violation and in application logic it means it can screw up the sequence of events in your log.

  • Now that you are looking for performance, Allow me to also say that your triggers should check for different values and only then insert the change in the log.

    The reason for the PK issues is that the PK of the the log table should be TRANSACTIONID instead of PK and datetime. It is, as you've already discovered not a good idea.

    I my opinion all this pivoting should be done client side instead but if you managed to do it with a CTE it should be fine

    Cheers,

      


    * Noel

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply