Error Logging Gone Wild

  • Comments posted to this topic are about the item Error Logging Gone Wild

  • Nice article!

    And we should not forget Sommarskog's great articles about Error Handling: http://www.sommarskog.se/error_handling/Part1.html

    Thanks

  • Thanks for the article.

  • Hi Andy,

    Let me run this past you.

    So, the problem as I understand it is that we really don't need to be informed 1,000,000 times that (for example) an ETL process attempted to insert Feb 30, 2015 into a date column, even if the process in question was stuck in a loop and attempted 1,000,000 inserts. When it comes to logging errors, it's not as if we're counting cars at a toll booth, and all we typically need is a single golden record that can be used to quickly nail down the problem.

    I have a simple solution that will cut down on the noise-to-signal ratio when logging errors, and it can be used for other types of logging as well, such as debug tracing or website clickstream analysis.

    I'll start with the ErrorLog table you presented in your article.

    create table ErrorLog (
        ErrorLogID int identity(1,1) not null primary key clustered,
        ErrorDateTime datetime2 default sysdatetime(),
        ErrorID int,
        Message nvarchar(1000),
        ApplicationName varchar(50),
        ApplicationVersion varchar(50)
    );

    We'll now add a non-persisted computed column, which is the checksum of ErrorDateTime, ErrorID, and Message. This can be thought of as a hash of a natural key; the assumption is that any rows repeating this column combination are redundant data. It's essential to choose an appropriate column combination for the application at hand.

    Note that this solution hinges on hashing ErrorDateTime at a lower resolution. For this example I'm using DATETIMEOFFSET(0), which gives us a resolution of one second.

    alter table ErrorLog add ErrorCheckSum
        as checksum(
              cast(ErrorDateTime as datetimeoffset(0))  -- bucket the timestamp to one-second resolution
            , ErrorID
            , Message
        );
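
    Just to make the one-second bucketing concrete, here's a throwaway check (the literal timestamps and values below are made up for illustration): two datetime2 values that land inside the same second produce the same checksum once cast to DATETIMEOFFSET(0), while a value in the next second does not.

    -- illustrative only: A and B share a second and collapse to one checksum, C does not
    select checksum(cast(cast('2016-10-11 09:32:26.1000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as SameSecondA
         , checksum(cast(cast('2016-10-11 09:32:26.4000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as SameSecondB
         , checksum(cast(cast('2016-10-11 09:32:27.1000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as NextSecond;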

    We'll then create a unique index on this hash column. The solution hinges on the IGNORE_DUP_KEY index option, which silently discards inserts that would duplicate an existing key. Because we're limiting the number of rows actually inserted into the table and the index is on a single integer column, this should be an efficient lookup with little overhead.

    create unique index uix_ErrorLog_IgnoreDupKey
        on ErrorLog ( ErrorCheckSum )
        with ( data_compression = page, ignore_dup_key = on );

    Now we're done. That's all there is to it.
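
    As a quick hand check of the IGNORE_DUP_KEY behavior (illustrative only; the error values here are arbitrary), insert the same error twice in quick succession. Assuming both statements land inside the same second, SQL Server discards the second row with a "Duplicate key was ignored." warning rather than raising an error, and @@ROWCOUNT comes back as 0.

    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 404, 'Sample duplicate check', 'MyApp', '1.1.1.2' );
    select @@rowcount as RowsInserted;  -- 1

    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 404, 'Sample duplicate check', 'MyApp', '1.1.1.2' );
    select @@rowcount as RowsInserted;  -- 0, with "Duplicate key was ignored."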

    To test this, we'll attempt 100,000 inserts in a tight loop.

    declare @i int = 0;
    while @i < 100000
    begin;
        insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
        values ( 500, 'This is a low volume error', 'MyApp', '1.1.1.2' );
        select @i = @i + 1;
    end;

    On my PC, this ran for about 7 seconds, and the ErrorLog table retained a total of 7 rows. At that rate it would log about 600 rows every 10 minutes: more than enough detail to track the issue within the context of time, but without so much redundancy that it consumes all available storage or crashes the server or the process.

    select * from ErrorLog;

    ErrorLogID  ErrorDateTime                ErrorID  Message                     ApplicationName  ApplicationVersion  ErrorCheckSum
    1           2016-10-11 09:32:26.5822773  500      This is a low volume error  MyApp            1.1.1.2             -1716654948
    9918        2016-10-11 09:32:27.5022589  500      This is a low volume error  MyApp            1.1.1.2             -248222563
    27356       2016-10-11 09:32:28.5022389  500      This is a low volume error  MyApp            1.1.1.2             1755245725
    44501       2016-10-11 09:32:29.5022189  500      This is a low volume error  MyApp            1.1.1.2             -1008636770
    61797       2016-10-11 09:32:30.5021989  500      This is a low volume error  MyApp            1.1.1.2             994831518
    78581       2016-10-11 09:32:31.5021789  500      This is a low volume error  MyApp            1.1.1.2             -1831703393
    96015       2016-10-11 09:32:32.5021589  500      This is a low volume error  MyApp            1.1.1.2             170192031

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric, that's interesting. Thinking about it, I think it's reasonable for the high-volume scenario, and it's definitely easy to implement. Low volume...say 3 a second for a few seconds, still works of course, but maybe harder to analyze/alert on? I'm thinking of the dashboard view, where the counts are what I look at, usually for the last hour.
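
    For instance, something like this rough sketch (untested, just to illustrate what I mean) is what I'd point a dashboard at for the last hour; the catch is that with the dedup in place count(*) is really "distinct seconds that saw the error", a floor on the true number of occurrences rather than the occurrence count itself.

    -- last-hour rollup per error; at most one row per second survives the IGNORE_DUP_KEY
    -- index for each ErrorID/Message combination, so this is a lower bound on occurrences
    select ErrorID
         , min(Message) as SampleMessage
         , count(*) as SecondsWithError
         , min(ErrorDateTime) as FirstSeen
         , max(ErrorDateTime) as LastSeen
    from ErrorLog
    where ErrorDateTime >= dateadd(hour, -1, sysdatetime())
    group by ErrorID
    order by SecondsWithError desc;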

    A nice alternative!
