Error Logging Gone Wild

  • Comments posted to this topic are about the item Error Logging Gone Wild

  • Nice article!

    And we should not forget Sommarskog's great articles about Error Handling: http://www.sommarskog.se/error_handling/Part1.html

    Thanks

  • Thanks for the article.

  • Hi Andy,

    Let me run this past you.

    So, the problem as I understand it is that we really don't need to be informed 1,000,000 times that (for example) an ETL process attempted to insert Feb 30, 2015 into a date column, even if the process in question was stuck in a loop and attempted 1,000,000 inserts. When it comes to logging errors, it's not as if we're counting cars at a toll booth, and all we typically need is a single golden record that can be used to quickly nail down the problem.

    I have a simple solution that will cut down on the noise-to-signal ratio when logging errors, and it can be used for other types of logging as well, such as debug tracing or website clickstream analysis.

    I'll start with the ErrorLog table you presented in your article.

    create table ErrorLog (
        ErrorLogID int identity(1,1) not null primary key clustered,
        ErrorDateTime datetime2 default sysdatetime(),
        ErrorID int,
        Message nvarchar(1000),
        ApplicationName varchar(50),
        ApplicationVersion varchar(50)
    );

    We'll now add a non-persisted computed column, which is the checksum of ErrorDateTime, ErrorID, and Message. This can be thought of as a hash of a natural key; the assumption is that any rows repeating this column combination are redundant data. It's essential to choose an appropriate column combination for the application at hand.

    Note that this solution hinges on hashing ErrorDateTime at a lower resolution. For this example I'm using DATETIMEOFFSET(0), which gives us a resolution of one second.

    alter table ErrorLog add ErrorCheckSum
        as checksum(
              cast(ErrorDateTime as datetimeoffset(0))  -- bucket the timestamp to one-second resolution
            , ErrorID
            , Message
        );
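
    Just to make the one-second bucketing concrete, here's a throwaway check (the literal timestamps and values below are made up for illustration): two datetime2 values that land inside the same second produce the same checksum once cast to DATETIMEOFFSET(0), while a value in the next second does not.

    -- illustrative only: A and B share a second and collapse to one checksum, C does not
    select checksum(cast(cast('2016-10-11 09:32:26.1000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as SameSecondA
         , checksum(cast(cast('2016-10-11 09:32:26.4000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as SameSecondB
         , checksum(cast(cast('2016-10-11 09:32:27.1000000' as datetime2) as datetimeoffset(0)), 500, N'This is a low volume error') as NextSecond;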

    We'll then create a unique index on this hash column. The solution hinges on the IGNORE_DUP_KEY index option, which silently discards inserts that would duplicate an existing key. Because we're limiting the number of rows actually inserted into the table and the index is on a single integer column, this should be an efficient lookup with little overhead.

    create unique index uix_ErrorLog_IgnoreDupKey
        on ErrorLog ( ErrorCheckSum )
        with ( data_compression = page, ignore_dup_key = on );

    Now we're done. That's all there is to it.
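
    As a quick hand check of the IGNORE_DUP_KEY behavior (illustrative only; the error values here are arbitrary), insert the same error twice in quick succession. Assuming both statements land inside the same second, SQL Server discards the second row with a "Duplicate key was ignored." warning rather than raising an error, and @@ROWCOUNT comes back as 0.

    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 404, 'Sample duplicate check', 'MyApp', '1.1.1.2' );
    select @@rowcount as RowsInserted;  -- 1

    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 404, 'Sample duplicate check', 'MyApp', '1.1.1.2' );
    select @@rowcount as RowsInserted;  -- 0, with "Duplicate key was ignored."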

    To test this, we'll attempt 100,000 inserts in a tight loop.

    declare @i int = 0;
    while @i < 100000
    begin;
        insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
        values ( 500, 'This is a low volume error', 'MyApp', '1.1.1.2' );
        select @i = @i + 1;
    end;

    On my PC, this ran for about 7 seconds, and the ErrorLog table retained a total of 7 rows. At that rate it would log about 600 rows every 10 minutes: more than enough detail to track the issue within the context of time, but without so much redundancy that it consumes all available storage or crashes the server or the process.

    select * from ErrorLog;

    ErrorLogID  ErrorDateTime                ErrorID  Message                     ApplicationName  ApplicationVersion  ErrorCheckSum
    1           2016-10-11 09:32:26.5822773  500      This is a low volume error  MyApp            1.1.1.2             -1716654948
    9918        2016-10-11 09:32:27.5022589  500      This is a low volume error  MyApp            1.1.1.2             -248222563
    27356       2016-10-11 09:32:28.5022389  500      This is a low volume error  MyApp            1.1.1.2             1755245725
    44501       2016-10-11 09:32:29.5022189  500      This is a low volume error  MyApp            1.1.1.2             -1008636770
    61797       2016-10-11 09:32:30.5021989  500      This is a low volume error  MyApp            1.1.1.2             994831518
    78581       2016-10-11 09:32:31.5021789  500      This is a low volume error  MyApp            1.1.1.2             -1831703393
    96015       2016-10-11 09:32:32.5021589  500      This is a low volume error  MyApp            1.1.1.2             170192031

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric, that's interesting. Thinking about it, I think it's reasonable for the high-volume scenario, and it's definitely easy to implement. Low volume...say 3 a second for a few seconds, still works of course, but maybe harder to analyze/alert on? I'm thinking of the dashboard view, where the counts are what I look at, usually for the last hour.
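
    For instance, something like this rough sketch (untested, just to illustrate what I mean) is what I'd point a dashboard at for the last hour; the catch is that with the dedup in place count(*) is really "distinct seconds that saw the error", a floor on the true number of occurrences rather than the occurrence count itself.

    -- last-hour rollup per error; at most one row per second survives the IGNORE_DUP_KEY
    -- index for each ErrorID/Message combination, so this is a lower bound on occurrences
    select ErrorID
         , min(Message) as SampleMessage
         , count(*) as SecondsWithError
         , min(ErrorDateTime) as FirstSeen
         , max(ErrorDateTime) as LastSeen
    from ErrorLog
    where ErrorDateTime >= dateadd(hour, -1, sysdatetime())
    group by ErrorID
    order by SecondsWithError desc;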

    A nice alternative!
