Blog Post

How to use Change Data Capture (CDC)

,

With SQL Server 2008 Microsoft added the CDC feature to SQL Server enterprise edition. I haven’t come to play around with it before, but the other day I was asked if CDC might be used for a specific purpose. Because I had no previous experience using CDC, I had to check it out.

The Quest

The Quest was to be able to provide a system with undo/redo capabilities, so if a user had entered (and committed) incorrect data, it should be possible to rollback to any given time before that happened. Backup/restore wasn’t an option, because only a specific subset of data should be rolled back (let’s say a specific UserId)

The idea

The idea was to either build a tracking system manually using triggers, or to look into the CDC possibilities. The idea I will try out with this small demo, is to use CDC to read the data from a previous point in time, and then use the MERGE statement to revert the data.

The solution

First I’ll create a database and enable CDC:

USE master
go
CREATE DATABASE TestCDC
GO
USE TestCDC
GO
EXEC sys.sp_cdc_enable_db
GO

 

Next I will create a table “Timelog”, and enable CDC on it. The Timelog table is the one we want the undo/redo capabillity on.

CREATE TABLE TimeLog (
    TimeLogId INT IDENTITY PRIMARY KEY,
    UserId INT NOT NULL,
    TimeStart DATETIME NOT NULL,
    TimeEnd DATETIME NULL,
    Description VARCHAR(200),
    Created DATETIME DEFAULT GETDATE(),
    Modified DATETIME DEFAULT GETDATE()
)
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name   = N'TimeLog',
@role_name     = NULL,
@filegroup_name = N'PRIMARY',
@supports_net_changes = 1
GO

 

The sys.sp_cdc_enable_table procedure gives the possibilities to control which filegroup the change data should be stored on, as well as a role to control access to the data. For this demo I’ll just store the data in the PRIMARY filegroup. CDC relies on SQL Server Agent to be running, so make sure it is so before running the sys.sp_cdc_enable_table procedure.

Next up is inserting some data to simulate one user performing a few correct time registrations, and another user registering some wrong data. I have put in a WAITFOR DELAY, which simulates that some time has past on since the last registration. It does not matter if it is seconds or months.

/*
Two users - userid 1 and 2
User 1 makes 3 registrations, but the third (the update) is wrong
User 2 makes 2 registrations, both are correct.
User 1 makes 1 correct registration
*/
--Correct registration
INSERT INTO TimeLog (UserId, TimeStart, Description)
VALUES (1, '2012-01-01 10:00:00', '1st work day')
GO
WAITFOR DELAY '00:00:02'
--Correct registration
INSERT INTO TimeLog (UserId, TimeStart, Description)
VALUES (1, '2012-01-02 10:00:00', '2nd work day')
GO
WAITFOR DELAY '00:00:02'
--Wrong registration/update
UPDATE TimeLog
SET
    TimeEnd = '2013-01-01 19:00:00',
    Modified = GETDATE()
WHERE
    UserId = 1
    AND TimeStart = '2012-01-01 10:00:00'
GO
WAITFOR DELAY '00:00:02'
--Correct registration
INSERT INTO TimeLog (UserId, TimeStart, Description)
VALUES (2, '2012-01-01 09:00:00', '1st work day')
GO
WAITFOR DELAY '00:00:02'
--Correct registration
INSERT INTO TimeLog (UserId, TimeStart, Description)
VALUES (2, '2012-01-02 09:00:00', '2nd work day')
GO
WAITFOR DELAY '00:00:02'
--Correct registration
INSERT INTO TimeLog (UserId, TimeStart, Description)
VALUES (1, '2012-01-03 10:00:00', '3rd work day')
GO

 

Let’s see how the data looks now:

--Notice TimeLogId = 1.
--The TimeEnd shouldn't be in 2013.
SELECT * FROM TimeLog

 

image

CDC adds a few functions that we can use to select the changes that has happened. The first we will use, is sys.fn_cdc_map_time_to_lsn, which takes a datetime as input, and returns the lsn (Transaction log sequence number) that matches the timestamp. It is also fed with a string value that tells if it should return the lsn that is

  • largest less than
  • largest less than or equal
  • smallest greater than
  • smallest greater than or equal

I’m interested in all changes that has happened on the table within the last 24 hours, so I’ll use the following lines to find those lsn’s:

DECLARE @begin_time datetime, @end_time datetime, @begin_lsn binary(10), @end_lsn binary(10);
SET @begin_time = GETDATE()-1
SET @end_time = GETDATE()
SELECT @begin_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', @begin_time);
SELECT @end_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

 

Now I have the lsn of the smallest lsn that has happened after the point in time GETDATE()-1 which is 24 hours ago. I also have the lsn of the last lsn that has happened before (or equal to) GETDATE(). This leads me to the next functions available. After enabling CDC on the table, a few functions and system tables was automatically added to my database:

image

 

The next function I’ll use, is the one called cdc.fn_cdc_get_all_changes_dbo_TimeLog. It’s called with the two lsn’s found above like this:

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_TimeLog(@begin_lsn, @end_lsn, 'all update old')

 

image

I now get all the changes that has happened, and updates will be listed as a trigger solution would, by having both a delete and insert row to hold the data from before the update and after.

Now UserId = 1 realizes that invalid registrations have found its way in the table. If we look at the CDC data, then UserId = 1 created an error at the transaction with lsn = 0x0000002C0000013B0003 which happened at ‘2012-03-05 19:38:04.727’. Another function can now be used to find the netto change up until right before that time. And because we started the cdc on an empty table, the netto change correspond to how the table looked at that time. The function is similar to the cdc.fn_cdc_get_all_changes_dbo_TimeLog() we used before, now it’s just called _net_ instead of _all_. So let’s try it out:

DECLARE @begin_time datetime = GETDATE()-1
DECLARE @end_time datetime = '2012-03-05 19:38:04.727'
DECLARE @begin_lsn binary(10) = (select sys.fn_cdc_map_time_to_lsn('smallest greater than', @begin_time))
DECLARE @end_lsn binary(10) = (select sys.fn_cdc_map_time_to_lsn('largest less than', @end_time))
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_TimeLog(@begin_lsn, @end_lsn, 'all')
WHERE UserId = 1

 

image

Now I filtered to only show changes for UserId = 1, and only up until right before the time when the wrong data was entered. This is actually the data as we would like to revert back to. Remember, that we don’t want to change anything for UserId = 2!

The way we can do this, is by using a MERGE statement, that uses the select above as source, and the TimeLog table as destination. Like this:

--Reset UserId = 1 to 2012-03-05 19:38:04.727
DECLARE @UserId INT = 1
DECLARE @begin_time datetime = GETDATE()-1
DECLARE @end_time datetime = '2012-03-05 19:38:04.727'
DECLARE @begin_lsn binary(10) = (select sys.fn_cdc_map_time_to_lsn('smallest greater than', @begin_time))
DECLARE @end_lsn binary(10) = (select sys.fn_cdc_map_time_to_lsn('largest less than', @end_time))
SET IDENTITY_INSERT TimeLog ON
MERGE INTO TimeLog tgt
USING (
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_TimeLog(@begin_lsn, @end_lsn, 'all')
    WHERE UserId = @UserId
) src ON tgt.TimeLogId = src.TimeLogId
WHEN MATCHED THEN
    UPDATE SET
        tgt.TimeStart = src.TimeStart,
        tgt.TimeEnd = src.TimeEnd,
        tgt.Description = src.Description,
        tgt.Created = src.Created,
        tgt.Modified = src.Modified
WHEN NOT MATCHED BY TARGET THEN
    INSERT (TimeLogId, UserId, TimeStart, TimeEnd, Description, Created, Modified)
    VALUES (src.TimeLogId, src.UserId, src.TimeStart, src.TimeEnd, src.Description, src.Created, src.Modified)
WHEN NOT MATCHED BY SOURCE AND tgt.UserId = @UserId THEN
    DELETE;
SET IDENTITY_INSERT TimeLog OFF
GO

 

  1. The select in the USING part, is the same as the one we just saw. I’m joining the source and target on TimeLogId. If the clause matches, then we simply update all the columns of the target (TimeLog table) to the values from the source (the CDC data).
  2. The rows that exists in the source, but not the target, is being inserted (notice the IDENTITY_INSERT at the top).
  3. The rows that does not exist in the source, but is present in the target and has UserId = 1 is deleted.

This is just a basic text book MERGE statement, so no fancy thing here.

Now let’s see the data available in TimeLog:

SELECT * FROM TimeLog

 

image

All the registrations for UserId = 2 is still available, but the last two changes for UserId = 1 is gone. That was the wrong update on the “1st work day”, as well as the “3rd work day”.

If we wan’t to perform a redo (an undo of the rollback), then we simply need to execute the exact same code, but use the EndTime from right before we did the rollback. Simple, right? Smiley

The conclusion

The solution I came up with, actually solved the problem. But I haven’t done any testing using huge data volumes. And since I have absolutely no real world experience using CDC, I cannot say if this idea is truly stupid, or if it actually might work. I haven’t considered how the undo feature should be build using transactions to ensure integrity, so the solution will probably be a bit more complex than the one above.

But at least now I have an idea what CDC is and what it might be used for, and hopefully you have too Smiley

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating