Maintaining versions of data

  • I'm looking for different options for a new situation I have. Here is the overview:

    1) I have project-based data that gets loaded in complete sets (one project at a time).

    2) The data set for a project can get quite large so I am only interested in storing changes to the data.

    3) I need to be able to recall a full project from each revision at any time (versioning).

    This is easier to explain with simplified sample tables. Here is my first option using time-based history tables:

    First Option: Time-based versioning with history tables

    -- Time-based versioning

    CREATE TABLE [project]

    (

    [project_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[modify_date] DATETIME

    ,[project_name] VARCHAR(100)

    )

    CREATE TABLE [project_history]

    (

    [project_history_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[history_start] DATETIME

    ,[history_end] DATETIME

    ,[project_id] INT

    ,[modify_date] DATETIME

    ,[project_name] VARCHAR(100)

    )

    CREATE TABLE [task]

    (

    [task_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[modify_date] DATETIME

    ,[project_id] INT

    ,[task_name] VARCHAR(100)

    )

    CREATE TABLE [task_history]

    (

    [task_history_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[history_start] DATETIME

    ,[history_end] DATETIME

    ,[task_id] INT

    ,[modify_date] DATETIME

    ,[project_id] INT

    ,[task_name] VARCHAR(100)

    )

    The theory behind this table set is:

    1) A change to a project copies the project into the project_history table, then updates the project table.

    2) A change to a task copies the project into the project_history table, then updates the project table, then copies the task into the task_history table, then updates the task table. This task table update copies the modify_date from the project table as well to keep it in sync.

    3) Retrieval of a list of project versions is simply the list from the project_history table. Retrieving a specific version is filtered on the modify_date from the project_history table.

    Second option: Row-based versioning with primary key updates

    -- Row-based versioning

    CREATE TABLE [project]

    (

    [project_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[project_guid] UNIQUEIDENTIFIER -- used to find project

    ,[modify_date] DATETIME

    ,[project_name] VARCHAR(100)

    ,[version] INT

    )

    CREATE TABLE [task]

    (

    [task_id] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,[modify_date] DATETIME

    ,[project_id] INT

    ,[task_name] VARCHAR(100)

    )

    The theory behind this table set is:

    1) A change to a project only creates a new row in the project table, incrementing the version number. All task records are then updated to this new primary key.

    2) A change to a task creates a creates a new row in the project table, incrementing the version number. The old task row is left alone and the new task row is inserted and linked to the new project id.

    3) Retrieval of a list of project versions is simply the list from the project table. Retrieving a specific version is based on the foreign keys in the task table.

    Personally, I prefer the first option because I don't like changing a primary key / foreign key in multiple tables. Are there more options that I should consider? Paul Neilson has a script called "AutoAudit" but I don't know how well it would apply to this type of batch insert.

    Regards,

    Michael Lato

    Regards,
    Michael Lato

  • I would go with the second option but do away with sequence numbers. Use shadow tables instead.

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply