Offloading Historical Data Periodically

  • I've read about a couple of approaches for similar requirements, but thought it would be best to ask for some feedback before I embark on this one...

    We amass several million rows of audit data each year - basically each row in an audit table is a version of the original table's row before it gets updated. This is, of course, leading to space utilization concerns and some potential performance issues. Since we really only need a year's worth of historical data available to customers at any given time, I'd like to offload older data at regular intervals - let's say monthly.

    The offloaded data will be kept, of course, and we may need to run reports against it on rare occasions - and we'd probably include the "current" data (the last year's worth) in these reports as well.

    We are open to any and all suggestions, inside or outside the box. 🙂

  • Well, one of the key issues is "and we may need to run reports against it on rare occasions". In my situation that means the data all has to remain on-line in the DB for the full 6-year Record Retention Requirement :(. You may be more fortunate.

    One technique I saw was to select the data out in an XML format. For each exported row, the DBA added a few audit-trail and housekeeping columns like [ExportDateTime], [ExportOperator], [SourceTable], etc., followed by each column name, type, and value.

    So basically you end up with a flat text file that can be archived to permanent media and used to reconstruct the original rows if required. (A rough sketch of that kind of export is at the end of this post.)

    Since you are currently archiving it on-line I assume you have some sort of row-versioning already built into each row. That metadata would help you determine which XML row(s) you need to restore.
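
    For what it's worth, a minimal sketch of that kind of export using FOR XML - the dbo.OrderAudit table, its columns, and the housekeeping values are placeholders, not the original DBA's script:

        -- Sketch only: the table, the AuditDate column, and the housekeeping
        -- values are assumptions for illustration.
        SELECT  GETDATE()        AS ExportDateTime,
                SUSER_SNAME()    AS ExportOperator,
                'dbo.OrderAudit' AS SourceTable,
                a.*
        FROM    dbo.OrderAudit AS a
        WHERE   a.AuditDate < DATEADD(YEAR, -1, GETDATE())
        FOR XML RAW('Row'), ELEMENTS, ROOT('ArchivedRows');

    The resulting XML could then be written out to a flat file with bcp (queryout) or an SSIS package and archived off.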

  • Ah, I think I should have clarified... Customers need to be able to report on the last year's worth of data, but on rare occasions we may run a special-request report for a customer... the request would come through customer service and we'd have a day or two to pull it together. So, fortunately, we don't have to keep ALL the audit data in the production database. Whew.

    We do have several audit metadata columns in the audit tables, such as the date of the audit, bitmasks indicating which data changed, who made the change, etc. This data is also included on the customer-run reports.

    The XML approach is interesting, but I wouldn't mind moving the old data to another DB on another server, for instance.

    We are most concerned, at this point, with the offloading mechanism itself; that is, how best to get the data out of the production database and over to an aux database (or XML flat file, or whatever), and then clean up the prod db... For example, are we better off just performing an INSERT INTO...SELECT and then a DELETE (a rough sketch of what I mean is below)? Perhaps there's some BULK command we should consider?
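
    Something along these lines is what I had in mind - just a sketch, where dbo.OrderAudit, ArchiveDB, and the AuditDate cutoff are placeholder names:

        -- Sketch only: table, database, and column names are assumptions.
        DECLARE @Cutoff datetime;
        SET @Cutoff = DATEADD(YEAR, -1, GETDATE());

        BEGIN TRANSACTION;

            -- copy the old rows to the archive database
            INSERT INTO ArchiveDB.dbo.OrderAudit_Archive
            SELECT  *
            FROM    dbo.OrderAudit
            WHERE   AuditDate < @Cutoff;

            -- then remove them from production
            DELETE FROM dbo.OrderAudit
            WHERE   AuditDate < @Cutoff;

        COMMIT TRANSACTION;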

  • In my environment I cannot just export rows. For each "Item" exported, I would have to build an extended set of rows from a number of tables/databases. I am sure it will involve more than 2 dozen tables from 6 or 7 DBs, with as many as 10 or 12 rows from some of the tables. This entire data set would have to be exported and then deleted basically as a Transaction. I am not looking forward to working that process out 🙂

    My initial plan was to build a mini-dataset to capture the required data and store that data to a new staging database. Then kick off a distributed transaction (?) to remove the data from the requisite rows in the production databases/tables. This will have to be a row-by-row delete due to the nature of the data. (A rough single-table sketch follows at the end of this post.)

    We have also discussed having the actual archiving built into the Application Middle Tier so that it is transparent to the Database server.
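
    At the single-table level, what I have in mind looks roughly like this - just a sketch, where the ArchiveServer linked server, the StagingDB database, the ItemAudit table, and @ItemID are made-up names, and the real process would span a couple of dozen tables:

        -- Sketch only: names are placeholders; BEGIN DISTRIBUTED TRANSACTION
        -- requires MSDTC to be running on both servers.
        SET XACT_ABORT ON;      -- any error rolls back the whole distributed transaction

        DECLARE @ItemID int;
        SET @ItemID = 12345;    -- example item

        BEGIN DISTRIBUTED TRANSACTION;

            -- stage the item's rows on the archive server
            INSERT INTO ArchiveServer.StagingDB.dbo.ItemAudit_Staging
            SELECT  *
            FROM    dbo.ItemAudit
            WHERE   ItemID = @ItemID;

            -- remove them from production in the same transaction
            DELETE FROM dbo.ItemAudit
            WHERE   ItemID = @ItemID;

        COMMIT TRANSACTION;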

  • Looks like your scenario is a good deal more complex than mine, Ray. Thanks for putting things in perspective!

    I am still thinking that some sort of batch is the way to go here... If I were to delete each of my audit rows one by one, my trans log would fill up pretty quickly with several hundred thousand deletes being issued. Maybe temporarily switching the logging method/recovery model is the answer...
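
    Something like this is what I was picturing - just a sketch, with ProductionDB as a placeholder name:

        -- Sketch only: ProductionDB is a placeholder.
        -- If I understand correctly, DELETEs are fully logged under every recovery
        -- model; BULK_LOGGED mainly helps bulk operations (SELECT INTO, BULK INSERT,
        -- index rebuilds), so batching the deletes would still matter.
        ALTER DATABASE ProductionDB SET RECOVERY BULK_LOGGED;

        -- ... run the archive / delete work here ...

        ALTER DATABASE ProductionDB SET RECOVERY FULL;
        -- take a log backup afterwards so point-in-time restore is possible again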

  • I would get the required data that needs to be kept into a new table and swap it with the original by renaming. Once I make sure everything looks good, I would drop my old, renamed table.
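
    Roughly like this - just a sketch, where dbo.OrderAudit and AuditDate are placeholder names, and you would have to recreate indexes, constraints, and permissions on the new table:

        -- Sketch only: table and column names are assumptions.
        -- 1. Copy the rows worth keeping into a new table (SELECT INTO is
        --    minimally logged under SIMPLE or BULK_LOGGED recovery).
        SELECT  *
        INTO    dbo.OrderAudit_New
        FROM    dbo.OrderAudit
        WHERE   AuditDate >= DATEADD(YEAR, -1, GETDATE());

        -- 2. Swap the tables. Indexes, constraints, triggers, and permissions
        --    must be recreated on the new table by hand.
        EXEC sp_rename 'dbo.OrderAudit', 'OrderAudit_Old';
        EXEC sp_rename 'dbo.OrderAudit_New', 'OrderAudit';

        -- 3. Once everything checks out, drop the old table.
        -- DROP TABLE dbo.OrderAudit_Old;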

  • The T-Log is not that much of an issue as long as you remember a couple of key points.

    The process you are implementing does not have to finish quickly in OLTP terms. I have a batch job that (can you believe it) has a Cursor and a "Wait For Delay". (oh the shame:))

    The job runs frequently. When it runs, it loops through, deletes a moderate number of rows, and then waits for 5 seconds. I put in some logic to issue a CHECKPOINT after a larger number of rows. The way the job is timed, it runs just after a periodic T-Log backup, so the log never gets too far out of kilter.

    Once the job has "caught up" with the housekeeping, it usually only finds a few hundred items to delete each run, so the log becomes even less of an issue.

    You could do the same thing with your archive. Run it, say, once an hour: generate a list of rows to archive, then archive one, delete it, delay, loop. In my case I let the job run for 15 minutes or until there is nothing more to delete. Once you catch up the archive (which may take a few days :)), the job would probably never find more than a few dozen items to archive per run. A rough sketch of that kind of loop is below.
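
    Something along these lines - just a sketch, not my actual job; dbo.OrderAudit, its AuditKey key, ArchiveDB, the batch size, and the delay are all placeholders:

        -- Sketch only: names, batch size, and delay are assumptions.
        DECLARE @Cutoff    datetime;
        DECLARE @BatchSize int;
        DECLARE @Deleted   int;

        SET @Cutoff    = DATEADD(YEAR, -1, GETDATE());
        SET @BatchSize = 5000;
        SET @Deleted   = 1;     -- force the first pass through the loop

        WHILE @Deleted > 0
        BEGIN
            -- pick one batch of the oldest rows
            SELECT TOP (@BatchSize) AuditKey
            INTO    #Batch
            FROM    dbo.OrderAudit
            WHERE   AuditDate < @Cutoff
            ORDER BY AuditDate;

            BEGIN TRANSACTION;

                -- archive the batch
                INSERT INTO ArchiveDB.dbo.OrderAudit_Archive
                SELECT  a.*
                FROM    dbo.OrderAudit AS a
                JOIN    #Batch AS b ON b.AuditKey = a.AuditKey;

                -- then delete the same batch
                DELETE  a
                FROM    dbo.OrderAudit AS a
                JOIN    #Batch AS b ON b.AuditKey = a.AuditKey;

                SET @Deleted = @@ROWCOUNT;

            COMMIT TRANSACTION;

            DROP TABLE #Batch;

            -- give the T-Log backups and other activity room to breathe
            WAITFOR DELAY '00:00:05';
        END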
