Incremental loading of cube

  • Hi

    How can I do incremental loading for the following scenario, rather than a full refresh of the cubes:

    The FactTable has 80 million records and the average increment per month is 6 million records.

    Thanks


  • I need to process only 6 million records rather than 86 million, which is a pain with a full refresh (it takes hours to complete)...

    If an incremental update is done, I believe it will take around an hour or even less.

    Can anyone please let me know how this can be done?

    Thanks for your help.

    Vik


  • In AS 2000 (not sure about AS 2005), when you right-click on a cube and select Process, the next screen has three radio buttons: incremental update, refresh data, and full process. If you choose the first one (incremental update) and follow the screens through, you can specify a WHERE clause to restrict the data processed. You could specify a month date to limit the update to a specific (e.g. the current) month.
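
    For example, assuming the fact table has an integer date key column called DateKey (a hypothetical name, so adjust it to your own schema), a filter restricting the update to a single month might look something like this:

        DateKey >= 20080601 AND DateKey < 20080701

    Only the rows matching that expression are read from the fact table and added to the cube during the incremental update.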

    Have a look at Books Online for more details.

    Jez

  • With that volume of records, you should be using partitioning. If you do, not only should your query response times improve, but it would also allow you to load specific partitions with data (e.g. today, Tuesday and Monday partitions, or perhaps a 'current week' type partition).

    Steve.

  • Our SSIS packages are designed in such a way that they load data using TRUNCATE/INSERT, i.e. a FULL load. The result is that every time, the cube is refreshed with the full volume of data: all dimensions and cube partitions are fully processed, which is why it takes hours. Please let me know what mechanisms I should use to achieve an incremental load of data into the tables first using SSIS, and after that what mechanisms can be used to load the cubes incrementally.


  • ?? Anyone...


  • Ok, first, you have to understand that an incremental load may or may not be a potential solution. Incremental loads typically only work when you have very slowly changing dimensions. (Adding records to a dimension is ok.) Typically, this means that you're adding the latest period (time dimension). What has to happen for any attribute that is updated is that the affected aggregations must be recalculated, which can negate the benefits of an incremental load. Of course, I'm expecting some dissenting opinions here, but it can be a complex topic depending on your design. (Great to play and benchmark though.)

    Let's look at your overall processing time. 86 million records frankly isn't large. In my case, I process 1.4 billion records in about 2 hrs. Here's how:

    - separate (if you can) the database from the MSAS server. There are lots of reasons to run with a dedicated MSAS server (depending on usage).

    - During the processing, check how busy your MSAS server is. You need to work out strategies to maximize its throughput, both during processing and at query time.

    - I recommend partitioning your MSAS cube based on query requirements. Shoot for a maximum of about 10 to 12 million rows per partition, even less if it's practical.

    - cluster your data table based on the above partition scheme. That way, you'll get the data out of the database fastest; see the sketch at the end of this post. (Partition the table based on your incremental load strategy.)

    - don't overtax your database server. There's a law of diminishing returns. I limit the connection pool to 1 connection per CPU. The processing threads for MSAS I limit to 2 per CPU (1 per CPU in a shared environment). Play around with this; don't just use the defaults.

    This should help you speed things up a bit.
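
    As a rough sketch of the clustering point above (the table and column names are assumptions, not taken from your actual schema):

        -- Cluster the fact table on the same date key used to scope the cube partitions,
        -- so each partition's source query becomes a contiguous range scan.
        CREATE CLUSTERED INDEX IX_FactTable_DateKey
            ON dbo.FactTable (DateKey);

    With that in place, a partition source query restricted to, say, one month of DateKey values reads a contiguous slice of the table rather than scanning all 86 million rows.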

  • Dave,

    I am running MSAS on a virtual server with a single processing core. I plan to create partitions for the current month and other partitions based on year (the 2008 year partition will not have the current month in it). I will have about 40 million rows per month. I plan to incrementally update the 2008 partition with the previous month's data each month, which will take less than an hour.

    You said that I should aim for about 10 million rows per partition. Currently, I would end up with about 500 million rows in a full-year partition. As I only have a single processor, will partitioning by month make any significant difference?

    I know that the SQL database engine will process in parallel if it can, but with only a single processor core it won't, so I am limited in how far I can optimise database processing; I don't know whether parallelism works the same way with MSAS.

    Jez

  • Jez - As Dave mentioned, you really should try for smaller partition sizes, circa 20 million rows or fewer. Have you checked any of the performance/admin guides for AS2K5 or even the Project REAL documentation?

    Vik - I hate to say it, but it sounds like you may want to rethink your SSIS/ETL strategy. Depending on your source data, you may be able to rework the data load to pull only what you need. Doing a truncate and then a full load sounds very much like you aren't partitioning your data mart table(s). I once worked on a project where we had a source system (internally written) that had basically no timestamping of records when they'd been updated. Initially we would work with full loads, but at some point your window of opportunity (for processing) closes up. We ended up taking 'windows' of records from the system based on a transactional date that was available to us, and then each month we'd perform a full refresh.
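
    As a minimal sketch of that kind of windowed, watermark-based load (all table and column names below are hypothetical, and in practice the logic would sit inside the SSIS/ETL packages rather than a standalone script):

        -- Find the high-water mark already loaded into the mart,
        -- then pull only the newer rows from the staging table.
        DECLARE @LastLoaded DATETIME;

        SELECT @LastLoaded = ISNULL(MAX(TransactionDate), '19000101')
        FROM dbo.FactTable;

        INSERT INTO dbo.FactTable (TransactionDate, CustomerKey, ProductKey, Amount)
        SELECT s.TransactionDate, s.CustomerKey, s.ProductKey, s.Amount
        FROM dbo.StagingTransactions AS s
        WHERE s.TransactionDate > @LastLoaded;

    The same watermark can then drive the cube side: only the partition (or incremental update) covering the newly loaded window needs to be processed.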

    Steve.

  • Ok, some additional thoughts:

    Running a cube with 500 million rows on a single core (or the database, for that matter) is fairly aggressive unless you don't mind the wait. Running it on a virtual server, well, you get what you put into it. Running all of this on the same machine (did I get that right?), ouch!

    The idea of partitioning is that when MSAS has to run a query, if it can satisfy the query from a single partition (or just a few), then it only has to access that information. Much faster. But if you need to access many partitions (e.g. for a grand total), you'll get an initial performance hit. (Then the results should end up in the cache.)

    As well, multiple partitions mean you can process (and query) the data concurrently, taking full advantage of the hardware. But on a single-CPU box, you probably don't have much headroom as it is.

    Also, allocate as much memory as you can. It's not obvious at first, but MSAS depends on the file cache to speed up I/O (as opposed to caching the data itself).

    Again, only partition on time if you expect a typical "restricted query" to display only a month's worth of data or so. If trended information is important (it usually is), then partitioning on a different dimension usually yields better results.

    Finally, parallelism typically depends on the ability of MSAS to break the query into parallel requests. It's been my experience that the best way to influence this is via the partition design.
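
    For example, if the cube were partitioned on a region attribute rather than on time (the names below are hypothetical), each partition's source query would restrict on that key and leave time unrestricted, so a trended query only touches the partitions for the regions it asks about:

        -- Source query for a hypothetical 'Europe' partition.
        SELECT f.DateKey, f.CustomerKey, f.ProductKey, f.Amount
        FROM dbo.FactTable AS f
            INNER JOIN dbo.DimGeography AS g
                ON g.GeographyKey = f.GeographyKey
        WHERE g.Region = 'Europe'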

  • Dave,

    Thanks for the info.

    This work is part of a larger project to migrate a database from SQL 7 to (don't laugh) SQL 2K - I know it doesn't make sense and we should go straight to SQL2K5, but a virtual server with SQL2K is all we have to play with.

    Yes, the server has to run the database as well, and the business will have to accept the performance issues that arise.

    The queries that the users will hit the cube with will typically look at the last 4 months or the last 13 months. A few will look at data since January of the previous year. Ideally, I would use named sets for 'last 4 months' and 'last 13 months', but the users will access the cube through Excel 2003 (bleeding edge technology again, I'm afraid - or not) and I cannot find a way for Excel to access a named set. At present, the cube will return a full set of data for each query - not perfect, but the best I can do at the moment. I had thought of using calculated members to create measures that only had the last 4 or 13 months of data, which might or might not work - I haven't had the time to create these and see how they perform. Any ideas or suggestions on providing rolling time periods?

    I'm limited on the memory I can allocate as the server only has 2 GB and I have to share that with the database.

    Jez

  • Steve,

    I confess to being a bit lazy and not wanting to create a new partition each month. I want this to be as automated as possible, so if that is what I need to do, how would I do it using DTS? We are still on SQL2K.

    Jez

  • You can definitely automate the creation of the partitions; it's not necessarily easy though 😉

    Probably not something that helps your cause here, but I am hoping that someone has mentioned to whoever is sponsoring your move to SQL2K that official mainstream support ended earlier this year; extended support runs for about another 5 years (see here). Maybe they should simply move to SQL2K8?

    Steve.
