Incremental Update on Very Large Table

  • We have a table with over 5 million rows (soon to be over 20 million) that stores nodes and the distances between them.

    Create table T1 (colA int, colB int, colDist float)

    These values are recalculated every hour and need to be reloaded into this table each time. However, on any given run, only a small percentage of the rows will change. The table needs to remain available throughout, because it is queried directly by the application.

    My initial thought was to do an incremental update on the table. I planned to bulk load the rows into a staging table

    Create table ST (colA int, colB int, colDist float)

    and then run three queries:

    1. update T1 set colDist = ST.colDist from ST join T1 on ST.colA = T1.colA and ST.colB = T1.colB and ST.colDist <> T1.colDist

    2. delete from T1 where not exists ( select * from ST where colA = T1.colA and colB = T1.colB)

    3. insert T1 (colA, colB, colDist) select * from ST where not exists ( select * from T1 where colA = ST.colA and colB = ST.colB)

    These queries work, but are fairly slow.

    The other potential solution I came up with is loading the data into a staging table, then dropping the main table and renaming the staging table. This, however, leaves the app vulnerable to query errors for the brief moment when the table does not exist.

    Does anyone have any other ideas?

    Regards, Jim C

  • If you are on Enterprise Edition, you may want to consider table partitioning. SQL Server then allows you to swap individual partitions in and out, and each partition can have its own indexes, so inserting these rows could be much faster.

    For more info see BOL (ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/7d913c8d-b79b-4b1f-93b9-098dd33b07a8.htm)
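    To give a feel for it, here's a minimal sketch of partition switching; the partition function, scheme, and boundary value below are made up for illustration, and the staging table must match the target table's structure, live on the same filegroup, and carry a check constraint covering the target partition's range:

    Create partition function pfNodes (int) as range right for values (10000000)
    Create partition scheme psNodes as partition pfNodes all to ([PRIMARY])
    Create table T1 (colA int not null, colB int, colDist float) on psNodes (colA)

    -- Staging table constrained to partition 2's range (colA >= 10000000)
    Create table ST (colA int not null, colB int, colDist float,
        constraint CK_ST_range check (colA >= 10000000)) on [PRIMARY]
    -- ... bulk load ST here ...

    -- Both switches are metadata-only, so they complete almost instantly
    Create table T1_old (colA int not null, colB int, colDist float) on [PRIMARY]
    Alter table T1 switch partition 2 to T1_old
    Alter table ST switch to T1 partition 2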

    Regards,

    Andras


    Andras Belokosztolszki, MCPD, PhD
    GoldenGate Software

  • Table partitioning is probably the way to go, but another method is to use views and/or stored procedures with two different tables. Depending on referential integrity issues, you could truncate and then bulk load the table not currently in use, then use ALTER statements to change which table the view/stored procedure references. Any application activity should be oblivious to the ALTER statement; at most, it would be delayed a few seconds while the command completes. Of course, you would want to trap for the ALTER command failing.

    The next go-around, you repeat the bulk load/ALTER with the other table. This does present a potential "space" problem, since there are effectively two copies of the table at some point. The good news is that bulk loads and truncates don't take much (if any) log space.
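    In case it helps, here's a rough sketch of that pattern; the table and view names are made up:

    Create table T1_A (colA int, colB int, colDist float)
    Create table T1_B (colA int, colB int, colDist float)
    go
    -- The application only ever queries the view
    Create view vT1 as select colA, colB, colDist from T1_A
    go
    -- Hourly load: truncate and bulk load the table the view is NOT using
    truncate table T1_B
    -- ... bulk insert into T1_B here ...
    go
    -- Repoint the view; readers see the switch almost instantly
    Alter view vT1 as select colA, colB, colDist from T1_B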

    James.

  • While partitions are elegant, they do require some time to look into (I should have mentioned this). James's solution is simpler and could work too. I would, however, switch to the bulk-logged recovery model for the load; some bulk operations are actually fully logged under the full recovery model.
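    For the load window, that switch might look like this (the database name is hypothetical; note that bulk-logged sacrifices point-in-time restore across the load, so take a log backup afterwards):

    Alter database MyDB set recovery bulk_logged
    -- ... truncate / bulk load here ...
    Alter database MyDB set recovery full
    backup log MyDB to disk = 'C:\Backup\MyDB_log.bak'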

    Regards,

    Andras



  • Have you tried playing with the order of the queries a bit? It looks to me like you could gain a lot by putting the deletes first and making sure your indexing is dead on.

    Something like:

    1. delete from T1 where not exists ( select colA from ST where ST.colA = T1.colA and ST.colB = T1.colB)

    2. delete from ST where exists ( select colA from T1 where T1.colA = ST.colA and T1.colB = ST.colB and T1.colDist = ST.colDist)

    3. update T1 set colDist = ST.colDist from ST join T1 on ST.colA = T1.colA and ST.colB = T1.colB and ST.colDist <> T1.colDist

    4. insert T1 (colA, colB, colDist) select colA, colB, colDist from ST where not exists ( select colA from T1 where T1.colA = ST.colA and T1.colB = ST.colB)

    The hope is that between steps 1 and 2 you reduce the size of your tables (T1 a little, ST a lot), so that the update and insert don't take so long. Alternatively, you could rerun step 2 after step 3, so that the insert could run without a WHERE clause.
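    For those correlated lookups to be cheap, both tables would want an index on the join key. Assuming (colA, colB) uniquely identifies a row, something like (index names are mine):

    Create unique clustered index IX_T1_colA_colB on T1 (colA, colB)
    Create unique clustered index IX_ST_colA_colB on ST (colA, colB)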

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • What would be more interesting is to find out why you're storing this data in a table to begin with... distance calculations are not that difficult, even when based on Lat/Lon.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I'm gonna agree with Jeff, given two points with lat/long, calculating the distance between the two points is pretty straightforward/efficient...
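    For what it's worth, a haversine (great-circle) calculation fits in a few lines of T-SQL. A sketch, assuming plain lat/lon in degrees; the sample coordinates are just for illustration:

    declare @lat1 float, @lon1 float, @lat2 float, @lon2 float
    select @lat1 = 42.3601, @lon1 = -71.0589,   -- Boston
           @lat2 = 40.7128, @lon2 = -74.0060    -- New York

    declare @dLat float, @dLon float, @a float
    select @dLat = radians(@lat2 - @lat1), @dLon = radians(@lon2 - @lon1)
    select @a = square(sin(@dLat / 2))
              + cos(radians(@lat1)) * cos(radians(@lat2)) * square(sin(@dLon / 2))
    select 3958.8 * 2 * atn2(sqrt(@a), sqrt(1 - @a)) as DistanceMiles   -- 3958.8 = Earth radius in miles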

    Joe

  • Hi Jeff,

    Thanks for the input. The developers are generating these distances with clustering algorithms that they say can't be called at run time for individual distances. My query on this imported table is essentially "return all items in the same group as my input item, along with the distance to each". As far as I can see, they are right that this can't be done at run time.
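    If I'm reading that right, the lookup amounts to something like this against the table above (the parameter name is mine):

    select colB, colDist
    from T1
    where colA = @inputItem
    order by colDist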

    Jim

    Regards, Jim C

  • Matt,

    I've tried your method of doing the deletes before the insert and update. The problem with this is that the indexes are restructured during the delete, and on 5 million rows this takes a considerable amount of time. The delete statements actually run slightly faster without indexes, even doing full table scans on both tables. It just seems that no matter how I approach these statements, they incur a large amount of overhead rearranging the indexes.
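    If the overhead is mostly in nonclustered indexes, one possible compromise on 2005 would be to disable them for the maintenance window and rebuild afterwards; the index name here is hypothetical (and don't disable the clustered index, since that takes the table offline):

    Alter index IX_T1_colB on T1 disable
    -- ... run the delete/update/insert batch ...
    Alter index IX_T1_colB on T1 rebuild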

    Jim

    Regards, Jim C

  • Table partitioning would have the advantage that individual partitions have their own indexes, so the cost of the deletes would be lower. Of course, this is a longer-term solution.

    Regards,

    Andras



  • Hi Jim,

    I do the exact same type of "blot" or "grouped" distance calculations your developers are talking about... find the geographic center of a group... find all items within a given range of a given center point... find overlapping groups... etc., etc. The calculations are no biggie. But, no worries... ya gotta do what ya gotta do.

    Moving on to solve your problem... it sounds to me, from your queries, like your staging table contains ALL of the data (~5 million rows) that the production table does... is that correct? If it is, don't even bother trying to do a shoe-horn merge of the data like you are... in fact, don't even try to update the production table. Just make sure that the "staging" and "production" tables are identical, and make a view that points to the latest version of the table.

    When one "offline" table gets updated, ALTER the view to point to that most recently updated table. On the next update, load the table the view is NOT pointing at and then ALTER the view to point to that table. It's like doing a snap rename of a synonym in Oracle, and the ALTER only takes milliseconds. No one should look at the tables directly... they should only look at the view, which will always point to the latest data.

    And, even better (just realized this is a 2k5 forum), I believe that 2k5 has "Synonyms"... use one of those in place of the view I described!
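    With a synonym, the swap is a drop-and-recreate rather than an ALTER; a sketch, with the synonym name made up:

    Create synonym dbo.T1_current for dbo.T1_A
    go
    -- After loading T1_B, repoint the synonym in one short transaction
    begin transaction
        drop synonym dbo.T1_current
        create synonym dbo.T1_current for dbo.T1_B
    commit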

    --Jeff Moden



  • Thanks for the input, Jeff. I have started working on the view-based approach, but I will look into Synonyms and see how they can benefit this.

    In regard to your distance calculations, is this something you do within T-SQL, or are you using another language to do the work?

    Jim

    Regards, Jim C

  • Yes... 100% SQL Server.

    What are your distance calculations for?  Telephony, shipping, flights, or ???  And what is the source of your data... does it have only Lat/Lon or does it have V&H coordinates based on the Donald Projection Maps of certain areas of the world?

    The reason for my questions is... maybe I can help...

    --Jeff Moden



  • The distances are actually not physical distances. The overall picture is a search system for objects that have multiple attributes. The grouping, or "clustering" as we call it, groups the objects by similar attributes. The attributes are highly complex. Each object's distance to another object indicates how similar the two objects are. I don't have the exact details from the developers, but I will try to get more info from them.

    Regards, Jim C

  • Great... thanks for the feedback, Jim.

    Just more info on what they wanted us to do at work... they wanted a similar precalculated table of the distance from every NPA/NXX (area code and exchange) to every other NPA/NXX. Once I showed them how easy it was to calculate on an as-needed basis, they were pretty happy to use the calculations instead of the absolute monster table they thought they were going to need to build. Coupled with a ZipCode/NPA/NXX table, we were able to compute distances quite easily.

    Not sure what all the "attributes" you have are, but sure looking forward to seeing what you and your developers are doing.

    Thanks, again...

    --Jeff Moden


