Improving the performance of a Data Mining proc

Question

Post reply

Improving the performance of a Data Mining proc

Brendon

Ten Centuries

Points: 1252
More actions
August 15, 2002 at 7:01 pm

#105251

Hi,

This issue is not really a problem - I'm more curious to know if there is any way to improve the performance of a stored proc which effectively mines data every morning.

The situation is the daily rebuild of a reporting table, which is based from a transaction history table containing around 30,000,000 records. The sql to do this is as follows:

SELECT * INTO reporting_table

FROM transaction_history th1

WHERE NOT EXISTS

(SELECT *

FROM transaction_history th2

WHERE th1.field1 = th2.field1 AND

th1.field2 = th2.field2 AND

th1.field3 = th2.field3 AND

th2.trans_date > th1.trans_date

)

As you can see, the table is referencing itself to get the last existing record based on the required key(fields). The process takes around 1hr 15min to get approx 1.7 million records.

Is there a way to improve the performance with an index (index tuning wizard was no real help), or a change in the structure of the sql.

I've already thought of changing the process to be a cumulative effect instead (finding only those records that haven't been selected already), but I was wondering if there was anything else that could enhance what is currently.

thanks

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Andy Warren SSC Guru Points: 119747 More actions · Answer 1

What do you have indexed? Where is the clustered index? You could optimize some by creating a permanent table and just truncating it each time. Not sure how good 'not exists' is, nots are usually not great. Certainly changing to only process changed records might help. How about using a trigger to keep the reporting table current as you go? Do you really need to select * into the reporting table? How about just putting the primarykey? That would reduce disk io a lot. What kind of hardware do you have?

Andy

http://qa.sqlservercentral.com/columnists/awarren/

Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter

Brendon Ten Centuries Points: 1252 More actions · Answer 2

I've tried different indexes with little result. A cluster index isn't really going to help, considering that the history table is updated daily with 5000+ records.

I don't particularly like "Not Exists", yet the only other way of trying to achieve the same ends is by using a cursor (yuck!).

Unfortunately the table only has 7 columns and the primary key is a compound of 5 of them. Apart from that, I also need all fields for other purposes.

Platform is NT 4.0, running on SQL 7 on twin PIII 550 with 780Mb RAM, SCSI equipped.

I'm experimenting with a cumulative option, and will see how it goes....

Andy Warren SSC Guru Points: 119747 More actions · Answer 3

If you put just the primary key into your table you'd just have to join back to get the other details. Would be interesting to see if it made much difference, you could just add an identity col to your table and index unique no nulls, treat it like a pkey.

Suggest post the query asking for alternative solutions, pretty sure can be done without not exists without a cursor.

Not sure I agree about clustered - then again, I havent tested your scenario!

780m is not a lot. Hows the server holding up during the query? Could you post the query plan?

Andy

http://qa.sqlservercentral.com/columnists/awarren/

Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter

NPeeters SSChampion Points: 12358 More actions · Answer 4

This might be a method to get rid of the NOT EXISTS ... (I think). Don't know about performance though...

SELECT * INTO reporting_table

FROM transaction_history th1

WHERE th1.trans_date =

(SELECT MAX(th2.trans_date) FROM

transaction_history th2

WHERE th1.field1 = th2.field1 AND

th1.field2 = th2.field2 AND

th1.field3 = th2.field3)

Edited by - NPeeters on 08/16/2002 05:03:13 AM

Brendon Ten Centuries Points: 1252 More actions · Answer 5

cheers,

gave the max date a whirl, but it was still going after 2 hours, so not much benefit there.

I tried a cumulative option, but it's harder than I first thought, since every record added may be replacing one already there - ie. it's the last record occurence for that group, so adding records isn't the go.

I'll post the question on T-sql forum and see what people suggest.

thanks

Prakash Heda SSCertifiable Points: 5539 More actions · Answer 6

A different thought.....

Apprently there is not much can be done but if you follow this method performance will certainly improve......

SP_PH_WHO2

SELECT

FIELD1 + FIELD3 + FIELD3 AS COMB_FIELDS,

FIELD1,

FIELD3,

TRANS_DATE

INTO ##TMP_TRANSACTION_HISTORY

FROM TRANSACTION_HISTORY

CREATE CLUSTERED INDEX P_##TMP_TRANSACTION_HISTORY ON ##TMP_TRANSACTION_HISTORY(COMB_FIELDS,TRANS_DATE) WITH PAD_INDEX,FILLFACTOR = 100,STATISTICS_NORECOMPUTE

SELECT * INTO REPORTING_TABLE

FROM ##TMP_TRANSACTION_HISTORY TH1

WHERE NOT EXISTS

(SELECT *

FROM ##TMP_TRANSACTION_HISTORY TH2

WHERE TH1.COMB_FIELDS = TH2.COMB_FIELDS AND

TH2.TRANS_DATE > TH1.TRANS_DATE

)

DROP TABLE ##TMP_TRANSACTION_HISTORY

This method will not take more than 30 minutes on 3 million records.

Three things are here

1. Adding theses columns will not create any problem as you want exactly theses three columns to be matched

2. Move processing into tempdb as that is usually 0+1 drive and logging is much less there.

3. Clustered index will make it very fast for comparing

In REPORTING_TABLE table if you need some extra columns to be added add it in the first step while inserting data into ##tmp table.

Cheers,

Prakash

Prakash Heda
Lead DBA Team - www.sqlfeatures.com
Video sessions on Performance Tuning and SQL 2012 HA

Antares686 SSC Guru Points: 125444 More actions · Answer 7

Note really sure you need the subquery as you have it.

SELECT * INTO reporting_table

FROM transaction_history th1

WHERE th1.trans_date =

(SELECT MAX(th2.trans_date) FROM

transaction_history th2

WHERE th1.field1 = th2.field1 AND

th1.field2 = th2.field2 AND

th1.field3 = th2.field3)

or

SELECT * INTO reporting_table

FROM transaction_history th1

WHERE NOT EXISTS

(SELECT *

FROM transaction_history th2

WHERE th1.field1 = th2.field1 AND

th1.field2 = th2.field2 AND

th1.field3 = th2.field3 AND

th2.trans_date > th1.trans_date

)

Looks to me like you want to get the MAX value of th1.trans_date but you are adding overhead by having to match values first.

Try

WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

Now the other thing is SELECT INTO is notorious for locking on tables and can lead to other issues.

Instead it is considered to do INSERT (INTO) instead. Now I assume the reporting_table is already built so you should be able to do

INSERT reporting_table (collist)

SELECT collist FROM transaction_history th1

WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

If this is slow then try do

SET SHOWPLAN_TEXT ON

GO

INSERT reporting_table (collist)

SELECT collist FROM transaction_history th1

WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

GO

SET SHOWPLAN_TEXT OFF

GO

And post the execution plan here so we can look at what is happening and offer suggestions.

"Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

Prakash Heda SSCertifiable Points: 5539 More actions · Answer 8

Prakash Heda

SSCertifiable

Points: 5539

August 22, 2002 at 12:23 am

#435000

Confused, can you explain what exactly are you trying?

Finding newly added records or for each combination of (fld1 + fld2 + fld3) latest records........

Cheers,

Prakash

Prakash Heda
Lead DBA Team - www.sqlfeatures.com
Video sessions on Performance Tuning and SQL 2012 HA