Data Warnings / Ignore Checked Results

  • Not looking for any code, just ideas.

    I have a series of queries which run regularly, which return data rows which we consider to be potentially incorrect or worthy of being checked. These queries run against any database/table deemed particularly interesting/important.

    Imagine a scenario where 75 rows of data are identified, across three different queries. 60 of these rows have been checked and found to be OK – not an issue. I'd like to exclude these rows from the results the next time the query runs, so that focus can be given to those rows which have not been assessed as OK.

    I can think of no flexible, straightforward and easy-to-maintain way of excluding these results. Some ideas I've had:

    1. Build an exclusion table for every query and populate it with the PKs of the rows to be excluded. Modify the original query and include AND PK NOT IN (SELECT PK FROM EXCLUSION TABLE)
    2. Modify the query code to do the same as above
    3. Same as the first option, but give every query a unique number and build a single exclusion table, then use AND PK NOT IN (SELECT PK FROM EXCLUSION TABLE WHERE QUERYNUM = n).
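    As a rough sketch of option 3 (all table and column names below are illustrative, not from the original post):

    ```sql
    -- Hypothetical single exclusion table shared by all checking queries
    CREATE TABLE dbo.CheckExclusion
    (
        QueryNum   int NOT NULL,  -- identifier assigned to each checking query
        ExcludedPK int NOT NULL,  -- PK of the row assessed as OK
        CONSTRAINT PK_CheckExclusion PRIMARY KEY (QueryNum, ExcludedPK)
    );

    -- Original checking query, amended to skip rows already assessed as OK
    SELECT t.PK, t.SomeColumn
    FROM   dbo.SomeMonitoredTable AS t
    WHERE  t.SomeColumn IS NULL            -- whatever the original "suspect" condition is
    AND    t.PK NOT IN (SELECT e.ExcludedPK
                        FROM   dbo.CheckExclusion AS e
                        WHERE  e.QueryNum = 3);  -- the n assigned to this query
    ```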

    If I had just one or two of these queries and they were fairly static, either option one or two would work OK, with the first being easier to maintain. But that's not the case: queries can be added or removed fairly regularly (though the numbers are manageably low – once or twice a month). Option three would handle this.

    I can't be the first person thinking about implementing this. Any ideas or suggestions, anyone?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • You could add a tinyint (or bit) indicator to the table row itself indicating whether that row has been reviewed already.  (Maybe tinyint so that there are bits for other statuses, if you decide you need them.)  Of course that requires modifying the tables themselves, but it would be more efficient.
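    A minimal sketch of that idea, assuming a hypothetical monitored table (names are illustrative):

    ```sql
    -- Add a review-status column; tinyint leaves room for extra statuses later
    ALTER TABLE dbo.SomeMonitoredTable
        ADD ReviewStatus tinyint NOT NULL
            CONSTRAINT DF_SomeMonitoredTable_ReviewStatus DEFAULT (0);  -- 0 = not yet reviewed

    -- The checking query then simply skips reviewed rows
    SELECT t.PK, t.SomeColumn
    FROM   dbo.SomeMonitoredTable AS t
    WHERE  t.SomeColumn IS NULL      -- the original "suspect" condition
    AND    t.ReviewStatus = 0;       -- 1 could mean "checked and OK"
    ```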

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

    Taking a significantly different angle on this, and apologies if it's too far outside the scope of your question, but could you use Change Data Capture (CDC) to identify changed rows? Then you could do something like this:

    One-time process per table:

    Step 1: Select changes from CDC stored procedure

    Step 2: Evaluate rows

    Step 3: Store the highest lastlsn of returned rows

    Repeated process per table:

    Step 1: Select changes from CDC stored procedure since the lastlsn

    Step 2: Evaluate rows

    Step 3: Store the highest lastlsn of returned rows

    Please note: I have only implemented this on a small test system. I have no idea how well it would scale in production.
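    The repeated process above might look roughly like this. This is a sketch, assuming CDC is already enabled on the database and table; the capture instance, watermark table, and table names are placeholders:

    ```sql
    -- Read changes since the last LSN we processed (stored somewhere persistent)
    DECLARE @from_lsn binary(10) =
        (SELECT LastLsn FROM dbo.CdcWatermark WHERE TableName = 'dbo_SomeMonitoredTable');
    DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();

    -- Step 1: select the changed rows for evaluation
    SELECT *
    FROM   cdc.fn_cdc_get_all_changes_dbo_SomeMonitoredTable(@from_lsn, @to_lsn, N'all');

    -- Step 3: after evaluating the rows, record the new high-water mark
    UPDATE dbo.CdcWatermark
    SET    LastLsn = @to_lsn
    WHERE  TableName = 'dbo_SomeMonitoredTable';
    ```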

  • Scott and as1981, thank you for taking the time to respond.

    These responses at least suggest that I did not miss anything obvious.

    Unless I have misunderstood, CDC does not achieve what I wish to achieve. If there is a row which has not been marked as OK, that row should continue being returned by the SELECT query. I cannot see how the proposed solution would accommodate this.

    Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.

    I think I will proceed with my suggested option 3.

  • Apologies. I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.

  • as1981 wrote:

    Apologies. I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.

    I thought so. No need to apologise. I thank you once again for taking the time to respond.

  • Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.
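    A sketch of the two equivalent filters, using illustrative table names:

    ```sql
    -- NOT IN form
    SELECT t.PK
    FROM   dbo.SomeMonitoredTable AS t
    WHERE  t.PK NOT IN (SELECT e.ExcludedPK
                        FROM   dbo.CheckExclusion AS e
                        WHERE  e.QueryNum = 3);

    -- Outer-join form: unmatched rows have NULL on the exclusion side
    SELECT t.PK
    FROM      dbo.SomeMonitoredTable AS t
    LEFT JOIN dbo.CheckExclusion     AS e
           ON e.ExcludedPK = t.PK
          AND e.QueryNum   = 3
    WHERE  e.ExcludedPK IS NULL;
    ```

    Worth noting that `NOT EXISTS` is often a safer third option: `NOT IN` returns no rows at all if the subquery can ever return a NULL.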

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... -- Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Grant Fritchey wrote:

    Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.

    Thanks, Grant. To be honest, I'd probably have implemented it this way anyway.

    My original NOT IN text perhaps made the question easier to understand than a LEFT JOIN ... IS NULL.

  • Option 3, but with the addition of some fields:

    • time created -- used to determine how long it has been identified as an issue
    • validated by
    • validated date
    • validation status (ok/not ok/don't care/don't like the boss)

    And I would make it so that a PK can be removed and re-added. What if data was fixed a year ago and has gone bad again? The fact that it was reported and validated a year ago would cause it to be ignored now.
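    An illustrative version of option 3 with the suggested audit columns (all names hypothetical):

    ```sql
    CREATE TABLE dbo.CheckExclusion
    (
        QueryNum         int          NOT NULL,
        ExcludedPK       int          NOT NULL,
        TimeCreated      datetime2(0) NOT NULL   -- how long has it been flagged as an issue?
            CONSTRAINT DF_CheckExclusion_TimeCreated DEFAULT (SYSDATETIME()),
        ValidatedBy      sysname      NULL,
        ValidatedDate    datetime2(0) NULL,
        ValidationStatus varchar(20)  NULL,      -- e.g. 'ok' / 'not ok' / 'don''t care'
        CONSTRAINT PK_CheckExclusion PRIMARY KEY (QueryNum, ExcludedPK)
    );

    -- Deleting a (QueryNum, ExcludedPK) row puts the data row back under scrutiny,
    -- so an exclusion can be removed and later re-added if the data goes bad again.
    ```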

  • Phil Parkin wrote:

    Scott and as1981, thank you for taking the time to respond.

    Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.

    I think I will proceed with my suggested option 3.

    No objections to a combined validation table.  I would point out, though, that if you were to go the is_valid flag route, I'd just add it to all tables immediately, so there would be no future maintenance.  This is much easier to do if all tables are only accessed via a view, so that the underlying physical table can be very easily changed with no effect on existing table uses.
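    A sketch of that view indirection (names illustrative; `CREATE OR ALTER` needs SQL Server 2016 SP1 or later):

    ```sql
    -- The physical table gets the new column...
    ALTER TABLE dbo.SomeMonitoredTable
        ADD is_valid bit NOT NULL
            CONSTRAINT DF_SomeMonitoredTable_is_valid DEFAULT (1);

    -- ...but applications only ever reference the view, whose column list is unchanged
    CREATE OR ALTER VIEW dbo.SomeMonitoredTable_v
    AS
    SELECT PK, SomeColumn        -- is_valid deliberately not exposed here
    FROM   dbo.SomeMonitoredTable;
    ```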

  • Just a thought, and probably not a very good one. But in lieu of status flags validating rows, or exclusion tables, why not move suspect/invalid rows to other tables with similar layouts? Perhaps with a few extra columns such as the date removed, a code for the reason the row was considered suspect, and the query number, as you described in your initial post. If and when rows are corrected, they can be restored to the "live" table.

    Although there is the workload of copying them back and forth, there is no need for exclusion code. Your "normal" code would run faster for not having to test against an exclusion table every time. I understand that this approach isn't going to be usable if rows are only considered suspect for some queries, but not others. But I've always been of the opinion that garbage data is garbage data.
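    One way to sketch the move-to-quarantine step in T-SQL is a DELETE with an OUTPUT ... INTO clause, which removes and archives the rows in a single statement (table, column, and code names below are hypothetical):

    ```sql
    -- Move suspect rows out of the live table into a same-shaped quarantine table
    DELETE t
    OUTPUT DELETED.PK,
           DELETED.SomeColumn,
           SYSDATETIME(),        -- date removed
           'R01',                -- reason code for suspicion
           3                     -- number of the query that flagged the row
    INTO   dbo.SomeMonitoredTable_Quarantine
           (PK, SomeColumn, DateRemoved, ReasonCode, QueryNum)
    FROM   dbo.SomeMonitoredTable AS t
    WHERE  t.SomeColumn IS NULL;  -- the "suspect" condition

    -- Restoring corrected rows is a plain INSERT ... SELECT back the other way
    ```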

    Good luck, whichever way you go.


    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
