Data Cleansing

  • Garbage in, garbage out. No truer words spoken by the DBA. I love seeing articles like this, no, not those picking on the various credit database companies. Articles that look at data quality and how technology isn't at fault in many cases.

    As a DBA, I've been trusted with guarding the data for many companies. I've been in charge of making sure it's available, protected, and intact. That what data is in the database doesn't get corrupt or inappropriately changed.

    However, I can only work with what I'm given. If data entry people put in bad data, what can we do? If lists we load are bad, is it our fault? Can't the ETL processes weed this out?

    The answer is somewhat. Those of us that are good have learned to protect ourselves without backout plans to remove inappropriate data, with routines that clean and transform data into standard formats and even values in cases, and more. However we are still often at the mercy of the providers of data, the editors.

    I can prevent someone from deleting all rows in a table (except for myself, of course), but I can't always stop Joe in accounting from entering $350,000 instead of $3500. Or vice versa. At some point I have to trust and allow that users will work with the data and could make mistakes. The DBA, the most visible guy that's working with the data, isn't always at fault.

    Maybe I should keep this URL in the DR handbook to show the CIO the next time I have a "Whoops"?

    Steve Jones

  • You can lead a user to data, but you cannot make them think.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Ed I love your quote and remember its not possible to under estimate the the IQ of a typical user

    Mike 

  • Hello Steve,

    Being a D.B.A. in regards to the database is like being the father of a 13 year old girl. Your job is to protect her and keep her pure. Anything less is cause for recrimination. I had a friend who became a grandfather when his daughter was 13. So while he could blame the boy all day long but he allowed it to happen. Our role in protecting the data is somewhat the same. I have at times explained this to developers and it is easier to deal with them once they understand how I feel about protecting the data.

    Best wishes,

    Barry O'Connell

  • Some of the things we do in our shop:

    Invoice quantities have thresholds by product to throw an "excess qty" warning.

    Heavy data entry gets a second person doing validation. Double data entry is another possibility.

    You can't catch everything, and we're set up to easily and accountably make corrections for the ones we miss, but there are definitely ways to minimize the problem on the front end.

  • Expanding on Dennis's comments a little:

    <from the editorial> "...but I can't always stop Joe in accounting from entering $350,000 instead of $3500..."

    Presumably, there are some business rules that dictate reasonable values for these data.

    For instance, "house sale price must be reviewed if sale price is below $70,000 would prevent the 3,500 going into a house sale price column (at least without review).

    Likewise, "biweekly salary is less than $15,000" would prevent the $350,000 amount going into a "salary paid" column.

    Both of these are business rule constraints that can, and arguably should be declared to the database.

    Another common example I see: if you design a table to hold US Addresses, and have a column called "AddressLine1" with the obvious semantics, don't make it an nvarchar(255), or you WILL get a 255-character long string of mandarin/kanji/latin characters in there... someday. Why not, instead, use the US Postal code data guidelines and make it a varchar(35)? You're much less likely to have data problems that way.

     

    In short, think about your domains (datatypes) and business rules, and tell the DBMS about them.

     

    TroyK

  • I believe his point was more along the lines of even if the user enters data that fits within all business rules (and db constraints) there is NO gaurantee that the user is putting in the correct data - typos happen!

    Darrell

  • I guess the point that I'm trying to make is that as db practicioners, we need to be sure our own house is clean before we go pointing fingers at the users... and in my experience, there's a lot of housecleaning to be done.

     

    TroyK

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply