Fuzzy query...

  • Not entirely sure if this is possible in 2000, and I think I read somewhere you can do this in 2008; we're migrating to 2008 in 3 weeks, but I need something now (of course).

    What I am looking for is a way to capture non-conformances based on Employee Name. However, the field that the name goes into is a text field (and has to be). Issue is if I get

    Chris Smith

    C. Smith

    Smith, Chris

    Christopher Smith

    obviously my group by statement treats each one individually. I know that I can use a Case Statement, but I really hate hardcoding that sort of thing, and am too lazy to have to constantly maintain them :D.

    Is there a way (easy or not) to handle this in 2000, or do I have to wait till we get 2008 up and running?

    Any ideas would be greatly appreciated.....

  • What if name variations

    Chris Smith

    C. Smith

    Smith, Chris

    belong actually to Christine Smith, not to Christopher Smith?

    _____________
    Code for TallyGenerator

  • Not sure this will work for your case, but you can lookup SOUNDEX and see if that gets you anywhere. Documentation here: http://msdn.microsoft.com/en-us/library/aa259235(SQL.80).aspx

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Hi Sergiy - Yeah, I already thought of that; what I'm going to have to do is (I think) run a Monthly details report to go along with the counts. That way management can see that C. Smith, Chris Smith and Christopher Smith are all being lumped together. If in fact C. Smith is Christine, then I'll have to think of an alternative (as of right now, all our employees have different last names so I'm lucky, but I can't keep running on luck....).

    Jeffrey Williams, I think this is exactly what I need, and thanks for the reply - I'll do some testing today and see what I come up with.

    Thanks again

    Chris

    **Edit** - I have played around with Soundex, and it's unfortunately not as specific as I was hoping.

    select soundex('john smith')

    select soundex('j. smith')

    select soundex('smith, j')

    select soundex('johnny smith')

    gives

    J500

    J000

    S530

    J500

    I'm OK with Smith, J not being included (looking at the data, there's actually not that many lastname, firstname combinations so I can rearrange those with a case statement). Problem I'm having is, embarassingly enough, exactly what Sergiy referred to - Jane Smith is coming up as J500, which again is not a big deal (no doubles on last names in my databaes). However, Jane Small does come up with J500, and that does pose a problem.

    I've poked around and found the DBA toolkit through SQL Server Central, but the way to handle this sort of thing sounds more complicated than I can get into today. Any other ideas, or is there a way I can handle these? I've tried soundex(lastname+lastname) and a bunch of other screwy things but no such luck.

    Thanks

    Chris

  • Yeah - I was afraid of that. And, no - I don't have any other suggestions - sorry.

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Thanks anyway Jeffrey, I appreciate your time (and I did learn about Soundex, which is a cool feature....just not for this ;-))

    have a great one

    Chris

    (edited for grammar)

  • does any one have *any* ideas?? I'm starting to get huge pressure about this report from my boss...

    thanks

    chris

  • It sounds like you don't have all that much data that you're working with. Obviously a case statement explicity listing all the different variances would be out, but here are your options as I see them:

    1. Fix the Data

    2. Add another table that actually *Does* have the data normalized in it. Have an identity (or other unique field if you wish) in that table. Add that identity column to all of the other tables

    3. Create a mapping table instead of a large case statement and run through that to map variances of a name to their real names.

    4. Fix the Data

    5. Create a text parser that rearranges names to get you at least close. (IE. Match J. Doe to John Doe based on the Period, match Doe, John to John Doe based on rearranging the name when a comma is present, etc.) This is obviously only as good as your imagination for all the different variances and wouldn't be 100% accurate, but is probably one of your only options if you have a huge amount of data and you can't add fields or fix it and don't want to maintain it.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • What I am looking for is a way to capture non-conformances based on Employee Name. However, the field that the name goes into is a text field (and has to be)

    Chris, just because it has to be a text field, does not mean it has to accept erroneous or ambiguous data. The first thing I would do would be to explain to your boss why what he/she is asking for is not as easy as it would seem. Your company's data is one of it's most important assets, and, sadly, many managers have no concept of what this means. Can you put some validation code into the form where the names are entered? If not, not only are you going to be dealing with data quality issues forever (imagine when they ask you to do a match based on addresses typed in a text box), you are giving up an oportunity to have your boss be your ally when it comes to enforcing good data policy. I know you were looking for a code based solution to your problem, but I honestly believe you will be better served fixing your data problem, not hiding it.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg Snidow (4/2/2010)


    What I am looking for is a way to capture non-conformances based on Employee Name. However, the field that the name goes into is a text field (and has to be)

    Chris, just because it has to be a text field, does not mean it has to accept erroneous or ambiguous data. The first thing I would do would be to explain to your boss why what he/she is asking for is not as easy as it would seem. Your company's data is one of it's most important assets, and, sadly, many managers have no concept of what this means. Can you put some validation code into the form where the names are entered? If not, not only are you going to be dealing with data quality issues forever (imagine when they ask you to do a match based on addresses typed in a text box), you are giving up an oportunity to have your boss be your ally when it comes to enforcing good data policy. I know you were looking for a code based solution to your problem, but I honestly believe you will be better served fixing your data problem, not hiding it.

    I agree with this completely. This would be the best choice if you had that option... for some reason I'm assuming you don't. But if you do, do that! :hehe:

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • I'll just reply to everyone's points all at once...

    Garadin (4/2/2010)


    1. Fix the Data

    Can't - or at least it's not up to me - I'd have to go to our developers to have them do this, and that would become an ongoing maintenance issue with associated costs (even though a job could be set up etc etc)

    Garadin (4/2/2010)


    2. Add another table that actually *Does* have the data normalized in it.

    3. Create a mapping table instead of a large case statement and run through that to map variances of a name to their real names.

    Because I am not able to create tables myself, again I would have to go to the developers, and this would require my boss's approval blah blah blah.

    Garadin (4/2/2010)


    4. Fix the Data

    LOL Repetition indicates emphasis...

    Garadin (4/2/2010)


    5. Create a text parser that rearranges names to get you at least close.

    I think this may be my only option...

    Greg Snidow (4/2/2010)


    Chris, just because it has to be a text field, does not mean it has to accept erroneous or ambiguous data. The first thing I would do would be to explain to your boss why what he/she is asking for is not as easy as it would seem.

    Because of the way the our process is set up, we are unfortunately stuck with having it as a free flow text field. We have a high rate of employee turn around, and so having validation based on Employee Name would be impossible. Having an extra field (First Name / Last Name) would require development costs and a business case. So unfortunately, I am stuck with this and although my boss understands that this is not easy, his response would be "Well that's why I gave it to you - you always figure this stuff out".

    Greg Snidow (4/2/2010)


    If not, not only are you going to be dealing with data quality issues forever (imagine when they ask you to do a match based on addresses typed in a text box), you are giving up an oportunity to have your boss be your ally when it comes to enforcing good data policy. I know you were looking for a code based solution to your problem, but I honestly believe you will be better served fixing your data problem, not hiding it.

    I totally hear you - and will attempt once again to talk to my boss about a good data policy. (Asking this before searching) - Do you know of any good blogs or articles that talk about data policies, and how to put one together? I know it should be something simple like "If we do it this way, it will make this easier", but if there's information out there I'd love to get my hands on it. And thing is, my boss is pretty good at "getting" what I tell him, and so I'm pretty sure if I come to him with something he'll agree (whether or not we have the $$$ to pay for the changes, well that's another thing...).

    Thanks again for your time. It is really appreciated!

    Chris

  • This may can help you, i used ascii code unqiue value instead of soundex code, for sql 2000 need minor changes in code.

    DECLARE @position int,

    @string char(15),

    @val2 varchar(MAX)=''

    SET @position = 1

    SET @string = 'smith, j'

    WHILE @position <= DATALENGTH(@string)

    BEGIN

    SET @val2 = LTRIM(@val2) + LTRIM(( SELECT STR(ASCII(SUBSTRING(@string, @position, 1))) ))

    SET @position = @position + 1

    END

    SELECT @val2

  • As far as a good blog / post to show why fixing the data at entry is the right way to do it, I am not sure that there is one out there. Advantages of a proper solution include accurate and consistent reporting and cost savings.

    How important is this (to your manager) that the report be accurate? How long would it take you to manually scrub this data each time the report needs to be ran? How many times will he / she need to view and updated version of this report? Feel free to propose data - related solution... manual scrubbing every time, create a "cross reference" table that ties used aliases to actual employees (include time estimates to build and initially populate the solution, have business users validate the data, and time estimates over time to keep the data up to date as people add new variations... such as crhis), and then estimate how much effort it would take for your developers to limit what can be entered to an approved list (and possibly develop a front end for the end users to maintain this list). Time = Money, so communicating costs is easy. Estimate accuracy, and give examples of exceptions with each system. In my experience, management lives and breathes off of numbers. They do not need to understand the details of an implemented solution, they need to understand the costs and risks and pick one based on that. It is your job to not only understand but to explain what the options are, what the costs are, and what the risks are.

    At least, that is how I approach every new complex request.

    I was going to give just my two cents... but that looks more like three :hehe:

  • This is a real shot in the dark since you seem to not have much say in the actual data collection process, but could you record the employee's "number" instead of name in this field and report from those?

    ______________________________________________________________________

    Personal Motto: Why push the envelope when you can just open it?

    If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

    Jason L. Selburg
  • ...i've spoken with my boss and we're going to go with a CASE statement for now, combined with looking at the actual data. Depending on the change on the stats, we'll be putting together a business case to have the developers make the changes. Thanks ll for your time!!

    Chris

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply