A Google-like Full Text Search

  • bkv (12/10/2010)


    Thanks, good article.

    But there is one thing which is a bit confusing me.

    When I am trying to do a search like 'test (' or like 'test and'

    I am getting a error. But I could do a search in Google using that without any error.

    So, the idea is that user has to be able to write anything he likes and search has to be done in any case.

    What you could suggest?

    Hi bkv,

    The first thing to do is define what behavior you want to occur with a malformed search. For instance, what does 'test and' mean? Once you define that you can implement the behavior in the code. Alternatively you could just swallow the exception with try...catch, although the results might be inconsistent and somewhat unpredictable--a search would be performed, but it might not return the results the user actually wants.

    Thanks

    Michael

  • Hi madlan,

    No guarantees on the sort order of any SQL results unless you use the ORDER BY clause.

    Thanks

    Michael

  • This is an outstanding article and the C# code is pretty darn cool! However, is there a way to do something similar in TSQL? I have been asked to produce a usp in TSQL that provides "Google-Like" Full Text search.

    Any ideas or examples out there? Thank You in advance for all your support

  • Hi bugmesh

    Unfortunately I don't know of a simple or efficient way to do this in pure T-SQL. Essentially you want a T-SQL based parser and lexical analyzer. That functionality doesn't really play to T-SQL's strengths... If I were going to do thay I'd probably use CLR for the parser/lexer.

  • I had a feeling that was the case. We were wanting something portable and allowing for code reuse across the environment. We were trying not to go the way of a CLR but may end up that way after all.

    I appreciate your time and support

  • For portability across enviromments without using sql clr you might consider a middle-tier app, like a web service or something. If it absolutely has to be done in the database t-sql is probably not the best bet, to be honest. T-sql string manipulation and looping is not the most efficient mechanism.

  • A year or so ago (after reading this thread actually) I wrote a native T-SQL UDF to generate a fulltext search clause loosely based on the basic rules outlined on the Google cheat-sheet.

    It took quite a while to write, and contains some pretty full-on string manipulation (which admitedly would be better written in .Net and implemented using the CLR). But I'm not a .Net developer so I wrote it in T-SQL anyway (and it was quite rewarding in the end). 🙂

    Each "function block" within the UDF is encapsulated as far as possible, to make testing easier, and to allow me to easily move and re-order these functions within the UDF itself. The operators supported are listed below:

  • & AND (this is the default inter-term operator)
  • | OR
  • ~ NEAR
  • - NOT
  • " " (double quotes, to perform an explicit search, i.e. no inflectional searches)
  • + (explicit search, similar to double-quotes, but for a single word)
  • * (wildcard - suffix only, prefix wildcards are not supported by fulltext.)
  • Here's an outline of what the proc does:

  • Accepts a raw input string (basically whatever the user types in the search box)
  • Escapes any "illegal" or otherwise dodgy characters
  • I use a table valued UDF to split the string into a table of individual "words"
  • Escape any parenthesis pairs. This is the 1st of the "function blocks" I mentioned above. It consists of a couple of while loops which look for parenthesis pairs, i.e. a "(" and matching ")". It keeps a record of these in a "tracking table".
  • The next function block replaces alternate search operators. For example, a user can use upper-case "OR" or the symbol "|" to perform an OR comparison. This block strips out all forms of operators and replaces them with unambiguous placeholders.
  • Tidy up by removing any invalid operators (e.g. duplicate operators, orphaned parentheses, etc).
  • Remove invalid combinations of operators, e.g. OR NOT.
  • Remove empty parentheses pairs.
  • The next function block is the wildcard handler. This finds all asterisks, makes sure they're at the end of search terms (i.e. suffix wildcards), and if not, it moves them to the end of the term to make a valid fulltext wildcard search.
  • The next function block is the quoted-phrase handler. This gets all valid pairs (i.e. opening and closing) of double-quotes, and marks the search term(s) contained therein as "explicit" search terms.
  • The final function block is the "NEAR" operator handler. This first finds any single (i.e. non-quoted) search terms that follow the NEAR or ~ operators. It then finds any quoted strings preceded by either operator as well.
  • Build the final fulltext command string from the table of individual terms and operator specifications. The default fulltext command applied to each individual search word/term is "FORMSOF(INFLECTIONAL, <search_term/>".
  • Here is some sample output generated by the function (I've formatted it as: [original raw search string] = [Full text command]):

  • [car wash] = [FORMSOF(INFLECTIONAL(car)) AND FORMSOF(INFLECTIONAL(wash))]
  • [car OR truck] = [FORMSOF(INFLECTIONAL(car)) OR FORMSOF(INFLECTIONAL(truck))]
  • [car | truck] = [FORMSOF(INFLECTIONAL(car)) OR FORMSOF(INFLECTIONAL(truck))]
  • ["fire engine"] = ["fire engine"]
  • [virus -computer] = [FORMSOF(INFLECTIONAL(virus)) AND NOT FORMSOF(INFLECTIONAL(computer))]
  • [+swimming] = ["swimming"]
  • [honda NEAR car] = [honda NEAR car]
  • [honda ~ car] = [honda NEAR car]
  • [page*] = ["page*"]
  • That's about it. Obviously there are a few little tweaks and hacks along the way which are mostly specific to our requirements or environment. I would post the complete code, but am unsure whether I'm permitted to release it. Let me know if you have any specific questions though.

    Cheers

    Cheers,
    Dave

  • Very interesting! Just wondering, but how many lines of code did you end up writing for that? Did you use recursion at all? Did you go with L-R parsing?

    Thanks

    Michael

  • Its ~550 lines excluding comments.

    It was originally written to simply allow for boolean operators (i.e. a lot simpler), but the requirements changed as the project progressed. So I can't say it's strictly L-R, it's more organic and a little messy TBH. I'll have a crack at refactoring it soon to see if I can trim it down a bit.

    Cheers

    Cheers,
    Dave

  • Very interesting. I hope you can share, I'd like to see how you handled operator precedence and all that.

  • Hi after a very long time.

    This piece of code works really well.

    Now, I have a question. If I want to include '(' or ')' in the input string, what changes do I have to make in the grammar?

    I have experimented with many things (eg escape char ')' etc) but nothing seems to work.

    Any ideas?

  • Good deal, glad to hear it's working well for you! I'm working on another project incorporating the awesome Irony library now, planning to demo it in about a week or so.

    Parens handling is already built into the grammar, but they're strictly for grouping search terms and operators + changing the precedence of operators.

    Unless you want to add them as valid search characters (you could search for the literal string "(quality)" with parens included, for instance)... that could be tricky for a couple of reasons. For one, you'll have to remove the parens handling from the code. That actually wouldn't be too difficult with a few tweaks.

    The other--bigger--issue is that parens are part of the SQL iFTS grammar and are considered word separators by the word breakers for most languages. You'll have to do some serious work to override that behavior -- like writing your own custom word-breaker and stemmer. Neither task is impossible, but I don't believe the benefits would outweigh the costs in time and effort. Do you have a specialized application that requires special parens handling?

    Thanks

    Mike C

  • Great stuff presented in this article. I am working on a project that may benefit from this - have to see how it plays out.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thank you Mike,

    I don't have a special application, I just have complaints when they search with "for instance)" 🙂

  • Mike, first off good article. It opened my eyes to so many other possibilities and efficient ways to use the search features in my db. I have a two part question. I'm implementing this search into an existing vb.net project. I believe that the only file I needed to copy over and edit was the SearchGrammar file. My questions are:

    1. In my project, if I reference the Irony dll you included with the sample and converted SearchGrammar.cs to a vb file, would that include all I need from your project to make the changes to my current search?

    2. If the above would work, would I even need to convert the cs file or would this (http://pietschsoft.com/post/2006/03/30/ASPNET-20-Use-VBNET-and-C-within-the-App_Code-folder.aspx) be a possibility for including the file directly into a vb project.

    Thanks for any help you can provide.

  • Viewing 15 posts - 121 through 135 (of 166 total)

    You must be logged in to reply to this topic. Login to reply