A Google-like Full Text Search

Question

Post reply

A Google-like Full Text Search

& AND (this is the default inter-term operator)
| OR
~ NEAR
- NOT
" " (double quotes, to perform an explicit search, i.e. no inflectional searches)
+ (explicit search, similar to double-quotes, but for a single word)
* (wildcard - suffix only, prefix wildcards are not supported by fulltext.)

Here's an outline of what the proc does:

Accepts a raw input string (basically whatever the user types in the search box)
Escapes any "illegal" or otherwise dodgy characters
I use a table valued UDF to split the string into a table of individual "words"
Escape any parenthesis pairs. This is the 1st of the "function blocks" I mentioned above. It consists of a couple of while loops which look for parenthesis pairs, i.e. a "(" and matching ")". It keeps a record of these in a "tracking table".
The next function block replaces alternate search operators. For example, a user can use upper-case "OR" or the symbol "|" to perform an OR comparison. This block strips out all forms of operators and replaces them with unambiguous placeholders.
Tidy up by removing any invalid operators (e.g. duplicate operators, orphaned parentheses, etc).
Remove invalid combinations of operators, e.g. OR NOT.
Remove empty parentheses pairs.
The next function block is the wildcard handler. This finds all asterisks, makes sure they're at the end of search terms (i.e. suffix wildcards), and if not, it moves them to the end of the term to make a valid fulltext wildcard search.
The next function block is the quoted-phrase handler. This gets all valid pairs (i.e. opening and closing) of double-quotes, and marks the search term(s) contained therein as "explicit" search terms.
The final function block is the "NEAR" operator handler. This first finds any single (i.e. non-quoted) search terms that follow the NEAR or ~ operators. It then finds any quoted strings preceded by either operator as well.
Build the final fulltext command string from the table of individual terms and operator specifications. The default fulltext command applied to each individual search word/term is "FORMSOF(INFLECTIONAL, <search_term/>".

Here is some sample output generated by the function (I've formatted it as: [original raw search string] = [Full text command]):

[car wash] = [FORMSOF(INFLECTIONAL(car)) AND FORMSOF(INFLECTIONAL(wash))]
[car OR truck] = [FORMSOF(INFLECTIONAL(car)) OR FORMSOF(INFLECTIONAL(truck))]
[car | truck] = [FORMSOF(INFLECTIONAL(car)) OR FORMSOF(INFLECTIONAL(truck))]
["fire engine"] = ["fire engine"]
[virus -computer] = [FORMSOF(INFLECTIONAL(virus)) AND NOT FORMSOF(INFLECTIONAL(computer))]
[+swimming] = ["swimming"]
[honda NEAR car] = [honda NEAR car]
[honda ~ car] = [honda NEAR car]
[page*] = ["page*"]

That's about it. Obviously there are a few little tweaks and hacks along the way which are mostly specific to our requirements or environment. I would post the complete code, but am unsure whether I'm permitted to release it. Let me know if you have any specific questions though.

Cheers

Cheers,
Dave

Mike C SSC-Insane Points: 23224 More actions · Answer 1

bkv (12/10/2010)

Thanks, good article.

But there is one thing which is a bit confusing me.

When I am trying to do a search like 'test (' or like 'test and'

I am getting a error. But I could do a search in Google using that without any error.

So, the idea is that user has to be able to write anything he likes and search has to be done in any case.

What you could suggest?

Hi bkv,

The first thing to do is define what behavior you want to occur with a malformed search. For instance, what does 'test and' mean? Once you define that you can implement the behavior in the code. Alternatively you could just swallow the exception with try...catch, although the results might be inconsistent and somewhat unpredictable--a search would be performed, but it might not return the results the user actually wants.

Thanks

Michael

Mike C SSC-Insane Points: 23224 More actions · Answer 2

Mike C

SSC-Insane

Points: 23224

December 10, 2010 at 6:04 am

#1260994

Hi madlan,

No guarantees on the sort order of any SQL results unless you use the ORDER BY clause.

Thanks

Michael

bugmesh SSC Veteran Points: 216 More actions · Answer 3

bugmesh

SSC Veteran

Points: 216

February 16, 2011 at 11:35 am

#1287822

This is an outstanding article and the C# code is pretty darn cool! However, is there a way to do something similar in TSQL? I have been asked to produce a usp in TSQL that provides "Google-Like" Full Text search.

Any ideas or examples out there? Thank You in advance for all your support

Mike C SSC-Insane Points: 23224 More actions · Answer 4

Mike C

SSC-Insane

Points: 23224

February 16, 2011 at 12:39 pm

#1287878

Hi bugmesh

Unfortunately I don't know of a simple or efficient way to do this in pure T-SQL. Essentially you want a T-SQL based parser and lexical analyzer. That functionality doesn't really play to T-SQL's strengths... If I were going to do thay I'd probably use CLR for the parser/lexer.

bugmesh SSC Veteran Points: 216 More actions · Answer 5

bugmesh

SSC Veteran

Points: 216

February 16, 2011 at 1:09 pm

#1287898

I had a feeling that was the case. We were wanting something portable and allowing for code reuse across the environment. We were trying not to go the way of a CLR but may end up that way after all.

I appreciate your time and support

Mike C SSC-Insane Points: 23224 More actions · Answer 6

Mike C

SSC-Insane

Points: 23224

February 16, 2011 at 3:05 pm

#1287963

For portability across enviromments without using sql clr you might consider a middle-tier app, like a web service or something. If it absolutely has to be done in the database t-sql is probably not the best bet, to be honest. T-sql string manipulation and looping is not the most efficient mechanism.

DatabaseDave SSC Rookie Points: 48 More actions · Answer 7

A year or so ago (after reading this thread actually) I wrote a native T-SQL UDF to generate a fulltext search clause loosely based on the basic rules outlined on the Google cheat-sheet.

It took quite a while to write, and contains some pretty full-on string manipulation (which admitedly would be better written in .Net and implemented using the CLR). But I'm not a .Net developer so I wrote it in T-SQL anyway (and it was quite rewarding in the end). 🙂

Each "function block" within the UDF is encapsulated as far as possible, to make testing easier, and to allow me to easily move and re-order these functions within the UDF itself. The operators supported are listed below:

Mike C SSC-Insane Points: 23224 More actions · Answer 8

Mike C

SSC-Insane

Points: 23224

February 16, 2011 at 3:49 pm

#1287977

Very interesting! Just wondering, but how many lines of code did you end up writing for that? Did you use recursion at all? Did you go with L-R parsing?

Thanks

Michael

DatabaseDave SSC Rookie Points: 48 More actions · Answer 9

Its ~550 lines excluding comments.

It was originally written to simply allow for boolean operators (i.e. a lot simpler), but the requirements changed as the project progressed. So I can't say it's strictly L-R, it's more organic and a little messy TBH. I'll have a crack at refactoring it soon to see if I can trim it down a bit.

Cheers

Cheers,
Dave

Mike C SSC-Insane Points: 23224 More actions · Answer 10

Mike C

SSC-Insane

Points: 23224

February 17, 2011 at 5:17 am

#1288152

Very interesting. I hope you can share, I'd like to see how you handled operator precedence and all that.

panteluke SSC Enthusiast Points: 169 More actions · Answer 11

panteluke

SSC Enthusiast

Points: 169

March 30, 2011 at 12:51 pm

#1305627

Hi after a very long time.

This piece of code works really well.

Now, I have a question. If I want to include '(' or ')' in the input string, what changes do I have to make in the grammar?

I have experimented with many things (eg escape char ')' etc) but nothing seems to work.

Any ideas?

Mike C SSC-Insane Points: 23224 More actions · Answer 12

Good deal, glad to hear it's working well for you! I'm working on another project incorporating the awesome Irony library now, planning to demo it in about a week or so.

Parens handling is already built into the grammar, but they're strictly for grouping search terms and operators + changing the precedence of operators.

Unless you want to add them as valid search characters (you could search for the literal string "(quality)" with parens included, for instance)... that could be tricky for a couple of reasons. For one, you'll have to remove the parens handling from the code. That actually wouldn't be too difficult with a few tweaks.

The other--bigger--issue is that parens are part of the SQL iFTS grammar and are considered word separators by the word breakers for most languages. You'll have to do some serious work to override that behavior -- like writing your own custom word-breaker and stemmer. Neither task is impossible, but I don't believe the benefits would outweigh the costs in time and effort. Do you have a specialized application that requires special parens handling?

Thanks

Mike C

SQLRNNR SSC Guru Points: 281324 More actions · Answer 13

Great stuff presented in this article. I am working on a project that may benefit from this - have to see how it plays out.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

panteluke SSC Enthusiast Points: 169 More actions · Answer 14

panteluke

SSC Enthusiast

Points: 169

March 31, 2011 at 12:17 am

#1305795

Thank you Mike,

I don't have a special application, I just have complaints when they search with "for instance)" 🙂

NMTeck Grasshopper Points: 14 More actions · Answer 15

Mike, first off good article. It opened my eyes to so many other possibilities and efficient ways to use the search features in my db. I have a two part question. I'm implementing this search into an existing vb.net project. I believe that the only file I needed to copy over and edit was the SearchGrammar file. My questions are:

1. In my project, if I reference the Irony dll you included with the sample and converted SearchGrammar.cs to a vb file, would that include all I need from your project to make the changes to my current search?

2. If the above would work, would I even need to convert the cs file or would this (http://pietschsoft.com/post/2006/03/30/ASPNET-20-Use-VBNET-and-C-within-the-App_Code-folder.aspx) be a possibility for including the file directly into a vb project.

Thanks for any help you can provide.