Spell Check In SQL

I was tasked yesterday with writing a spell checker for an up-and-coming application. I completed the meat of the functionality today, done in sql. It's actually very acurate, pulling most words within the first 15 suggestions, but many within the first 5. It uses a table comprised of 62,070 words (a dictionary I imported from a flat file word list), which speaks to the procedure's accuracy as there are many names, cities, acronyms, etc., with the corresponding soundex values. I could have calculated the soundex values on the fly, but thought it would be more efficent having them stored as the words will not be edited. Even if they are going to be, I will require editing to be done through a stored procedure which will calculate the soundex value for the word and store it as well. The procedure will pull, sort and return over 400 records in about one second. The procedure pulls all words with a similar soundex value and with a similar structure and sorts them alphabetically in relation to the misspelled word.

The assumptions that were made in creating this function were 1) the majority of the words misspelled would be longer ones and 2) the misspellings would occur in the middle or towards the end of the word. The procedure takes the misspelled word, the precision value and has an output variable containing the suggestions (a comma delimited string) of up to 5000 characters. The precision value (a number between 1 and 3, 1 being the most precise, 3 returning the most suggestions) will effect how the procedure searches the table for the words, which in turn, affects the number of choices. For example, the soundex value of the misspelled word "dogg" is D200. So, using a precision value of 1, the procedure will search for anything with a soundex value of D20% (where percent is anything, in this case D20-D209). With a precision value of 2, it will search for words with the soundex value of D2% (D200 - D299). The higher the precision value, the more words it will return. With a precision value of 3, the procedure will return all the words that have a soundex value of D%. This means, all the words that start with the letter D. Obviously, you don't want a precision value of 4 (% or every word in the dictionary, ouch).

The soundex value is an alpha-numeric representaion of how the word sounds. The words "dawg" and "dog" have the same soundex value, D200. It's not perfect, and does nothing to sort the words. So I also searched for words that fit the structure. The procedure will look for words that match, with a precision value of 1, do% and %og. So this will return words with the same structure. Again, with a precision value of 2, it will return anything matching d%, or %g, a ton of words obviously. In most cases, one shouldn't have to use anything but the precision value of 1. The function will then sort the words. It takes the misspelled word and looks for words in the result set matching do%. It adds them to the comma delimited string, in alphabetical order, then deletes them from the result set. It then searches the remaining for d% and adds those. For a larger word, such as apple, it will search the result set for appl%, app%, ap%, then a% making the string in a specific to general order. So the finished product is a combination of soundex matches, structure matches ordered by structural similarity to the original, misspelled word.

Here is the function:


IF OBJECT_ID('spGetSuggest') IS NOT NULL
DROP PROC spGetSuggest
GO
--Jacob Steelsmith
--05/01/06
--spGetSuggest
--
--Stored procedure used to get a word
--with a similar soundex value.
--@nvchWord the misspelled word.
--@intPrecision - number representing the
----the range of the search. 1 returns the
----most precise search results (the least
----possible number of matches). 3 returns
----everything that begins with the first
----letter of the misspelled word.
--@vchReturnString - A comma delimted string
----that contains the possible values up to
----5000 characters.
CREATE PROC spGetSuggest
@nvchWord NVARCHAR(50),
@intPrecision INTEGER = 1,
@vchReturnString VARCHAR(5000) OUTPUT
AS

BEGIN TRAN GetWord

--declare vars
DECLARE @vchReturnValues VARCHAR(5000)
DECLARE @intCounter INTEGER
DECLARE @vchWordSoundex VARCHAR(5)
DECLARE @intCounter2 INTEGER
DECLARE @nvchTempWord NVARCHAR(50)

--default the counter
SET @intCounter = 0

--get the soundex of the word to compare
SET @vchWordSoundex =
(SOUNDEX(@nvchWord))

--default the return values
SET @vchReturnValues = ''

--default the soundex to 3 (we don't want
--a larger value than that. a larger value
--will pull all the words from the dictionary)
IF(@intPrecision > 3)
	SET @intPrecision = 3

--make sure the word isn't blank
IF(@nvchWord != '')
BEGIN

	--trim the end of the soundex value by the precision value
	SET @vchWordSoundex = LEFT(@vchWordSoundex, (4 - @intPrecision))

	--insert % for each number in the precision value
	--so if the soundex is T567 and the precision is 1, then the new
	--soundex value will be T56%. If the precision is 2, then the new
	--value is T5%%
	WHILE(@intCounter < @intPrecision)
	BEGIN
		SET @vchWordSoundex = @vchWordSoundex + '%'
		SET @intCounter = @intCounter + 1
	END

	--load the temp table with possible matches.
	--we're searching for, and words that match the left side of the word, minus
	--a character for each precision value, and words that match the right side
	--of the word minus a character for each precision value.
	SELECT * INTO #tmpWordTable
	FROM tblWords
	WHERE vchSoundex LIKE @vchWordSoundex
	OR nvchWord LIKE (LEFT(@nvchWord, LEN(@nvchWord) - @intPrecision) + '%')
	OR nvchWord LIKE ('%' + RIGHT(@nvchWord, LEN(@nvchWord) - @intPrecision))

	--setup the counter
	SET @intCounter2 = 1	

	--loop through the length of the word less one character
	WHILE(@intCounter2 < LEN(@nvchWord))
	BEGIN
		--get the words from the temp table
		--that are like the word minus the counter
		--with a % added to it. For example, if the
		--word is duck and the counter is at two, the
		--search value will be du%
		DECLARE getWords CURSOR FOR
		(SELECT DISTINCT nvchWord
		FROM #tmpWordTable
		WHERE nvchWord LIKE (LEFT(@nvchWord, LEN(@nvchWord) - @intCounter2) + '%'))

		--initialize the cursor
		OPEN getWords
		FETCH NEXT FROM getWords
		INTO @nvchTempWord

		--loop through the selected words
		WHILE(@@FETCH_STATUS = 0)
		BEGIN

			--append the word string with the word, then a comma
			SET @vchReturnValues = @vchReturnValues + @nvchTempWord + ','

			--delete the word from the temp table
			DELETE FROM #tmpWordTable
			WHERE nvchWord = @nvchTempWord

			--get the next word in the selection
			FETCH NEXT FROM getWords
			INTO @nvchTempWord
		END

		--deinitialize the cursor
		CLOSE getWords
		DEALLOCATE getWords

		--increment the counter
		SET @intCounter2 = @intCounter2 + 1
	END	

	--trim the extra comma
	SET @vchReturnValues = LEFT(@vchReturnValues, (LEN(@vchReturnValues) - 1))

	--set the return value
	SET @vchReturnString = @vchReturnValues

	--drop the temp table
	DROP TABLE #tmpWordTable
END

COMMIT TRAN GetWord
GO

This procedure will be used by another procedure that checks for the word, then if it is misspelled, gets the suggestions, or returns a misspelled flag if the caller simply wants to know it's misspelled.