Likesearches in large SQL tables

I’ve had several clients now that have requested a stored procedure that will return records based on key words entered by a user. There can be one or many key words and all have to match in order to match the item or product. I used to do this using a LIKE ‘%keyword1%’ AND LIKE ‘%keyword2%’ type of query using Dynamic SQL.

Using the LIKE operator like this on a large table can make for a long running query since no indexes can be used. A search query on an auto parts table with 1.2 million items took over 30 seconds to run for one of my clients. I developed a technique that would index each word in the item description in another table and use that for a search procedure. The average run time on that procedure is now 1.5 seconds.

In this article I’m going to use the AdventureWorks database to build such a table and show how to use it. Keep in mind that there isn’t much performance gain in this example since AdventureWorks had just over 500 products in it. ( I would never bother with this technique on a table that small.)

There are 2 tables that you have to create: KeyWordItems and KeyWordsToExclude. The first table has the primary key of the product or item table – in this case ProductID from Production.Product. The second table contains words that you want to ignore. This could be ‘a’, ‘an’, ‘the’ and so forth. Another type of word to ignore in most cases would be any single character word, which would exclude such things as ‘&’ or “#” if there are spaces on either side.

Here is the code to create and populate the KeyWordsToExclude table:

CREATE TABLE KeyWordsToExclude
( KeyWord VARCHAR(50) NOT NULL )
GO
ALTER TABLE KeyWordsToExclude
ADD CONSTRAINT PK_KeyWordsToExclude
PRIMARY KEY CLUSTERED (KeyWord)
GO

— Populate the KeyWordsToExclude table with whatever you don’t want to
— use.
INSERT INTO KeyWordsToExclude
( KeyWord )
SELECT ‘A’ UNION ALL
SELECT ‘AN’ UNION ALL
SELECT ‘THE’
GO
Here is the KeyWordItems table:

CREATE TABLE KeyWordItems
( KeyWord VARCHAR(50) NOT NULL
, ProductID INT NOT NULL )
GO

ALTER TABLE KeyWordItems
ADD CONSTRAINT PK_KeyWordItems
PRIMARY KEY CLUSTERED (KeyWord, ProductID)
GO
In order to populate the KeyWordItems table, you should have Jeff Moden’s table valued function – DelimitedSplit8KNEW. It is lightning fast and I’m including it here for simplicity (I hope Jeff doesn’t mind).

CREATE FUNCTION dbo.DelimitedSplit8KNEW
–===== Created by Jeff Moden (Prototype: Testing Still in Progress)
–===== Define I/O parameters
(
@pString VARCHAR(8000),
@pDelimiter CHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), –10
E2(N) AS (SELECT 1 FROM E1 a, E1 b), –100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), –10,000
cteTally(N) AS (
SELECT 0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E4
)
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY t.N),
ItemValue = SUBSTRING(@pString,t.N+1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0),DATALENGTH(@pString)+1)-t.N-1)
FROM cteTally t
WHERE t.N BETWEEN 0 AND DATALENGTH(@pString)
AND (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
;
GO
Now we come to actually populating the KeyWordItems table. There are a few quirks in the AdventureWorks product names. Slashes are used to separate some words – e.g. ‘LL Mountain Seat/Saddle’. Also some product names have commas after a word – e.g. ‘Mountain Bike Socks, M’. This type of punctuation is not likely to be entered by a user searching by product name and so needs to be ignored. In the code below there are 2 REPLACE functions that will replace a ‘/’ with a space (which makes it a separate word) and strip out commas.

SELECT
RecID = IDENTITY (INT, 1, 1)
, CAST(P.ProductID AS INT) AS ProductID, Split.ItemValue AS KeyWord
INTO #PW — ProductWords
FROM Production.Product P
CROSS APPLY (SELECT ItemValue FROM dbo.DelimitedSplit8KNEW(REPLACE(REPLACE(P.Name, ‘/’, ‘ ‘), ‘,’, ”), ‘ ‘)) AS Split
WHERE LEN(ItemValue) > 1

— Delete the words that are in KeyWordsToExclude (This could have been
— included
— above but done as a separate step to keep the SQL simple. In
— AdventureWorks there aren’t any product names that will be excluded.
DELETE PW
FROM #PW PW
INNER JOIN KeyWordsToExclude KWE ON
PW.KeyWord = KWE.KeyWord

— Delete any product key words with the same word twice. Again
— AdventureWorks product names don’t have the same word twice in them.
— Your data may differ.
DELETE PW
FROM #PW PW
INNER JOIN
(SELECT MIN(RecID) AS MinRecID, ProductID, KeyWord
FROM #PW
GROUP BY ProductID, KeyWord
HAVING COUNT(*) > 1
) AS X ON
PW.ProductID = X.ProductID
AND PW.KeyWord = X.KeyWord
AND PW.RecID X.MinRecID

— Populate the KeyWordItems table.
INSERT INTO KeyWordItems
( ProductID, KeyWord )
SELECT
ProductID, KeyWord
FROM #PW
It is VERY important to note that the SAME logic that is used to populate the KeyWordItems table must be used when handling the user input – i.e. the same punctuation handling, the same ignored words, the same restriction on no single letter words, etc. Otherwise you may end up where something that should match that won’t match.

Once the key words table has been created it will have to be maintined either in a procedure or trigger when an update occurs to the column you are using for the words (item description, product name, etc.). If any change is made to that column the procedure or trigger should delete and re-add the key words using the same logic that was used to initially populate it.

Now for some actual queries.

Here is a typical query for 2 key words – usually created in Dynamic SQL – that would search the Products table.

SELECT
ProductID, Name, Color, Size, ListPrice
FROM Production.Product
WHERE
Name LIKE ‘%nut%’
AND Name LIKE ‘%hex%’This would return something like:

Here are the actual stats from SET STATISTICS TIME ON and SET STATISTICS TIME ON which was set before running the query above:

Table ‘Product’. Scan count 1, logical reads 23, physical reads 3
, read-ahead reads 0, lob logical reads 0, lob physical reads 0
, lob read-ahead reads 0.

SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 270 ms.

The important point above is the table scan on the Product table. This is what the query plan for the above query looks like:

Here is the procedure that will do the search using the new key word tables:

CREATE PROCEDURE ItemSearch
@KeyWords VARCHAR(256)
AS

SET NOCOUNT ON;

— We need to know how many words the user entered for the search.
DECLARE
@KWCount INT

— Temp table of passed in key words
CREATE TABLE #KW
( RecID INT IDENTITY(1, 1)
, KeyWord VARCHAR(50))

— Same function and same logic used to populate KeyWordItems table.
INSERT INTO #KW
( KeyWord )
SELECT ItemValue
FROM dbo.DelimitedSplit8KNEW(REPLACE(REPLACE(@KeyWords,’/’,’ ‘),’,’,”),’ ‘) AS Split
WHERE NOT EXISTS
(SELECT 1 FROM KeyWordsToExclude KWE
WHERE KWE.KeyWord = Split.ItemValue)

— Delete any duplicate words. Same as was done for the KeyWordItems
— table.
DELETE KW
FROM #KW KW
INNER JOIN
(SELECT MIN(RecID) AS MinRecID, KeyWord
FROM #KW
GROUP BY KeyWord
HAVING COUNT(*) > 1
) AS X ON
KW.KeyWord = X.KeyWord
AND KW.RecID X.MinRecID

— Number of words in the temp table.
SELECT @KWCount = X.RecCount
FROM
(SELECT COUNT(*) AS RecCount FROM #KW
) AS X

— Create a little stub table from the temp table joined to the
— KeyWordItems table. Using this stub table avoids
— joining to the larger Product table multiple times and then
— aggregating the results. The GROUP BY also does a sort on ProductID
— so that this stub table and the main Product table are eligible
— for a MERGE JOIN. You may have to index it in order to get a
— MERGE in the query plan.

— The HAVING part means that the product name must match the #KW table
— the same number of times as the count of words. This produces the
— same effect as the AND operator in the original LIKE query.

SELECT
KWI.ProductID
INTO #KWP — Stub table of Products with the matching key words.
FROM #KW
INNER JOIN KeyWordItems KWI ON
#KW.KeyWord = KWI.KeyWord
GROUP BY KWI.ProductID
HAVING COUNT(*) = @KWCount

— Return the requested Products
SELECT
P.ProductID, P.Name, P.Color, P.Size, P.ListPrice
FROM #KWP KWP
INNER JOIN Production.Product P ON
KWP.ProductID = P.ProductID
GO
Now we run the stored procedure for the same key words:

EXEC ItemSearch
@KeyWords = ‘nut hex’
This produces the same results as the original LIKE query. Here are the stats for the procedure:

Table ‘Product’. Scan count 0, logical reads 78, physical reads 2
, read-ahead reads 0, lob logical reads 0, lob physical reads 0
, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 175 ms.

The important part of this is that there is no Scan count on the Product table. Here is what the actual query plan looks like:

Note that we get a Clustered Index Seek on the Product table instead of a scan.

As I mentioned before, since AdventureWorks only has just over 500 products, this technique doesn’t do much. When I used it against an item table of auto parts of over 1.2 million the performance differences was astounding.

This technique is most useful for searches where all criteria (and AND condition) must be met in order to qualify for a match. I hope someone finds this useful.

This entry was posted in Access 2007 problems/solutions, Uncategorized and tagged . Bookmark the permalink.