Using AI to Enhance Keyword Searches

The newest option available in both the Data Portal plugin and the API is AI-enhanced keyword search. It can be enabled under the Search Results panel or by setting “TermMatch”: “ai” in the API. This feature greatly improves the relevance of your search results and supports most natural language queries as well as queries with spelling errors.
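
As a rough illustration of the API option, the sketch below builds a JSON request body with the setting switched on. Only the “TermMatch”: “ai” pair comes from this article. The "Keywords" field name and the build_search_request helper are assumptions made for the example, so check the API reference for the actual request shape.

```python
import json


def build_search_request(keywords: str, use_ai: bool = True) -> str:
    """Return a JSON request body for a keyword search (hypothetical shape)."""
    # "Keywords" is an assumed field name, not taken from the API reference.
    body = {"Keywords": keywords}
    if use_ai:
        # This is the setting described above.
        body["TermMatch"] = "ai"
    return json.dumps(body)


request_body = build_search_request("I am hungry")
print(request_body)
```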

How It Works

We have calculated embeddings for the taxonomy terms using OpenAI. Embeddings are numerical representations of text that are designed to be consumed by machine learning models and semantic search algorithms. In most cases, an embedding is a list of numbers arranged in a vector, so that texts with similar meanings have vectors that point in similar directions.
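
To make the idea of comparing embeddings concrete, here is a minimal sketch using cosine similarity, a common way to compare two vectors. The three-number vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), and this article does not state which similarity measure the tool actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings, not real model output.
food = [0.9, 0.1, 0.0]
meals = [0.8, 0.2, 0.1]
housing = [0.0, 0.2, 0.9]

print(cosine_similarity(food, meals))    # high: related terms
print(cosine_similarity(food, housing))  # low: unrelated terms
```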

When running a keyword search, we use OpenAI to calculate the embedding of the query. We then compare it to the embeddings of the taxonomy terms and find the most similar terms. Additionally, we compare the returned terms to each other and eliminate the most dissimilar ones.
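
The two steps above (pick the terms closest to the query, then drop the terms that do not fit with the rest) might be sketched as follows. This is an illustration under assumptions, not the production code: the vectors are made up, "Kitchenware" and "Housing" are hypothetical terms, and the function names, thresholds, and the mean-pairwise-similarity rule for eliminating outliers are all choices made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_terms(query_vec, term_vecs, max_terms=5, min_similarity=0.5):
    """Return up to max_terms term names, most similar to the query first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in term_vecs.items()]
    scored = [(name, s) for name, s in scored if s >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:max_terms]]


def drop_outliers(names, term_vecs, min_mean_similarity=0.8):
    """Keep terms whose average similarity to the other terms is high enough."""
    if len(names) < 3:
        return list(names)  # too few terms to judge which one is the outlier
    kept = []
    for name in names:
        others = [cosine(term_vecs[name], term_vecs[o]) for o in names if o != name]
        if sum(others) / len(others) >= min_mean_similarity:
            kept.append(name)
    return kept


# Made-up toy embeddings, not real model output.
term_vecs = {
    "Food": [0.9, 0.1, 0.0],
    "Meals": [0.8, 0.2, 0.1],
    "Emergency Food": [0.85, 0.15, 0.05],
    "Kitchenware": [0.5, 0.0, 0.8],
    "Housing": [0.0, 0.2, 0.9],
}
query_vec = [0.9, 0.2, 0.1]

candidates = match_terms(query_vec, term_vecs)
final_terms = drop_outliers(candidates, term_vecs)
print(candidates)
print(final_terms)
```

In this toy data, "Housing" falls below the minimum similarity to the query, while "Kitchenware" passes that check but is removed in the second step because it is unlike the other returned terms.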

Once we have the taxonomy terms that are relevant to the user’s keyword search, they are added to our existing search algorithm, which then retrieves matches using both traditional keyword matching and the matching taxonomy terms. The retrieved results are scored on how well they match, using our existing scoring algorithm with a few minor changes.
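
The sketch below shows one way keyword matches and taxonomy matches could be combined into a single score. The actual scoring algorithm is not described in this article, so the weights, the record fields, and the score_record function are hypothetical and only illustrate the general idea.

```python
def score_record(record, keywords, matched_terms,
                 name_weight=3.0, text_weight=1.0, term_weight=1.0):
    """Score a record from keyword hits plus taxonomy-term hits (illustrative)."""
    name = record["name"].lower()
    text = record["description"].lower()
    score = 0.0
    for word in keywords:
        if word in name:
            score += name_weight   # a hit in the name counts the most
        if word in text:
            score += text_weight
    # Each taxonomy term shared with the AI-matched terms adds to the score.
    score += term_weight * len(set(record["taxonomy"]) & set(matched_terms))
    return score


# Invented sample records.
records = [
    {"name": "Community Pantry", "description": "Free groceries for families.",
     "taxonomy": ["Food", "Emergency Food"]},
    {"name": "Downtown Shelter", "description": "Overnight beds.",
     "taxonomy": ["Housing"]},
]
keywords = ["hungry"]
matched_terms = ["Food", "Meals", "Emergency Food"]

scores = {r["name"]: score_record(r, keywords, matched_terms) for r in records}
results = sorted((n for n, s in scores.items() if s > 0),
                 key=lambda n: scores[n], reverse=True)
print(results)
```

With keyword matching alone, neither record would match "hungry". The taxonomy terms are what bring the pantry into the results, while a search on the words of an agency name would still score through the name weight.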

Tests have shown promising results, and in almost all cases the results are more relevant. Work continues on optimizing this code, and we have added a few parameters (only available in the advanced options of Data Portal) to set the number of matching terms and the minimum similarity value.

For example, when a user searches for “I am hungry”, traditional keyword matching would return very few matches: “i” and “am” are common words that carry no search value and are ignored, and “hungry” is not a word found in many agency/program names or descriptions. With the AI keyword enhancement, “I am hungry” retrieves the taxonomy terms Hunger/Food Issues, Food, Meals, Emergency Food, and Overeating/Food Addiction, and these are used with the search. When a specific agency or program name is searched, the results will still show that agency/program as the top result, since we are still using our existing search logic and scoring algorithm.

Performance

It is important to note that using this tool will add about 1 second to your search results page load time. The tool caches searches so that subsequent searches for exactly the same query return in a fraction of a second. Additionally, using caching tools with your site will help keep page load times to a minimum.
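
The effect of caching exact searches can be sketched with Python's built-in functools.lru_cache. This is only an illustration of the idea: how the tool's cache is actually implemented is not described here, and slow_embedding_lookup is a hypothetical stand-in for the remote embedding request.

```python
from functools import lru_cache

calls = {"count": 0}


def slow_embedding_lookup(query):
    """Stand-in for the remote embedding request (hypothetical placeholder)."""
    calls["count"] += 1
    # Dummy vector derived from the text; a real lookup would call the service.
    return tuple(float(ord(c)) for c in query[:3])


@lru_cache(maxsize=1024)
def cached_embedding(query):
    # The cache key is the exact query string, so only identical searches hit it.
    return slow_embedding_lookup(query)


cached_embedding("I am hungry")
cached_embedding("I am hungry")  # served from the cache, no second lookup
print(calls["count"])  # prints 1
```

Because the key is the exact string, a query that differs even in capitalization would trigger a new lookup in this sketch.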

Privacy

The search keywords are sent to OpenAI to retrieve the embedding values. No other data (IP addresses, locations, etc.) is exposed, and the cache and logs are purged regularly. This is more private than using Google Analytics on your site, which tracks everything a user does and searches for on your site (and across the Internet, for that matter).