Enhanced multi-word synonyms and phrase search in hybris
A synonym is a word that has the same or nearly the same meaning as another word or phrase. Handling synonyms well is important for any search engine. In SAP Hybris Commerce, synonyms are handled by Apache SOLR, the built-in search engine. The default implementation has several issues, and this article explains how to overcome them.
Phrase search is a feature that moves documents (products, in our case) higher in the search results if their attributes contain the exact request. The default implementation of this feature is also very basic.
In this article, I present my PoC, which demonstrates better synonym handling and an enhanced phrase search in SAP hybris Commerce.
There is a well-known limitation in how SOLR works with synonyms. One-word synonyms are processed nicely, but as soon as you try to use multi-word synonyms, you will run into issues. Simply put, multi-word synonyms do not work as expected. There is a reason for that: the synonym module starts working only after tokenization is done.
Let’s look at the example.
SOLR can handle synonyms in two ways: index-time and query-time synonym processing. In the default hybris configuration, both are configured in the same way.
In the example above, we may want to create the following synonyms:
I would like to highlight the following points:
- Synonyms may be equivalent in terms of their role (primary/secondary).
- Synonyms may have more than one word on both sides.
- Synonyms may have stopwords (such as articles) and special symbols (such as punctuation marks) as an integral part of them.
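To make these three points concrete, here is a minimal Python sketch of a synonym table that satisfies them. It is not the PoC code: the function names and the grouping of the sample phrases are my assumptions for illustration. Every member of a group is equivalent, members may contain several words, and stopwords and punctuation are kept as part of the phrase.

```python
import re

def normalize(phrase):
    """Lowercase and collapse whitespace; stopwords and punctuation are kept."""
    return re.sub(r"\s+", " ", phrase.strip().lower())

def build_synonym_index(groups):
    """Map every normalized phrase to the other members of its group."""
    index = {}
    for group in groups:
        members = [normalize(p) for p in group]
        for member in members:
            bucket = index.setdefault(member, [])
            for other in members:
                if other != member and other not in bucket:
                    bucket.append(other)
    return index

# Hypothetical groups: no member is "primary", all are interchangeable.
GROUPS = [
    ["the USA", "United States", "America"],   # contains a stopword ("the")
    ["flip-flops", "flip flops", "sandals"],   # contains punctuation ("-")
]
SYNONYMS = build_synonym_index(GROUPS)
```

The lookup key is a whole phrase, so "flip flops" finds its group, while the single token "flip" finds nothing. This is the reason a per-token lookup cannot resolve multi-word synonyms.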
The built-in SOLR synonym processing module works with the token stream. There are two modes of synonym processing OOTB:
Query-time expansion simply replaces words in the query token stream with their synonyms.
However, it has negative side effects. First, the IDF of rare synonyms is boosted, causing unintuitive results: documents containing rare words rank higher in the SOLR search results. This is quite natural, but with synonyms it may confuse the user, because documents with the original high-frequency word are pushed deeper in the search results.
These problems are explained in detail in this article.
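The IDF effect can be shown with a little arithmetic. The sketch below assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). Your SOLR version may be configured with a different similarity, such as BM25, which uses a different curve but points in the same direction. The index size, the terms and their document frequencies are made up.

```python
import math

def classic_idf(num_docs, doc_freq):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 10_000                                # hypothetical index size
DOC_FREQ = {"notebook": 900, "laptop": 40}       # hypothetical synonym pair

# The rarer a term is, the larger its IDF and the more it adds to the score.
IDF = {term: classic_idf(NUM_DOCS, df) for term, df in DOC_FREQ.items()}
rare_wins = IDF["laptop"] > IDF["notebook"]
```

If the user types the frequent word and the query is expanded with the rare synonym, documents that contain only the rare synonym get the larger IDF contribution and can outrank documents that contain the word the user actually typed.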
Index-time synonym expansion is an alternative to query-time expansion, but expanding synonyms at index-time has two major problems.
The first problem is called ‘sausagization’. Its roots lie in the way Lucene works.
For example, if index-time synonym expansion “USA => United States” is performed on a document “
“, it will be indexed with “United” and “USA” occupying one position, and “States” and “Washington” occupying the next (see below). As a result, the phrase query “The first president of the United States of America” will not match this document, and the phrase query “United is Washington” will improperly match.
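The position overlap can be simulated in a few lines of Python. This is a simplified model of a positional index, not Lucene code, and the two-word document is a hypothetical stand-in. The first token of the replacement shares the position of the source token, and the remaining tokens spill over the positions that follow.

```python
def index_with_synonym(tokens, source, replacement):
    """Return {position: set(terms)} with the synonym tokens stacked in."""
    positions = {}
    for pos, token in enumerate(tokens):
        positions.setdefault(pos, set()).add(token)
        if token == source:
            for offset, syn_token in enumerate(replacement):
                positions.setdefault(pos + offset, set()).add(syn_token)
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase tokens occupy consecutive positions."""
    for start in positions:
        if all(tok in positions.get(start + i, set())
               for i, tok in enumerate(phrase)):
            return True
    return False

doc = ["usa", "washington"]                          # hypothetical document
index = index_with_synonym(doc, "usa", ["united", "states"])
# index: position 0 holds usa/united, position 1 holds states/washington
```

In this model the phrase ["united", "washington"] matches although the document never said that, while ["united", "states", "washington"] does not match, because "states" and "washington" are squeezed into the same position.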
Second, to modify index-time synonym expansion, you have to re-index completely. For large indexes, this may take too much time.
To demonstrate how index-time expansion works, let’s index the following document:
and add the following synonym rule to SOLR:
The list of terms for the field “description_text_en” will then be the following:
(The words are shortened by the stemming filter; stopwords are removed by the stopwords filter; “of” is not a stopword in the default configuration; synonyms are added instead of “usa”, so “USA” is not in the index anymore.)
If query-time expansion is not active, a search for “USA” will return zero results, because there is no such term in the document, but a search for “America” will find it. With query-time expansion on, “usa” will work as well.
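The behaviour described above can be reproduced with a toy analyzer. This is a sketch under loud assumptions: the rule “usa => united states of america” and the sample text are my guesses for illustration, and stemming and stopword removal are left out to keep the code short.

```python
def analyze(text, synonyms=None):
    """Lowercase, split on whitespace, optionally replace synonym tokens."""
    terms = []
    for token in text.lower().split():
        if synonyms and token in synonyms:
            terms.extend(synonyms[token])    # the original token is dropped
        else:
            terms.append(token)
    return terms

# Assumed rule, applied as a replacement (the source term is not kept).
RULE = {"usa": ["united", "states", "of", "america"]}

# Index-time expansion: the stored terms no longer include "usa".
indexed_terms = set(analyze("Washington USA", RULE))

def matches(query, query_time_expansion):
    """True if any analyzed query term is present in the indexed terms."""
    query_terms = analyze(query, RULE if query_time_expansion else None)
    return any(term in indexed_terms for term in query_terms)
```

With these definitions, matches("USA", False) is false, matches("America", False) is true, and matches("usa", True) is true, which mirrors the three cases in the paragraph above.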
In hybris, a phrase match can be boosted, and the phrase boost factor is configurable. However, it is important to understand what counts as a phrase. If you search for “the first president of the United States”, a document containing these words can be boosted (placed higher in the search results) only if it contains all these words in the exact order. For example, if you use the phrase “Washington is the first president of the United States”, the document containing the whole phrase will take the first position, which is the right thing, but documents containing only parts of this phrase (“first president”, “president of the United States”, etc.) won’t be boosted at all.
In my solution, the query processing is moved from SOLR to hybris. In SOLR, the filters are switched off in both indexing and query modes.
The module adds partial phrases and their synonyms to the query.
For example, for the query ”
“, hybris OOTB builds the query for SOLR using the following approach:
You can see that only the whole query is treated as a phrase. There is no such thing as a “sub-phrase” with a higher boost factor.
In my solution, I create sub-phrases for the request:
After that, my code finds exact matches of the synonyms for the words and sub-phrases and creates additional phrases for the search:
In the example above, “the USA” was found in the list of synonyms, and new queries were added to the list. Note that these extra phrases contain the synonyms in different combinations with the words from the original phrase (“the USA”).
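The synonym step can be sketched as follows. This is an illustration of the idea rather than the PoC code: the synonym entries and the sample query are hypothetical. Every contiguous span of the original tokens is looked up as a whole phrase, and each exact match produces a new phrase in which only that span is replaced.

```python
def synonym_variants(tokens, synonyms):
    """Replace each exact synonym match in the ORIGINAL tokens, one at a time."""
    variants = []
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, n + 1):
            span = " ".join(tokens[start:end])
            for replacement in synonyms.get(span, []):
                # Only the original tokens are scanned, so the output of one
                # replacement is never fed back in as input for another one.
                variant = tokens[:start] + replacement.split() + tokens[end:]
                phrase = " ".join(variant)
                if phrase not in variants:
                    variants.append(phrase)
    return variants

SYNONYMS = {"the usa": ["united states", "america"]}   # hypothetical entries
query = "president of the usa".split()
extra_phrases = synonym_variants(query, SYNONYMS)
# -> ["president of united states", "president of america"]
```

Running the same function over every sub-phrase of the request gives the combinations of synonyms with the surrounding original words.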
The phrase matches (the last 17 items) have higher boost factors than the word matches (the first five items). Among the phrases, longer phrases have higher boost factors than shorter ones.
There is a special case when synonyms overlap:
However, replacements can’t be used to create new replacements:
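A rough sketch of how such a list of words and sub-phrases could be turned into a boosted query is shown below. The boost values, the OR-joined layout and the function names are assumptions, and this is not the query that hybris or the PoC actually generates. Only the general Lucene syntax field:"phrase"^boost is relied upon, and special characters are not escaped.

```python
def sub_phrases(tokens, min_len=2):
    """All contiguous sub-phrases with at least min_len words, shortest first."""
    n = len(tokens)
    return [tokens[i:i + size]
            for size in range(min_len, n + 1)
            for i in range(n - size + 1)]

def build_query(text, field="description_text_en",
                word_boost=1.0, phrase_boost=10.0):
    """Word clauses first, then phrase clauses whose boost grows with length."""
    tokens = text.lower().split()
    clauses = [f"{field}:{tok}^{word_boost}" for tok in tokens]
    for phrase in sub_phrases(tokens):
        boost = phrase_boost * len(phrase)      # longer phrase, higher boost
        joined = " ".join(phrase)
        clauses.append(f'{field}:"{joined}"^{boost}')
    return " OR ".join(clauses)

query = build_query("men t-shirt logo")
```

For a five-word request this produces five word clauses and ten sub-phrase clauses; the phrases added by synonym replacement come on top of that.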
In the example below, there is a rule saying that “Flip-flops” (or “flip flops”) and “Sandals” are synonyms. “Flip-flops” contains two tokens (two words -> one word), so the OOTB synonyms won’t work. In this solution, the custom query contains both “sandals” and “flip-flops” (or “flip flops”).
In the example below, “slippers” and “slip on” are synonyms (here we see the rule “one word -> two words”):
In the example below, I demonstrate how sub-phrase search works. If you search for “men t-shirt logo” in the default hybris configuration, you will get all products containing “men”, “t-shirt” and “logo”, but these products will be sorted by relevancy, which takes into account how many times these words are used in the product attributes and how rare they are.
In the default configuration, you will have the following products for the request “men t-shirt logo”:
Note that the black t-shirt is in second place. In my solution, this product is higher in the search results because its name contains “t-shirt logo”, a sub-phrase of the original request (“men t-shirt logo”). (This example is not the best one because the demo database that comes with hybris is small and simple; I can’t use anything other than what hybris provides OOTB.)
© Rauf Aliev, August 2017