Hybris SOLR query builders and search relevance
In SAP Hybris Commerce, the requests for the SOLR search engine are created by Query Builders. Simply put, these components convert user queries into SOLR queries. You cannot explain the search results if you don’t know what SOLR request was generated and why it contains particular conditions in a particular form. Unfortunately, the official documentation is very sparse and lacks examples. This article explains the differences between the available query builders. It is also a good complement to one of the previous articles about search relevancy.
Hybris has the following OOTB query builders:
“, your SOLR request will look like:
So the pattern is
Hybris uses multiMaxParser, so the largest score wins, in all pattern components (both f1…fN and EXACT/FUZZY/WILDCARD/PHRASE groups).
Result:
—
—
—
Note that the second and third words are considered as a single token here. Only double quotes work.
Phrase queries. In the example above, the phrase query is built from the original query by removing the double quotes. So the phrase query will look like
Boosting. It uses specific boost factors for the exact match, fuzzy match, wildcard match and phrase match, and these factors are configurable in the backoffice. Fuzziness, sloppiness and the wildcard query type are configurable too.
Sloppiness. A sloppy phrase query specifies a maximum “slop”, that is, the number of positions the tokens may be moved to get a match, or in other words, how many word moves are needed to turn the query into an exact match. The slop is zero by default, requiring exact matches.
For example, “the President of first” with slop=3 will match the document containing “the first President of the USA is Washington”, but slop=2 won’t. Slop=2 will work for the query “the President first”, for example.
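The slop arithmetic in this example can be checked with a small sketch. It is a simplified model, not the actual Lucene implementation: it assumes whitespace tokenization, lowercasing and no stop-word removal, and the function names `min_slop` and `sloppy_match` are made up for illustration.

```python
from itertools import product
from typing import Optional


def min_slop(query: str, document: str) -> Optional[int]:
    """Smallest slop at which the phrase `query` matches `document`.

    Simplified model (assumption): whitespace tokens, lowercase,
    no stop-word removal, no analyzer-specific behaviour.
    """
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    # All document positions of every query term.
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if any(not p for p in positions):
        return None  # a query term is missing, so no slop value helps
    best = None
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # two query terms cannot share one document position
        # Shift each document position by the term's offset in the query;
        # an exact phrase match gives identical shifted values (width 0).
        shifted = [doc_pos - q_pos for q_pos, doc_pos in enumerate(combo)]
        width = max(shifted) - min(shifted)
        if best is None or width < best:
            best = width
    return best


def sloppy_match(query: str, document: str, slop: int) -> bool:
    """True if `query` matches `document` within the given slop."""
    needed = min_slop(query, document)
    return needed is not None and needed <= slop
```

With the document “the first President of the USA is Washington”, this model needs a slop of 3 for “the President of first” and a slop of 2 for “the President first”, which agrees with the behaviour described above.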
Fuzziness. It is a similar thing, but for the letters of the tokens: the maximum number of edits allowed for a match. For example, “persident” will match “president” with fuzziness=1.
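To make the “number of edits” concrete, here is a minimal sketch of an edit distance that counts a swap of two adjacent letters as a single edit. That counting rule is an assumption of this sketch (it is the optimal string alignment variant of Damerau-Levenshtein), and `edit_distance` and `fuzzy_match` are illustrative names, not SOLR or hybris APIs.

```python
def edit_distance(a: str, b: str) -> int:
    """Edits needed to turn `a` into `b`: insert, delete, substitute,
    or swap two adjacent letters (each counts as one edit)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all letters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all letters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "er" -> "re".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def fuzzy_match(term: str, indexed_term: str, fuzziness: int) -> bool:
    """True if `term` is within `fuzziness` edits of `indexed_term`."""
    return edit_distance(term, indexed_term) <= fuzziness
```

Under this rule “persident” is one edit away from “president” (the letters “e” and “r” are swapped), so it matches with fuzziness=1 and does not match with fuzziness=0.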
For example, the SOLR query for the request
So the pattern is
All components are joined using OR. The request is very close to what hybris creates using the query builders (see examples above).
Let’s look at the relevancy calculation for two different cases:
This debug information shows that
, because it is a
, because it is a maximum of scores for
and
For example, we have the following documents in SOLR:
Let’s take the following request:
Different parsers will show the documents in different order:
eDisMax parser is based on LuceneQParser, so you will have the same scores and results with eDismax for this set.
- Default Free Text Query Builder.
- Multi-Field Free Text Query Builder.
- DisMax Free Text Query Builder.
Default Free Text Query Builder
The default query builder is the simplest one in hybris and, as its name says, it is used by default. It uses the custom hybris relevancy formula, multiMaxScore (see the last section for details). Example. If you search for “full text string”, your SOLR request will look like:
- For each field defined as “Full text”:
  - OR EXACT MATCH for “full” (boosting X)
  - OR EXACT MATCH for “text” (the same)
  - OR EXACT MATCH for “string” (the same)
  - If the “wildcard” flag is active for the field:
    - OR WILDCARD MATCH for “full*” (boosting X/2)
    - OR WILDCARD MATCH for “text*” (the same)
    - OR WILDCARD MATCH for “string*” (the same)
  - If fuzzy search is active for the field:
    - If the field type is “text”:
      - OR FUZZY MATCH for “full~” (with the specified fuzziness, if any; boosting X/4)
      - OR FUZZY MATCH for “text~” (the same)
      - OR FUZZY MATCH for “string~” (the same)
  - If the “phrase search” flag is active for the field:
    - OR PHRASE MATCH for “full text string” (boosting X*2)
- There is only one configurable boosting factor: the one for the field. The wildcard, fuzzy and phrase boosting factors are derived from the field boosting factor and are not configurable (hard-coded).
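The rules above can be summarised in a short sketch that derives every boost from the single field boost X. It only illustrates the X, X/2, X/4 and X*2 relationship described in this section and is not the hybris implementation; the function name, its parameters and the sample field `name_text_en` are assumptions made for the example.

```python
from typing import List, Optional


def default_builder_clauses(
    tokens: List[str],
    field: str,
    boost: float,
    wildcard: bool = False,
    fuzzy: bool = False,
    phrase: bool = False,
    is_text_field: bool = True,
    fuzziness: Optional[int] = None,
) -> str:
    """Build the OR-joined clauses for one field (illustrative only)."""
    boost = float(boost)
    # Exact match for every token, with the configured field boost X.
    clauses = [f"({field}:{t}^{boost})" for t in tokens]
    if wildcard:
        # Wildcard match, hard-coded boost X/2.
        clauses += [f"({field}:{t}*^{boost / 2})" for t in tokens]
    if fuzzy and is_text_field:
        # Fuzzy match, hard-coded boost X/4; fuzziness is appended if given.
        tilde = "~" if fuzziness is None else f"~{fuzziness}"
        clauses += [f"({field}:{t}{tilde}^{boost / 4})" for t in tokens]
    if phrase:
        # Phrase match on the whole query, hard-coded boost X*2.
        phrase_text = " ".join(tokens)
        clauses.append(f'({field}:"{phrase_text}"^{boost * 2})')
    return " OR ".join(clauses)
```

For example, `default_builder_clauses(["full", "text", "string"], "name_text_en", 100, wildcard=True, fuzzy=True, phrase=True)` produces exact clauses boosted by 100.0, wildcard clauses by 50.0, fuzzy clauses by 25.0 and one phrase clause by 200.0.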
word1 "word2 word3" word4 will look like:
(
(code_string:word1^90.0) OR
(keywords_text_fr:word1^20.0) OR
...
(name_text_fr:word1^100.0)
) OR (
(code_string:"word2 word3"^90.0) OR
(keywords_text_fr:"word2 word3"^20.0) OR
...
(name_text_fr:"word2 word3"^100.0)
) OR (
(code_string:word4^90.0) OR
...
(name_text_fr:word4^100.0)
) OR (
(keywords_text_fr:word1~^10.0) OR
...
(name_text_fr:word1~^25.0)
) OR (
(keywords_text_fr:"word2 word3"~^10.0) OR
...
(name_text_fr:"word2 word3"~^25.0)
) OR (
(keywords_text_fr:word4~^10.0) OR
...
(name_text_fr:word4~^25.0)
) OR (
(code_string:word1*^45.0) OR
(ean_string:word1*^50.0)
) OR (
(code_string:"word2 word3"*^45.0) OR
(ean_string:"word2 word3"*^50.0)
) OR (
(code_string:word4*^45.0) OR
(ean_string:word4*^50.0)
) OR (
(keywords_text_fr:"word1 word2 word3 word4"^40.0) OR
...
(name_text_fr:"word1 word2 word3 word4"^100.0)
)
EXACT (f1,f2,...fN) OR FUZZY (f1,f2,...fN) OR WILDCARD (f1,f2,...fN) OR PHRASE (f1,f2,...fN).
Multi-Field Free Text Query Builder
According to the documentation, it builds the query in such a way that the final score is the sum of the scores of all the subqueries. This is how SOLR works by default. However, it also differs from the default free text query builder in other respects.
Tokens. The builder tokenizes the user query by splitting it on whitespace. It also supports quoted phrases. For example, user query: word1 "word2 word3" word4
word1
word2 word3
word4
word1 word2 word3 word4
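The tokenization described for this builder can be sketched as follows. This is a minimal illustration under the assumption that only balanced double quotes group words, as stated in the text; `tokenize` and `phrase_query` are hypothetical helper names, not hybris classes.

```python
import re
from typing import List

# Either a double-quoted phrase (captured without the quotes)
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def tokenize(user_query: str) -> List[str]:
    """Split on whitespace, keeping double-quoted phrases as one token."""
    tokens = []
    for quoted, bare in TOKEN_RE.findall(user_query):
        token = (quoted or bare).strip()
        if token:  # skip empty quotes such as ""
            tokens.append(token)
    return tokens


def phrase_query(user_query: str) -> str:
    """The phrase query: the original query with the double quotes removed."""
    return " ".join(tokenize(user_query))
```

For the user query `word1 "word2 word3" word4` this yields the three tokens and the phrase listed above. Single quotes are not treated specially, so `'a b'` is still split into two tokens.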
slop=3
slop=2
Slop=2
persident
president
word1 "word2 word3" word4 will look like:
(code_string:
(word1^90.0 OR
"word2 word3"^90.0 OR
word4^90.0 OR
word1*^45.0 OR
"word2 word3"*^45.0 OR
word4*^45.0)
) OR
(keywords_text_fr:
(word1^20.0 OR
"word2 word3"^20.0 OR
word4^20.0 OR
word1~^10.0 OR
"word2 word3"~^10.0 OR
word4~^10.0 OR
"word1 word2 word3 word4"^40.0)
) OR
...
(name_text_fr:
(word1^100.0 OR
"word2 word3"^100.0 OR
word4^100.0 OR
word1~^25.0 OR
"word2 word3"~^25.0 OR
word4~^25.0 OR
"word1 word2 word3 word4"^100.0)
)
f1 (EXACT, FUZZY, WILDCARD, PHRASE) OR f2 (EXACT, FUZZY, WILDCARD, PHRASE) ... OR fN (EXACT, FUZZY, WILDCARD, PHRASE).
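The per-field pattern above can be reproduced with a small sketch that groups all match types under each field. It is an illustration of the query shape only, not the hybris builder; `FieldConfig`, `field_group` and `multi_field_query` are names invented for this example, and the boosts are the sample values from the request shown earlier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldConfig:
    """Boosts for one indexed field; None disables that match type."""
    name: str
    exact: float
    wildcard: Optional[float] = None
    fuzzy: Optional[float] = None
    phrase: Optional[float] = None


def term(token: str) -> str:
    """Quote multi-word tokens so they stay a single SOLR term."""
    return f'"{token}"' if " " in token else token


def field_group(field: FieldConfig, tokens: List[str]) -> str:
    """All match types for one field: f (EXACT, WILDCARD, FUZZY, PHRASE)."""
    parts = [f"{term(t)}^{field.exact}" for t in tokens]
    if field.wildcard is not None:
        parts += [f"{term(t)}*^{field.wildcard}" for t in tokens]
    if field.fuzzy is not None:
        parts += [f"{term(t)}~^{field.fuzzy}" for t in tokens]
    if field.phrase is not None:
        phrase_text = " ".join(tokens)
        parts.append(f'"{phrase_text}"^{field.phrase}')
    joined = " OR ".join(parts)
    return f"({field.name}:({joined}))"


def multi_field_query(fields: List[FieldConfig], tokens: List[str]) -> str:
    """Join the per-field groups with OR."""
    return " OR ".join(field_group(f, tokens) for f in fields)
```

Calling `multi_field_query` with a `code_string` field (exact 90.0, wildcard 45.0) and a `keywords_text_fr` field (exact 20.0, fuzzy 10.0, phrase 40.0) for the tokens `word1`, `word2 word3`, `word4` gives the same structure as the request shown above, apart from whitespace.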
DisMax Free Text Query Builder
Similar to the previous one, but it groups some of the subqueries. The score for a group is the maximum score of the subqueries that belong to that group (and not the sum). Hybris uses its custom relevancy formula (multiMaxScore); for details, see the last section of this article. The DisMax query builder also supports quotes in the query, boosting, sloppiness and fuzziness. This query builder supports the parameters “groupByQueryType” and “tie”.
- Group By Query Type. It changes the way disjunction max queries are grouped. If set to true, queries are also grouped by type (where the types are: free text query, free text fuzzy query, free text wildcard query).
- Tie. The “tie” parameter defines how much the lower-scoring fields contribute to the final score of the query compared to the highest-scoring field: “0.0” makes the query a pure “disjunction max query”, “1.0” makes it a pure “disjunction sum query”, where it doesn’t matter which subquery has the maximum score.
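The effect of “tie” is easiest to see as a formula: the group score is the best subquery score plus “tie” times the sum of the remaining scores. The sketch below implements this standard disjunction max formula; the function name `dismax_score` and the sample scores are made up for illustration.

```python
from typing import Sequence


def dismax_score(sub_scores: Sequence[float], tie: float = 0.0) -> float:
    """Disjunction max score: best score + tie * (sum of the other scores)."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie * (sum(sub_scores) - best)


# Hypothetical scores of one token in three fields.
scores = [0.5, 0.25, 0.25]
pure_max = dismax_score(scores, tie=0.0)  # only the best field counts
pure_sum = dismax_score(scores, tie=1.0)  # all fields are summed
blended = dismax_score(scores, tie=0.5)   # lower-scoring fields count for half
```

With tie=0.0 the result is the maximum of the scores, with tie=1.0 it is their plain sum, and values in between blend the two behaviours.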
Understanding MultiMax Query Parser
Hybris uses a custom MultiMax query parser developed by SAP for two query builders, DisMax and Default. The plugin is very simple, but you need to know that it changes the way the score is calculated. The easiest way to explain it is to demonstrate the internals by example. Let’s take the following sample documents for experimenting:
Document #1.
- id: “doc1”
- title_text_en: “the first President of the USA is Washington titleA”
- description_text_en: “the first President of the USA is Washington”
Document #2.
- id: “doc2”
- title_text_en: “the second President of the USA is John Adams titleB”
- description_text_en: “the second President of the USA is John Adams”
Document #3.
- id: “doc3”
- title_text_en: “the first head of the USA is Washington titleC”
- description_text_en: “the first head of the USA is Washington”
(title_text_en:"first" OR description_text_en:first) OR (title_text_en:"titleC" OR description_text_en:"titleC")
- Default parser (LuceneQParser)
- Hybris custom parser (multiMaxScoreParser)
- LuceneQParser calculates the score for each subquery and sums them up to get the total score.
- multiMaxScoreParser also sums up the scores of the subqueries. However, it doesn’t sum up the scores of the components within a subquery; it takes the maximum of them instead.
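The difference between the two parsers can be shown with a few lines of arithmetic. This sketch follows the description in this article (sum everything versus sum of per-subquery maximums) and is not the parser source code; the function names and the component scores are hypothetical.

```python
from typing import List

# A query is modelled as a list of subqueries,
# each subquery as a list of component scores.
Query = List[List[float]]


def lucene_score(query: Query) -> float:
    """Default behaviour: every component score is added up."""
    return sum(sum(components) for components in query)


def multimax_score(query: Query) -> float:
    """multiMaxScore as described here: take the maximum inside each
    subquery, then sum these maximums over the subqueries."""
    return sum(max(components) for components in query if components)


# Hypothetical scores for one document and a query of the form
# (title:first OR description:first) OR (title:titleC OR description:titleC)
doc_scores = [[0.5, 0.25], [0.125, 0.0]]
```

For these sample scores `lucene_score(doc_scores)` is 0.875 while `multimax_score(doc_scores)` is 0.625, because a term that matches in two fields of the same group is counted only once, at its best score.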
score = 0.55
score = 0.55
title_text_en:first
description_text_en:first
title_text_en:"first" OR description_text_en:first
Default Query Builder Example
Multi-Field Query Builder Example
Note that the order of the documents is a bit different because of the different way of grouping and calculating subqueries. To sum up,
- The Default query builder uses only one boosting factor; all the others are derived from it. It doesn’t recognize quotes in the query. It uses multiMaxScore instead of the SOLR default LuceneQParser.
- The Multi-Field query builder doesn’t use multiMaxScore. It recognizes quotes in the query. It supports exact match, phrase, fuzzy and wildcard boosts, fuzziness and sloppiness.
- The DisMax query builder uses multiMaxScore and recognizes quotes in the query. It supports exact match, phrase, fuzzy and wildcard boosts, fuzziness and sloppiness.
© Rauf Aliev, August 2017
Grzegorz Lebek
29 August 2017 at 06:58
One comment on that. DefaultFreeTextQueryBuilder is just an alias for DisMaxFreeTextQueryBuilder, so it behaves the same (e.g. it’s possible to configure boosts in the backoffice for the default builder). The DefaultFreeTextQueryBuilder behavior described here (with hardcoded boosts) might be valid for a legacy mode (though there were no query builders back then), but I am not entirely sure.
Rauf Aliev
29 August 2017 at 09:46
Let’s double check. When I was experimenting, I clearly saw hardcoded boosts and different behavior with the default settings compared to the DisMax builder. The source code also says that they have different logic, not a simple alias. Once I am back at my laptop, I will check it again.
Ramy
30 August 2017 at 10:41
Hi, the list of query builders in the backoffice is just a list of beans, and by looking at the bean definitions we can see that both the DefaultFreeTextQueryBuilder and DisMaxFreeTextQueryBuilder beans point to the same implementation, DisMaxFreeTextQueryBuilder, so there is basically no difference in behavior between the two beans.