Hybris SOLR query builders and search relevance

In SAP Hybris Commerce, the requests for the SOLR search engine are created by Query Builders. Simply put, these components convert user queries into SOLR queries. You cannot explain the search results if you don’t know what SOLR request was generated and why it contains particular conditions in a particular form. Unfortunately, the official documentation is very sparse and lacks examples. This article explains the differences between the available query builders. It is also a good complement to one of the previous articles about search relevancy.

Hybris has the following OOTB query builders:

  • Default Free Text Query Builder
  • Multi-Field Free Text Query Builder
  • DisMax Free Text Query Builder.

Hybris uses a custom relevance formula for two of these builders, Default and DisMax. For the Multi-Field builder, hybris uses the SOLR default formula (LuceneQParser). For details on this topic, see the last section of this article.

Default Free Text Query Builder

The default query builder is the simplest one in hybris. As its name says, this query builder is used by default. However, it uses the hybris custom relevancy formula, multiMaxScore (see the last section for details).

Example. If you search for “full text string”, your SOLR request will look like:

  • For each field defined as “Full text”:
    • OR EXACT MATCH for “full” (boosting X)
    • OR EXACT MATCH for “text” (the same)
    • OR EXACT MATCH for “string” (the same)
    • If the “wildcard” flag is active for the field:
      • OR WILDCARD MATCH for “full*” (boosting X/2)
      • OR WILDCARD MATCH for “text*” (the same)
      • OR WILDCARD MATCH for “string*” (the same)
    • If fuzzy search is active for the field:
      • If the field type is “text”:
        • OR FUZZY MATCH for “full~” (with the specified fuzziness, if any; boosting X/4)
        • OR FUZZY MATCH for “text~” (the same)
        • OR FUZZY MATCH for “string~” (the same)
    • If the “phrase search” flag is active for the field:
      • OR PHRASE MATCH for “full text string” (boosting X*2)

Note that:

  • There is only one configurable boosting factor: the one for the field. The wildcard, fuzzy and phrase search boosting factors are derived from the field boosting factor and are not configurable (hard-coded).
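The rules above can be sketched in a few lines of Python. This is only an illustration of the list, not the hybris implementation: the function names are made up, and the multipliers (X/2, X/4, X*2) are taken from the list rather than from the hybris source code. The restriction of fuzzy search to “text” fields is left to the caller.

```python
def term_clauses(field, term, boost, wildcard=False, fuzzy=False):
    """Return (clause, boost) pairs for one field and one term.

    Hypothetical helper: the multipliers follow the list above
    (exact = X, wildcard = X/2, fuzzy = X/4).
    """
    clauses = [(f"{field}:{term}", boost)]                # exact match, boost X
    if wildcard:
        clauses.append((f"{field}:{term}*", boost / 2))   # wildcard match, X/2
    if fuzzy:
        clauses.append((f"{field}:{term}~", boost / 4))   # fuzzy match, X/4
    return clauses


def phrase_clause(field, text, boost):
    """Phrase match over the whole query text, boost X*2 as in the list."""
    return (f'{field}:"{text}"', boost * 2)


def render(clauses):
    """Join the clauses with OR, attaching each boost with ^."""
    return " OR ".join(f"({clause}^{boost})" for clause, boost in clauses)


print(render(term_clauses("code_string", "word1", 90.0, wildcard=True)))
# (code_string:word1^90.0) OR (code_string:word1*^45.0)
print(render(term_clauses("name_text_fr", "word1", 100.0, fuzzy=True)))
# (name_text_fr:word1^100.0) OR (name_text_fr:word1~^25.0)
```

The point of the sketch is that a single number per field drives every derived boost, so changing the field boost rescales all match types at once.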

For example, the SOLR query for the request

word1 "word2 word3" word4

will look like:

(
 (code_string:word1^90.0) OR
 (keywords_text_fr:word1^20.0) OR
 ...
 (name_text_fr:word1^100.0)
) OR (
 (code_string:"word2 word3"^90.0) OR
 (keywords_text_fr:"word2 word3"^20.0) OR
 ...
 (name_text_fr:"word2 word3"^100.0)
) OR (
 (code_string:word4^90.0) OR
 ...
 (name_text_fr:word4^100.0)
) OR (
 (keywords_text_fr:word1~^10.0) OR
 ...
 (name_text_fr:word1~^25.0)
) OR (
 (keywords_text_fr:"word2 word3"~^10.0) OR
 ...
 (name_text_fr:"word2 word3"~^25.0)
) OR (
 (keywords_text_fr:word4~^10.0) OR
 ...
 (name_text_fr:word4~^25.0)
) OR (
 (code_string:word1*^45.0) OR
 (ean_string:word1*^50.0)
) OR (
 (code_string:"word2 word3"*^45.0) OR
 (ean_string:"word2 word3"*^50.0)
) OR (
 (code_string:word4*^45.0) OR
 (ean_string:word4*^50.0)
) OR (
 (keywords_text_fr:"word1 word2 word3 word4"^40.0) OR
 ...
 (name_text_fr:"word1 word2 word3 word4"^100.0)
)

So the pattern is

EXACT (f1,f2,...fN) OR FUZZY (f1,f2,...fN) OR WILDCARD (f1,f2,...fN) OR PHRASE (f1,f2,...fN).

Hybris uses multiMaxParser, so the largest score wins, in all pattern components (both f1…fN and EXACT/FUZZY/WILDCARD/PHRASE groups).

Multi-Field Free Text Query Builder

According to the documentation, it builds the query in such a way that the final score is the sum of the scores of all the subqueries. This is how SOLR works by default.

However, it works differently than the default free text query builder in other aspects as well.

Tokens. The builder tokenizes the user query by splitting it on whitespace. However, it also supports quoted phrases. For example,

User query: word1 "word2 word3" word4
Result:
word1
word2 word3
word4

Note that the second and third words are considered as a single token here. Only double quotes work.

Phrase queries. In the example above, the phrase query is built from the original query by removing the double quotes. So the phrase query will look like word1 word2 word3 word4.
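The tokenization and the phrase query described above can be reproduced with a short Python sketch. The function names are hypothetical and the code only mimics the behaviour described in this article; it is not the hybris tokenizer.

```python
import re


def tokenize(user_query):
    """Split on whitespace, keeping double-quoted phrases as single tokens."""
    tokens = []
    # First alternative captures a quoted phrase, second one a bare word.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', user_query):
        token = quoted or bare
        if token:
            tokens.append(token)
    return tokens


def phrase_query(user_query):
    """Remove the double quotes and normalize the whitespace."""
    return " ".join(user_query.replace('"', " ").split())


user_query = 'word1 "word2 word3" word4'
print(tokenize(user_query))      # ['word1', 'word2 word3', 'word4']
print(phrase_query(user_query))  # word1 word2 word3 word4
```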

Boosting. It uses specific boost factors for the exact match, fuzzy match, wildcard match and phrase match, and these factors are configurable in the backoffice. Fuzziness, sloppiness and the wildcard query type are configurable too.

Sloppiness. A sloppy phrase query specifies a maximum “slop”: the number of positions the tokens may be moved to get a match, or in other words, how many word moves are needed to turn the query into an exact match. The slop is zero by default, requiring exact matches.

For example, “the President of first” with slop=3 will match the document containing “the first President of the USA is Washington”, but slop=2 won’t. Slop=2 will work for the query “the President first”, for example.
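To check the numbers in this example, here is a simplified model of sloppy phrase matching in Python. It is an assumption-laden sketch, not Lucene code: for every query term it takes the shift between its position in the document and its position in the query, and treats the spread of these shifts (max minus min) as the slop that is needed. The function name `min_slop` is made up for this illustration.

```python
from itertools import product


def min_slop(query, document):
    """Smallest slop for which the phrase matches, or None if a term is missing."""
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    # All document positions of each query term.
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # two query terms cannot sit on the same document position
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        best = spread if best is None else min(best, spread)
    return best


doc = "the first President of the USA is Washington"
print(min_slop("the President of first", doc))  # 3
print(min_slop("the President first", doc))     # 2
```

Under this model the first query needs slop=3 and the second one slop=2, which agrees with the example above.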

Fuzziness. Fuzziness is a similar thing, but for the letters of the tokens. It is the maximum number of edits allowed for a match. For example, “persident” will match “president” with fuzziness=1.
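The “persident” example works with a single edit only if swapping two adjacent letters counts as one edit, which, as far as I know, is how Lucene fuzzy queries behave by default. The Python sketch below computes such a distance (the optimal string alignment variant of Damerau-Levenshtein). It is an illustration of the idea, not the automaton-based implementation Lucene actually uses, and the function names are my own.

```python
def edit_distance(a, b):
    """Edit distance where a swap of two adjacent letters counts as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def fuzzy_match(term, candidate, fuzziness):
    """True if the candidate is within the allowed number of edits."""
    return edit_distance(term, candidate) <= fuzziness


print(edit_distance("persident", "president"))   # 1
print(fuzzy_match("persident", "president", 1))  # True
```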

For example, the SOLR query for the request

word1 "word2 word3" word4

will look like:

(code_string:
  (word1^90.0 OR
  "word2 word3"^90.0 OR
   word4^90.0 OR
   word1*^45.0 OR
  "word2 word3"*^45.0 OR
   word4*^45.0)
) OR
(keywords_text_fr:
  (word1^20.0 OR
  "word2 word3"^20.0 OR
   word4^20.0 OR
   word1~^10.0 OR
  "word2 word3"~^10.0 OR
   word4~^10.0 OR
  "word1 word2 word3 word4"^40.0)
) OR
...
(name_text_fr:
  (word1^100.0 OR
  "word2 word3"^100.0 OR
   word4^100.0 OR
   word1~^25.0 OR
  "word2 word3"~^25.0 OR
   word4~^25.0 OR
  "word1 word2 word3 word4"^100.0)
)

So the pattern is

f1 (EXACT, FUZZY, WILDCARD, PHRASE) OR f2 (EXACT, FUZZY, WILDCARD, PHRASE) ... OR fN (EXACT, FUZZY, WILDCARD, PHRASE).
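The field-first grouping can be reproduced with a small Python sketch. It only assembles a query string in the shape shown above; the function name and its parameters are hypothetical, and the boost values are simply passed in, as they would come from the backoffice configuration.

```python
def field_group(field, tokens, boost, fuzzy_boost=None, wildcard_boost=None,
                phrase=None, phrase_boost=None):
    """Build one "field:(...)" group: exact, fuzzy, wildcard, then phrase."""
    def quote(token):
        # Multi-word tokens (quoted phrases in the user query) keep their quotes.
        return f'"{token}"' if " " in token else token

    parts = [f"{quote(t)}^{boost}" for t in tokens]
    if fuzzy_boost is not None:
        parts += [f"{quote(t)}~^{fuzzy_boost}" for t in tokens]
    if wildcard_boost is not None:
        parts += [f"{quote(t)}*^{wildcard_boost}" for t in tokens]
    if phrase is not None and phrase_boost is not None:
        parts.append(f'"{phrase}"^{phrase_boost}')
    joined = " OR ".join(parts)
    return f"({field}:({joined}))"


tokens = ["word1", "word2 word3", "word4"]
groups = [
    field_group("code_string", tokens, 90.0, wildcard_boost=45.0),
    field_group("keywords_text_fr", tokens, 20.0, fuzzy_boost=10.0,
                phrase="word1 word2 word3 word4", phrase_boost=40.0),
]
print(" OR ".join(groups))
```

Each field becomes one group, so with the default SOLR scoring every matching field adds to the total score of the document.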

DisMax Free Text Query Builder

Similar to the previous one, but it groups some of the subqueries. The score for the group will be the maximum score of the subqueries that belong to that group (and not the sum). Hybris uses its custom relevancy formula (multiMaxScore). For details on this topic, see the last section of this article.

The Dismax Query Builder also supports quotes in the query, boosting, sloppiness and fuzziness.

This query builder supports the parameters “groupByQueryType” and “tie”.

  • Group By Query Type. It changes the way of grouping disjunction max queries. If set to true, it also groups queries by type (where the types are: free text query, free text fuzzy query, free text wildcard query).
  • Tie. The “tie” parameter defines how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field: “0.0” makes a query a pure “disjunction max query”, “1.0” makes the query a pure “disjunction sum query” where it doesn’t matter what the maximum scoring sub query is.
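The effect of “tie” is easier to see with numbers. The sketch below uses the usual disjunction max formula (the best score plus tie times the sum of the other scores); the sub-scores are made up, and I assume that hybris applies the parameter in the standard way.

```python
def dismax_score(sub_scores, tie=0.0):
    """Highest sub-score plus tie times the sum of the remaining ones."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie * (sum(sub_scores) - best)


scores = [0.5, 0.25, 0.25]        # hypothetical per-field scores
print(dismax_score(scores, 0.0))  # 0.5  -> pure disjunction max
print(dismax_score(scores, 1.0))  # 1.0  -> pure disjunction sum
print(dismax_score(scores, 0.5))  # 0.75 -> somewhere in between
```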

Understanding MultiMax Query Parser

Hybris uses a custom MultiMax query parser developed by SAP for two query builders, DisMax and Default. The plugin is very simple, but you need to know that it modifies the way the score is calculated.

The easiest way of explaining it is demonstrating the internals by example.

Let’s take the following sample documents for experimenting:

Document #1

  • id: “doc1”
  • title_text_en: “the first President of the USA is Washington titleA”
  • description_text_en: “the first President of the USA is Washington”

Document #2

  • id: “doc2”
  • title_text_en: “the second President of the USA is John Adams titleB”
  • description_text_en: “the second President of the USA is John Adams”

Document #3

  • id: “doc3”
  • title_text_en: “the first head of the USA is Washington titleC”
  • description_text_en: “the first head of the USA is Washington”

Take the following sample request:

(title_text_en:"first" OR description_text_en:first) OR (title_text_en:"titleC" OR description_text_en:"titleC")

All components are joined using OR. The request is very close to what hybris creates using the query builders (see examples above).

Let’s look at the relevancy calculation for two different cases:

  • Default parser (LuceneQParser)
  • Hybris custom parser (multiMaxScoreParser)

and compare the results.

The screenshots below may look difficult to understand. Don’t read everything – just look through. Note that multiMaxScore uses a max function where LuceneQParser uses a sum function. This is a key difference between the custom and default query parsers.

2017-08-11_16h28_08.png
2017-08-11_16h34_28.png
2017-08-11_16h30_31.png

This debug information shows that

  • LuceneQParser calculates the score for each subquery and sums them up to get the total score.
  • multiMaxScoreParser also sums up the scores of the subqueries. However, it doesn’t sum up the scores of the components inside a subquery; it takes the maximum of them.

The latter statement means that in the default hybris implementation of the scoring formula and with the dismax/multimax parsers, it may not be important how many fields contain a particular token. For a particular token and a particular field, the score depends on the global and local frequency of the term and the field length. I used “may not” because there are other components of the formula that make the dependency indirect.

For example, “first” is used in both fields, in the title and in the description. The hybris formula, multiMaxScore, calculates the score = 0.55, because it is the maximum of the scores for title_text_en:first and description_text_en:first.
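The difference between the two parsers can be modelled in a few lines of Python. This is a sketch of the behaviour described in this section, not the parser code: each inner list holds the per-field scores of one parenthesized subquery from the sample request. Only the maximum of 0.55 comes from the example above; the second value (0.25) and the function names are made up for the illustration.

```python
def lucene_score(groups):
    """Default behaviour: every matching clause adds to the total."""
    return sum(sum(group) for group in groups)


def multimax_score(groups):
    """multiMaxScore as described here: max inside a group, sum across groups."""
    return sum(max(group) for group in groups if group)


groups = [
    [0.55, 0.25],  # title_text_en:first, description_text_en:first (0.25 is hypothetical)
    [0.0, 0.0],    # title_text_en:titleC, description_text_en:titleC (no match)
]
print(multimax_score(groups))          # 0.55
print(round(lucene_score(groups), 2))  # 0.8
```

With multiMaxScore, a second field that also contains the token does not raise the score, while with the default parser it does.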

For example, we have the following documents in SOLR:

2017-08-11_17h51_59.png

Let’s take the following request:

title_text_en:"first" OR description_text_en:first

Different parsers will show the documents in different order:

2017-08-11_18h03_29

The eDisMax parser is based on LuceneQParser, so you will have the same scores and results with eDisMax for this set.

Default Query Builder Example

2017-08-11_18h55_48.png

Multi-Field Query Builder Example

Note that the order of the documents is a bit different because of the different way of grouping and scoring the subqueries.

2017-08-11_18h44_31.png

To sum up,

  • The Default query builder uses only one boosting factor; all the others are derived from it. It doesn’t recognize quotes in the query. It uses multiMaxScore instead of the SOLR default LuceneQParser.
  • The Multi-Field query builder doesn’t use multiMaxScore. It recognizes quotes in the query. It supports exact match, phrase, fuzzy and wildcard boosts, fuzziness and sloppiness.
  • The DisMax query builder uses multiMaxScore and recognizes quotes in the query. It supports exact match, phrase, fuzzy and wildcard boosts, fuzziness and sloppiness.

 

4 comments

  2. Grzegorz Lebek

    One comment on that. DefaultFreeTextQueryBuilder is just an alias for a DisMaxFreeTextQueryBuilder so it behaves the same (e.g it’s possible to configure boosts in the backoffice for the default builder). This DefaultFreeTextQueryBuilder behavior (with hardcoded boosts) could be valid maybe for a legacy mode (though there was no query builders back then) but I am not entirely sure.

    1. Let’s double check. When I was experimenting, I clearly saw hardcoded boosts and different behavior with the default settings compared to the DisMax builder. The source code also shows that they have different logic, not a simple alias. Once I am back at my laptop, I will check it again.

  3. Ramy

    Hi, the list of query builders in the backoffice is just a list of beans, and by looking into the bean definitions we can see that both the DefaultFreeTextQueryBuilder and DisMaxFreeTextQueryBuilder beans point to the same implementation, which is DisMaxFreeTextQueryBuilder, so there is basically no difference in behavior between the two beans.
