Concept-aware search: automatic search facet discovery in SAP hybris
I would like to introduce my new PoC for automatic facet discovery. It sets up the facets based on the words the customer uses in the search query. For example, “blue armada jacket XXL” will show the products matching the keyword “jacket” with three facets set up automatically: color=blue, brand=armada and size=XXL. You can also find a video below demonstrating how it works on top of the hybris accelerators.
Faceted search is a critical feature for enhancing the user search experience and a vital part of any modern e-shop. From the user perspective, faceted search breaks up search results into multiple categories, showing counts for each, and allows the user to “drill down” or further restrict their search results based on those facets.
So it is clear that facets are extremely useful when working with large amounts of data: they improve findability, eliminate frustration, and provide a guided way to navigate, or drill down, in any order. Most importantly, facets provide relevant landing pages for long tail keywords, just as category-based navigation has done for search marketers for ages.
What people search
According to research by Baymard.com, there are 12 query types. Most of them are not well supported by search engines out of the box.
- Exact searches. Searching for specific products by title or model number. Example: Keurig K45.
- Product type searches. Searching for groups of whole categories of products. Example: Sandals.
- Symptom searches. Searching for products by querying for the problem they must solve in hopes of being presented with viable solutions and products to this problem. Examples: “stained rug” or “dry cough”.
- Non-product search. Searching for help pages, company information, and other non-product pages, such as the return policy or shipping information.
- Feature searches. Searching for products with specific attributes or features. Example: Waterproof cameras.
- Thematic searches. Searching for categories or concepts that are vague in nature or have “fuzzy” boundaries. Example: “Living room rug”.
- Relational searches. Searching for products by their affiliation with another object. Example: Movies starring Tom Hanks.
- Compatibility searches. Searching for products by their compatibility with another item. Example: Lenses for Nikon D7000.
- Subjective searches. Searching for products using non-objective qualifiers. Example: “High-quality kettles”.
- Slang, abbreviation, and symbol searches. Searching for products using various linguistic shortcuts. Example: Sleeping bag -10 deg.
- Implicit searches. Forgetting to include certain qualifiers in the search query due to one’s current frame of mind. Example: [Women’s] Pants.
- Natural language searches. Searching in full sentences rather than bundles of keywords. Example: Women’s shoes that are red and available in size 7.5.
Hybris supports most of these types, but the support is not smart: neither hybris nor SOLR can recognize the request and associate its keywords with concepts such as features or product categories. Hybris treats all keywords simply as keywords.
Most of the listed types can be covered by facets. Certainly, it may take a lot of tagging to make it work, but eventually you will improve the customer experience significantly. My PoC is a bridge between full-text search requests and facet search.
There are some well-known problems with facet navigation. Hybris displays the facets relevant to the user query, but the query itself may contain words that make a facet disappear. In the example above, “blue armada jacket XXL”, all four words are treated by the search engine as a free-text search request, and it displays the products that have all four words in their fields. However, some facets are stored internally in a different format, and you need to create duplicate fields to index their text representation. That is why hybris creates two fields, categoryName (“Armada”) and category (code 584, which internally means “Armada”).
The problem is that the results are not what the customer expects. “blue armada jacket XXL” displays all products having “blue”, “armada”, “jacket” and “XXL” in the name or description. That is why most e-shops put product properties into the title for findability.
So, in order to find all blue female jackets of size XL from the brand “Burton”, the customer should:
- perform a search using the free text query “blue female XL Burton jacket”
- scroll down to the “Colors” facet and click “blue”. If the list is long, the customer needs to click the “More” link first.
- scroll down to the “Brand” facet and click “Burton”. Again, if the list is long, the customer needs to click “More” first.
- scroll down to the “Size” facet and click “XL”. This list is normally not long.
- scroll down to the “Gender” facet and click “Female”. This list is normally not long either.
So the customer needs several clicks to get to the product they want. If the product has all the query words in the title, it may be displayed on the first page, but that really depends on the words. Some words are too generic, like colors or sizes, and they may be used in different contexts in the product description.
Google Shopping, by contrast, derives the facets from the query itself. If you type “Giro helmet below $50”, Google will set the facets accordingly.
For example, in the default setup of hybris Commerce, the request “blue female XL Burton jacket” leads to the following results:
Only one product is a jacket. None of the four products is blue. None of the models is for women. Only one product is from Burton (the first one).
My PoC shows the following when requested with the same query:
As you can see, all products are now jackets. All products are blue (I doubt that the third product is blue, but SAP tagged it as blue). All four are female jackets. And all products are of the Burton brand.
Let’s take the electronics store. The request is “fixed camera lenses from canon”:
None of the products is a camera lens, let alone a fixed lens. The results are two cameras and a monopod.
My PoC shows the following:
All three products are fixed lenses from Canon.
It can also recognize ranges. For example, the request “5 mp kodak camera” will display all Kodak cameras with 5 Mp by selecting a range 5-5.9 Mp:
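The range recognition can be pictured with a minimal sketch. It is not the PoC's actual code: the bucket boundaries, the `mp` unit pattern and the function name `find_range` are all assumptions made for illustration, and a real setup would take the buckets from the facet's range configuration.

```python
import re

# Hypothetical range buckets for a megapixel facet: (lower, upper), inclusive.
MEGAPIXEL_BUCKETS = [(5.0, 5.9), (6.0, 6.9), (7.0, 7.9), (8.0, 8.9)]

# Matches a number followed by the unit, e.g. "5 mp", "5.5mp", "5 Mp".
NUMBER_WITH_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*(mp)\b", re.IGNORECASE)


def find_range(query, buckets=MEGAPIXEL_BUCKETS):
    """Return (bucket, remaining_query), or (None, query) if nothing matches."""
    match = NUMBER_WITH_UNIT.search(query)
    if not match:
        return None, query
    value = float(match.group(1))
    for low, high in buckets:
        if low <= value <= high:
            # Cut the recognized "number + unit" out of the free-text part.
            rest = (query[:match.start()] + query[match.end():]).split()
            return (low, high), " ".join(rest)
    return None, query
```

With these assumed buckets, “5 mp kodak camera” selects the 5-5.9 bucket and leaves “kodak camera” for the brand and keyword matching.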
Should the query replacement be automatic?
In my PoC, it is automatic. However, for a real project, my recommendation is to conduct A/B testing to find out whether automatic facet discovery works for the particular business case. Product types, catalog size and customer profiles all factor into the right decision.
One example of a non-automatic approach is a one-click suggestion displayed next to the hybris OOTB search results:
Certainly, the design above is quick and dirty and the panel eats too much space in this form. If you want to go with this approach, the information needs to be compact and informative.
Technical details and architecture
The system analyzes the query and extracts facet information from the user input. Not every combination can be applied, though. For example, “Canon flash memory” can’t set up both “Brand=Canon” and “Category=Flash memory” because Canon doesn’t have any flash memory cards in the catalog. So the system has to decide what is more important for the customer: all Canon products or all flash memory products. In addition, the system could show all Canon flashes by ignoring the “memory” keyword, or the customer may want to see both Canon flashes and memory in the same list. So the decision is clearly a tough one for a computer, because it knows nothing about the customer’s real intent.
However, when the products having both attributes (a brand and category) are available, the customer intent is clear and the search facets can be configured automatically. For example, we have six Sony Flash Memory products available in the demo catalog, so they should be displayed as a result of “Sony Flash Memory”. The next screenshot shows the results for “Sony Flash memory 32Gb“.
So the system keeps all the facet values in memory and uses them to map keywords from the request to a specific facet. These facets are built automatically by SOLR from the documents uploaded by hybris, so the most convenient way to get these lists uniformly is to request them from SOLR, where all of them are stored. There is an OOTB request handler in SOLR called “terms” for that:
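As a rough sketch of what such a request could look like, the snippet below builds a URL for the SOLR terms handler and converts its JSON answer into plain dictionaries. The core URL and the field names are placeholders, not the names hybris actually generates. The parser assumes SOLR's default flat JSON layout, where each field maps to a list of alternating terms and counts (the layout changes with the `json.nl` parameter).

```python
from urllib.parse import urlencode


def build_terms_url(solr_core_url, facet_fields, limit=-1):
    """Build a terms request URL for the given index fields.

    A negative terms.limit asks SOLR to return all terms of a field.
    """
    params = [("wt", "json"), ("terms.limit", str(limit))]
    # terms.fl may be repeated, once per field.
    params += [("terms.fl", field) for field in facet_fields]
    return solr_core_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_terms_response(response):
    """Turn the flat [term, count, term, count, ...] lists into dicts."""
    result = {}
    for field, flat in response.get("terms", {}).items():
        result[field] = dict(zip(flat[0::2], flat[1::2]))
    return result


# Placeholder core URL and field names, for illustration only.
url = build_terms_url("http://localhost:8983/solr/products", ["brand", "color"])
```

The resulting dictionaries (facet field to value to document count) are the in-memory lists that the keywords of the request get matched against.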
There is one drawback: it works nicely only with KeywordTokenizer (which keeps the words of multi-word facet values together) and without stemming filters (which would otherwise reduce words to their root or base form, the stem). However, in the SOLR configuration you can create copies of the original facet fields that have no stemming filters or tokenizers. The simplest approach is to change the type of these fields from “text” to “string” in the hybris configuration, although this slightly affects full-text search.
Which facets do we need to process? There are two options: all facets, or only those returned by the original request. I used the second approach.
For example, the request “Cheap blue XXL jacket” shows the following facets:
- AvailableInStores (50)
- Price (7)
- Colors (11)
- Size (30)
- Gender (2)
- Collection (17)
- Category (49)
- Brand (44)
The system recognizes “Blue” as a color, because it is one of the values of the facet “Colors”, and “XXL” as a size. The list of scanned facets can be reduced by configuration. The facets themselves are taken from the response to the full-text request “Cheap blue XXL jacket”.
So, we map the words from the request with the facets from the SOLR response:
The remaining words are categorized as special (“cheap”), stopwords (“from”, “for” etc) or free text search keywords (“jacket”).
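The mapping step can be sketched as follows. This is an illustration under assumptions rather than the PoC's code: the stopword and special-word lists are made up, facet values are compared by exact lower-cased match, and multi-word values are tried before single words so that a two-word category is not split.

```python
STOPWORDS = {"from", "for", "of", "the", "a"}  # assumed list
SPECIAL_WORDS = {"cheap"}                      # assumed list


def classify_query(query, facet_values):
    """Split a query into facet selections, special words and keywords.

    facet_values maps a facet name to the set of its lower-cased values.
    """
    words = query.lower().split()
    # Reverse index: tuple of words -> facet name (last facet wins on a clash).
    index = {tuple(value.split()): facet
             for facet, values in facet_values.items() for value in values}
    longest = max((len(key) for key in index), default=1)
    facets, special, keywords = {}, [], []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for size in range(min(longest, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + size])
            if phrase in index:
                facets[index[phrase]] = " ".join(phrase)
                i += size
                break
        else:
            # No facet value starts here: special word, stopword or keyword.
            word = words[i]
            if word in SPECIAL_WORDS:
                special.append(word)
            elif word not in STOPWORDS:
                keywords.append(word)
            i += 1
    return facets, special, keywords
```

For “Cheap blue XXL jacket”, with “blue” among the values of Colors and “xxl” among the values of Size, this yields the two facet selections, “cheap” as a special word and “jacket” as the remaining keyword.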
You can also use a list of synonyms or language processing to find the proper facet values. For example, “jacket” could be recognized as the category “Jackets”. In my PoC only exact matches work, so “jacket” is a plain keyword, but “Snow Jackets” is recognized as a category. “Snow jacket” (singular) is recognized as “keyword=jacket, category=Snow” because there is a category named “Snow”. This case may be confusing, because the category “Snow” may contain something other than jackets. If it contains no jackets at all, my PoC won’t use “Snow” as a category, and it will show the results for “snow jackets” as OOTB hybris does. Either way, this should be tested well with real data to avoid any issues.
What if a facet value contains words that are also values of other facets? For example, you have a brand “Red Hat” and a color “red”, and the customer request is “big red hat”. So there are two options:
- Brand = Red Hat, keywords = big
- Color = Red, category = “Hats” (using Hat=Hats from the synonyms), keywords = big
In my PoC, the system counts the results for both options. If one of them has zero results and the other has some, the system goes with the other one. If both options have results, the system goes with the last one (which is in fact arbitrary). If both have zero results, the system shows all products having all three keywords in the full-text fields (the OOTB behavior). There are other ways to implement this business logic. For example, the system can ask the customer, “Are you looking for the brand “Red Hat” or the category “Hats”?”
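The selection rule described above fits in a few lines. In this sketch `count_results` is a hypothetical callable standing in for whatever counts the products of one interpretation (for instance, a SOLR query with zero rows requested). It is not a hybris API.

```python
def choose_interpretation(candidates, count_results):
    """Pick a facet interpretation by the number of products it returns.

    candidates: list of alternative interpretations of the same query.
    count_results: hypothetical callable returning the product count for one.
    Returns the chosen interpretation, or None when every candidate is
    empty and the caller should fall back to plain full-text search.
    """
    non_empty = [c for c in candidates if count_results(c) > 0]
    if not non_empty:
        return None
    # With several non-empty candidates the last one wins, which is arbitrary.
    return non_empty[-1]
```

Replacing the last line with a prompt to the customer would give the “Are you looking for...?” variant.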
For example, in the hybris OOTB demo apparel store, there is a brand “RED” and a color “red”. If you search for “red” in my PoC, you will get products of the RED brand. However, if you search for “red shirts”, you will get “Color=red; Category=Shirts” (7 products, all of them red shirts), because “Category=Shirts; Category=RED” has zero results.
This solution supports dynamic facets in hybris, which means that you don’t need to reconfigure it when you add, remove or change facets. However, some facet types need specific corrections. For example, for the facet “Size”, the customer may type “L size” instead of “L”. The word “size” is redundant in the search request and needs to be removed. The word may also come before the value (“size L”), and it needs to be removed in that case too. In any other context, however, “size” is valuable for the search and should be treated as a keyword.
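A minimal sketch of the “size” correction, assuming the query is already split into words and the values of the Size facet are known. The function name and the adjacency rule (the word directly before or after a size value) are my reading of the description above, not the PoC's code.

```python
def strip_size_qualifier(words, size_values):
    """Drop the word "size" when it directly precedes or follows a size value."""
    sizes = {value.lower() for value in size_values}
    result = []
    for i, word in enumerate(words):
        if word.lower() == "size":
            before = i > 0 and words[i - 1].lower() in sizes
            after = i + 1 < len(words) and words[i + 1].lower() in sizes
            if before or after:
                continue  # redundant qualifier, e.g. "L size" or "size L"
        result.append(word)
    return result
```

In a query such as “size chart”, no size value is adjacent, so the word stays in the list as a regular keyword.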
“with” and “without”, “or” and “and” are also special words, and my PoC deliberately doesn’t support them. Using natural language requires a different approach to understanding the query. I am experimenting with OpenNLP these days and I hope to come back with a PoC soon 🙂
© Rauf Aliev, June 2017