«More Like This» in hybris Solr search

Category: Other Author: Rauf ALIEV 5 February 2017

The MoreLikeThis (MLT) search component enables users to query for documents similar to a document in their result list. Solr has had an MLT module since version 1.3, but hybris doesn’t use it at all. I was wondering whether it is possible to leverage MLT in hybris and how good the results would be. I managed to create a working prototype, but the apparent simplicity of the MLT integration obscures some unexpected challenges and problems. I haven’t yet found what I would call the best solution for the task, but some findings and preliminary results are worth sharing with the community to help others who go this way.

The MoreLikeThis component fetches the products with similar term vectors. A term vector is a data structure that holds a list of all the words that occur in a field and the number of times each word is used (excluding words that are considered “stop words”). Solr loops over the fields and retrieves the term vectors for each field of the document being analyzed. For each term, Solr finds the field that contains the most instances of that term and then calculates a score. The module then selects the top K terms with the highest scores and forms a disjunctive query from them. Simply put, it displays the products that have the closest set of top words.
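To make the term-selection step more concrete, here is a minimal sketch in plain Java. It is not Solr's actual implementation: the class and method names are hypothetical, the tokenizer is a naive split with no stemming, and the score is a simplified TF-IDF (term frequency multiplied by `ln(numDocs / (docFreq + 1)) + 1`) chosen only for illustration. The real module also applies thresholds such as the minimum term and document frequency.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of "term vector -> top K terms"; not Solr code.
final class MltSketch {

    /** Builds a term vector: each non-stop word mapped to its number of occurrences. */
    static Map<String, Integer> termVector(String text, Set<String> stopWords) {
        Map<String, Integer> vector = new HashMap<>();
        // Naive tokenizer (assumption): lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    /**
     * Returns the k highest-scoring terms of the vector.
     * docFreq maps a term to the number of indexed documents containing it.
     * Ties are broken alphabetically so the result is deterministic.
     */
    static List<String> topTerms(Map<String, Integer> vector,
                                 Map<String, Integer> docFreq,
                                 int numDocs,
                                 int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            int df = docFreq.getOrDefault(entry.getKey(), 0);
            // Simplified inverse document frequency (assumption, see lead-in).
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            scores.put(entry.getKey(), entry.getValue() * idf);
        }
        return scores.keySet().stream()
                .sorted(Comparator.comparingDouble((String term) -> -scores.get(term))
                        .thenComparing(Comparator.<String>naturalOrder()))
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Terms that are frequent in the product but rare in the index end up at the top of the list, which is why model numbers and specific nouns tend to dominate the generated query.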

For example, for the product “CAMERA TAPE DIGITAL 90MIN 2PK” the system creates a request:

fulltext_en: Any of these terms:
- 500, 825, 860, camcord, cld, clean, comput, digit, digital8, ideal, line, loss, make, min, mode, perfect, possibl, record, resolut, tape, transfer, v825cld, videotap
category_string_mv: Any of these terms:
- 585
- 604

These words are the most frequently used terms in the indexed product attributes. Let’s try to see similar products for DIGITAL CAMERA TRIPOD:

The right window shows the results for fulltext_en = (1.35, 135, 20.9, 209, adjust, aluminum, anod, attach, bubbl, camera, easi, feet, fold, head, leg, lock, nylon, plate, read, rubber, run, skid, stand, tall, tripod).

Challenges

Fetching recommendations

First, hybris uses Solr 6.1 and the SolrJ library for 6.1. In this version of the library, MoreLikeThis is not supported. This means that the library is not able to parse that part of the Solr response, and hybris ignores the moreLikeThis section completely. hybris is not tested with SolrJ versions above 6.1, so it is risky to replace the library with version 6.3, where these methods are supported. However, if you are able to test it thoroughly, it might be a solution. In my PoC, I use the existing capabilities of the library to read the part of the Solr response that is not supported natively. Specifically, I used SolrSearchResult.getSolrObject().getResponse().get("moreLikeThis") to access the list of recommendations. To set up the request parameters, I used

searchQuery.addRawParam("mlt", "true");
searchQuery.addRawParam("mlt.fl", "fulltext_en,catalogVersion,category_string_mv");
searchQuery.addRawParam("mlt.count", "10"); //count of documents in the response
searchQuery.addRawParam("mlt.mindf", "2"); //min document frequency
searchQuery.addRawParam("mlt.mintf", "1"); //min term frequency

Facets

This module is poorly documented. It seems that faceting used to work, but in recent versions of Solr facets no longer work with the results provided by the module. Some sources say that faceting should work if the module is used as a request handler, but at least with the Solr 6.1 shipped with hybris this does not seem to be true. However, you can parse the request generated by MoreLikeThis and execute it via the regular Solr select method. In this case facets are supported, because it is the normal way of fetching data.
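As a rough illustration of this workaround, the sketch below turns a list of MLT terms into an ordinary select request with faceting switched on. It only builds the URL. The class name, the core URL and the choice of facet field are assumptions made for the example, and how the terms are extracted from the MoreLikeThis request is left out. In a real hybris project the query would more likely go through the search API than through a hand-built URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: rebuilds the MLT query as a plain /select request with facets.
final class MltSelectUrl {

    /**
     * coreUrl     base URL of the Solr core (assumed, e.g. http://localhost:8983/solr/products)
     * field       field the MLT terms were taken from, e.g. fulltext_en
     * terms       terms selected by MoreLikeThis
     * facetFields fields to facet on
     */
    static String build(String coreUrl, String field, List<String> terms, List<String> facetFields) {
        StringJoiner params = new StringJoiner("&");
        // Disjunctive query: a document matches if it contains any of the terms.
        params.add(param("q", field + ":(" + String.join(" OR ", terms) + ")"));
        params.add(param("facet", "true"));
        for (String facetField : facetFields) {
            params.add(param("facet.field", facetField));
        }
        return coreUrl + "/select?" + params;
    }

    private static String param(String name, String value) {
        try {
            return name + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the result is a regular search request, facets, sorting and paging behave as they do for any other query. The trade-off is a second round trip to Solr.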

Catalog versions

Catalog versions are not supported by MoreLikeThis, because it knows nothing about hybris 🙂 So the recommendations contain results from both catalog versions. The solution is to filter them before displaying. For the simplest case, with two catalog versions, Online and Staged, it is not rocket science.
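A minimal sketch of such a filter is shown below. It assumes each recommended document is available as a map of field names to values (SolrJ's SolrDocument can be used this way, as far as I know) and that the catalog version is stored in a field named catalogVersion, as in the mlt.fl parameter above. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical post-filter for MLT recommendations; not part of hybris.
final class CatalogVersionFilter {

    /** Keeps only the documents whose catalogVersion field equals the wanted version. */
    static <D extends Map<String, Object>> List<D> keepVersion(List<D> docs, String wantedVersion) {
        return docs.stream()
                // Documents without the field are dropped as well.
                .filter(doc -> wantedVersion.equals(doc.get("catalogVersion")))
                .collect(Collectors.toList());
    }
}
```

Calling keepVersion(recommendations, "Online") would leave only the Online products. Since filtering happens after Solr has already limited the list to mlt.count entries, fewer recommendations than requested may remain, so it can make sense to ask for more documents than you intend to show.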

Field types

The fields on which to perform MLT must be indexed and of type string. MLT is not designed to work with double values (“similar prices”).

Accuracy issues

Sometimes the algorithm shows some products as similar to the selected one, but from the customer’s perspective the proposed items have nothing in common with the original product. In the video below I demonstrate this case. The smaller the product set, the more likely you are to face this situation. The fewer words are used to describe the products, the lower the accuracy will be.