PredictionIO: Machine Learning in e-commerce. Complementary products based on the order history

Nowadays, e-commerce companies need to analyze a large volume and a great variety of data on products, customers, transactions, and deliveries to increase their conversion rates.

I integrated hybris with the PredictionIO machine learning server. PredictionIO algorithms can be used in e-commerce in several ways; this article demonstrates the complementary categories functionality.


There are several approaches to integrating hybris with machine learning (ML) software:

  • Create the ML algorithm from scratch. This approach is extremely difficult and makes sense only for unique tasks. You need a team of experienced specialists.
  • Use ML libraries. Libraries such as Weka and Apache Mahout help developers create smart systems from ready-to-use, well-documented, and well-tested modules without going too deep into the details. They contain tools for data pre-processing, classification, regression, clustering, association rules, and visualization. However, these tools are still aimed at experienced algorithm developers. MLlib, Apache’s own machine learning library for Spark and Hadoop, offers a range of common algorithms and useful data types, designed to run at speed and scale.
  • Use cloud services. Google Prediction API, Microsoft Azure Machine Learning API, and Amazon Machine Learning API take care of the algorithms and provide you with clear interfaces for standard tasks.
  • Use on-premise machine learning frameworks/servers. Microsoft introduced Azure Stack, which lets you deliver Azure services from your own servers. PredictionIO is also a machine learning platform that can be deployed in your own data center. It offers a number of recipes, ready-made algorithms, and examples of their usage.


PredictionIO is built around the following concepts:

  • Items. Unique IDs of products or services.
  • Users. Unique IDs of the subject entities. A user is connected to Items through Actions in a one-to-many relationship.
  • Actions. The feedback each user provides to aid the prediction: conversion, like, dislike, view. Actions are sent to the PredictionIO API as events.
  • Algorithms. The actual computation code that generates prediction models. PredictionIO currently offers native support for the Spark MLlib machine learning library. Most of the algorithms are described as either user-based or item-based.
  • Engines. Containers for algorithms and data. There are three containers to choose from: ranking, recommendation, and similar items. You can adjust the freshness and the exploration of the predictions, as well as the objective your predictions are going to maximize.
  • Templates. The template library is a set of pluggable extensions for specific machine learning tasks.


Hybris interacts with PredictionIO in two ways:

  • Pushing events to PredictionIO
  • Requesting recommendations

There is a clear and simple RESTful interface for both events and queries.


Template library

Templates are ready-made PredictionIO configurations for specific tasks. Each template contains custom classes and prebuilt JSON files.

  • Classification. The default use case of the Classification Engine Template is to predict the service plan a user will subscribe to, based on user attributes (such as age, gender, etc.). Requires “user” and “item” entities that are set by events.
  • Complementary Purchase. Recommends the complementary items that most users frequently buy together with one or more items in the query. Returns an array of the top n recommended items for the given condition. The engine uses each combination of the query items as a condition. The PoC (see below) demonstrates this template.
  • ECommerce Recommendation. Provides personalized recommendations for e-commerce applications with the following features by default: 1) excludes out-of-stock items; 2) provides recommendations to new users who sign up after the model is trained; 3) recommends unseen items only (configurable); 4) recommends popular items if no information about the user is available. Whitelists and blacklists of items and categories are optional input parameters. Returns a ranked list of recommended item IDs.
  • Lead Scoring. Predicts the probability that a user will convert into a customer in the current session. The input data are logs: landing page ID, referrer ID, and browser (extendable). Returns a score.
  • Similar Product. Recommends products that are “similar” to the input product(s). Similarity is not defined by user or item attributes but by users’ previous actions. By default, it uses the ‘view’ action, so that products A and B are considered similar if most users who view A also view B. The template can be customized to support other action types such as buy, rate, like, etc. This template is ideal for recommending products to customers based on their recent actions. Using the IDs of a customer’s recently viewed products as the query, the engine predicts other products that this customer may also like. This approach works well for customers who are first-time visitors or have not signed in. Recommendations are made dynamically in real time, based on the most recent product preferences you provide in the query. You can, therefore, recommend products to visitors without knowing a long history about them. You can also use this template to quickly build functionality such as “Customers Who Viewed This Item Also Viewed…”.
  • Product Ranking. Sorts a list of products for a user based on their preferences. This is ideal for personalizing the display order of a product page, catalog, or menu items if you have a large number of options. It creates engagement and early conversion by placing the products a user prefers at the top. The input query contains a user ID and a list of item IDs, which are the products to be ranked. Returns a ranked list of recommended item IDs.
  • Other templates. The full list is available here.

One of the earliest and most successful recommender technologies is collaborative filtering, also referred to as social filtering. Most of the listed templates use this approach. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future. In different systems, “evaluation” can mean different things: likes, ratings, purchases, reviews.

Integration with hybris

PredictionIO is a separate system with clear and simple interfaces based on REST. There are two interfaces: for events and for requests.

In order to update a prediction model, you need to execute a training operation.

Proof of concept

I have access to a large amount of real e-commerce data:

  • 200,000 products,
  • 700 categories,
  • 400,000 customers,
  • 600,000 orders,
  • 1,500,000 order entries

There are a number of things you can do with PredictionIO on this data. I tested several PredictionIO templates with it. Let’s take the Complementary Purchase template.

Complementary products are ones that are used alongside each other, such as fish and chips or shampoo and conditioner. There are two ways to find these products:

  • Content-based filtering,
  • Collaborative filtering.

Item-to-item collaborative filtering reflects the ‘wisdom of crowds’. This approach is more universal and generic than content-based filtering, where the system matches the attributes of the viewed item with the attributes of other items to generate recommendations.

We have too many products in the set compared with the number of order entries. The knowledge base will probably not be sufficient to make good predictions.

However, we have a relatively small set of categories. What if we replace the product ID with the category ID in the order entries?

I removed all orders that have only one entry. The resulting set turned out to be much smaller: about 170,000 entries. However, it is large enough to be associated with the 700-item category set.
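
The preparation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the script I used for the PoC: the `(order_id, product_id)` tuple layout, the `product_to_category` mapping, and the function name are assumptions. It also assumes that an order counts as a single entry when all of its products collapse into one category.

```python
from collections import defaultdict


def to_category_baskets(order_entries, product_to_category):
    """Map (order_id, product_id) pairs to per-order sets of category IDs.

    Orders left with fewer than two distinct categories are dropped,
    because they carry no co-purchase information.
    """
    baskets = defaultdict(set)
    for order_id, product_id in order_entries:
        category_id = product_to_category.get(product_id)
        if category_id is not None:  # skip products without a known category
            baskets[order_id].add(category_id)
    return {order: cats for order, cats in baskets.items() if len(cats) >= 2}


# Hypothetical sample data, for illustration only.
entries = [("o1", "p1"), ("o1", "p2"), ("o2", "p3"), ("o3", "p1"), ("o3", "p4")]
categories = {"p1": "10001", "p2": "10002", "p3": "22", "p4": "10001"}
result = to_category_baskets(entries, categories)
```

In this toy data only order `o1` survives: `o2` has a single entry, and both products of `o3` belong to the same category.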

First, we need to create an app:

[root@fedoralocal bin]# ./pio app new MyComplimentaryApp
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$] Name: MyComplimentaryApp
[INFO] [App$] ID: 1
[INFO] [App$] Access Key: kys6151n5jAjbpPmJEw2V4qO7ftkABnz41B8noWzFHMcLbus_BRKsQj6MVJTFQ2K 

Let’s check if the newly created app is in the list:

[root@fedoralocal bin]# ./pio app list
[INFO] [App$] Name | ID | Access Key | Allowed Event(s)
[INFO] [App$] MyComplimentaryApp | 1 | kys6151n5jAjbpPmJEw2V4qO7ftkABnz41B8noWzFHMcLbus_BRKsQj6MVJTFQ2K | (all)
[INFO] [App$] Finished listing 1 app(s).

Let’s generate an App from the template:

[root@fedoralocal bin]# ./pio template get PredictionIO/template-scala-parallel-complementarypurchase MyComplimentaryApp
Please enter author's name: Rauf
Please enter the template's Scala package name (e.g. com.mycompany): com.epam
Please enter author's e-mail address: 
Author's name: Rauf
Author's e-mail:
Author's organization: com.epam
Would you like to be informed about new bug fixes and security updates of this template? (Y/n) n
Retrieving PredictionIO/template-scala-parallel-complementarypurchase
There are 6 tags
Using tag v0.3.3
Going to download
Redirecting to
Replacing org.template.complementarypurchase with com.epam...
Processing MyComplimentaryApp/build.sbt...
Processing MyComplimentaryApp/engine.json...
Processing MyComplimentaryApp/src/main/scala/Algorithm.scala...
Processing MyComplimentaryApp/src/main/scala/DataSource.scala...
Processing MyComplimentaryApp/src/main/scala/Engine.scala...
Processing MyComplimentaryApp/src/main/scala/Preparator.scala...
Processing MyComplimentaryApp/src/main/scala/Serving.scala...
Engine template PredictionIO/template-scala-parallel-complementarypurchase is now ready at MyComplimentaryApp

Change the name of the app in the generated engine.json:

{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.epam.ComplementaryPurchaseEngine",
  "datasource": {
    "params": {
      "appName": "MyComplimentaryApp"
    }
  },
  "algorithms": [
    {
      "name": "algo",
      "params": {
        "basketWindow": 120,
        "maxRuleLength": 2,
        "minSupport": 0.001,
        "minConfidence": 0.1,
        "minLift": 1.0,
        "minBasketSize": 2,
        "maxNumRulesPerCond": 5
      }
    }
  ]
}
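
The `minSupport`, `minConfidence`, and `minLift` thresholds refer to the standard association-rule measures. The sketch below computes them for a single rule over a toy set of baskets. It illustrates the textbook definitions only and is not the template's own code; the basket data and the function name are made up for the example.

```python
def rule_metrics(baskets, cond, item):
    """Return (support, confidence, lift) for the rule cond -> item.

    Assumes that both cond and item occur in at least one basket.
    """
    n = len(baskets)
    n_cond = sum(1 for b in baskets if cond in b)
    n_item = sum(1 for b in baskets if item in b)
    n_both = sum(1 for b in baskets if cond in b and item in b)
    support = n_both / n              # share of baskets containing both
    confidence = n_both / n_cond      # share of cond baskets that also contain item
    lift = confidence / (n_item / n)  # above 1: together more often than by chance
    return support, confidence, lift


# Hypothetical baskets, for illustration only.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
]
support, confidence, lift = rule_metrics(baskets, "A", "B")
```

For this toy data, the rule A -> B has a lift below 1, so, if the thresholds are read as lower bounds, a `minLift` of 1.0 would discard it even though its support and confidence are high.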

Build it:

[root@fedoralocal bin]# cd MyComplimentaryApp
[root@fedoralocal MyComplimentaryApp]# ./pio build 

I slightly changed the importing script (in the data folder) to process the data file:

[root@fedoralocal bin]# time -p python --access_key kys6151n5jAjbpPmJEw2V4qO7ftkABnz41B8noWzFHMcLbus_BRKsQj6MVJTFQ2K --file /home/osboxes/recomm/order-product.csv 
Importing data...
173899 events are imported.
real 1458.95
user 92.27
sys 17.65

The script basically pushes the data to the Event Server using the Event Server API.
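
As a rough idea of what such a push looks like, here is a minimal sketch of building and sending one event. It is not the import script itself. The Event Server address (`localhost:7070`), the sample IDs, and the timestamp are assumptions, and so is the mapping in which the order number plays the role of the user and the category ID plays the role of the item.

```python
import json
import urllib.request

# Assumed address: an Event Server running locally on port 7070.
EVENT_SERVER_URL = "http://localhost:7070/events.json"


def build_buy_event(order_id, category_id, event_time):
    """Build one 'buy' event: the order acts as the user, the category as the item."""
    return {
        "event": "buy",
        "entityType": "user",
        "entityId": str(order_id),
        "targetEntityType": "item",
        "targetEntityId": str(category_id),
        "eventTime": event_time,
    }


def send_event(event, access_key):
    """POST a single event to the Event Server (defined here, but not called)."""
    request = urllib.request.Request(
        f"{EVENT_SERVER_URL}?accessKey={access_key}",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical order number, category ID, and timestamp.
event = build_buy_event(500123, 10001, "2016-01-01T12:00:00.000Z")
```

As far as I understand the template, events of the same user that fall within `basketWindow` seconds are treated as one basket, so all entries of one order should carry timestamps close to each other.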

The data we imported looks like this:

[root@fedoralocal rauf]# cat ~osboxes/recomm/order-product.csv | head

The first column is the order number, and the second column is the category ID.

The next command trains a model using data we imported earlier:

[root@fedoralocal MyComplimentaryApp]# ./pio train

[root@fedoralocal MyComplimentaryApp]# ./pio deploy

The last command creates a prediction server that listens on port 8000 by default. This server uses the prediction model built by pio train.

Once the system is trained and launched, we can request recommended categories for, let’s say, #10001 (Semper Purée, a brand-level category):

[root@fedoralocal osboxes]# curl -k -X POST \ \ 
-H "Content-type: application/json" \
-d "{ \"items\" : [\"10001\"], \"num\" : 3 }"
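
The same query can be sent from code. The sketch below builds the request body with the fields used in the curl call; the engine URL (`localhost:8000/queries.json`) and the function names are assumptions, and the function that actually sends the request is defined but not called.

```python
import json
import urllib.request

# Assumed address: the engine deployed locally on the default port 8000.
ENGINE_URL = "http://localhost:8000/queries.json"


def build_query(category_ids, num):
    """Build the query body with the same fields as the curl call."""
    return {"items": [str(c) for c in category_ids], "num": num}


def send_query(query, url=ENGINE_URL):
    """POST the query and return the decoded JSON response (not called here)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


query = build_query([10001], 3)
```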

The formatted form of the response:


For the sake of convenience, I created a simple bash script that resolves the category names and displays the recommendations in the compact form:

[root@fedoralocal osboxes]# ./ 10001 
recommendations for Semper Purée
 - confidence=0.5107913669064749
*Oatmeal porridge
 - confidence=0.2517985611510791
 - confidence=0.15827338129496402
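
A rough Python equivalent of such a formatting helper is sketched below. It is not the bash script itself. The response shape (`rules`, `cond`, `itemScores`) is my assumption about what the template returns, and the sample response, its scores, and the category name table are hypothetical.

```python
def format_recommendations(category_id, response, names):
    """Render an engine response as compact text lines.

    Category IDs missing from the name table are printed as-is.
    """
    lines = [f"recommendations for {names.get(category_id, category_id)}"]
    for rule in response.get("rules", []):
        for score in rule.get("itemScores", []):
            name = names.get(score["item"], score["item"])
            lines.append(f"*{name} - confidence={score['confidence']}")
    return lines


# Hypothetical response and category names, for illustration only.
sample_response = {
    "rules": [
        {
            "cond": ["10001"],
            "itemScores": [
                {"item": "10002", "support": 0.002, "confidence": 0.5, "lift": 3.0},
                {"item": "10003", "support": 0.001, "confidence": 0.25, "lift": 2.0},
            ],
        }
    ]
}
names = {"10001": "Semper Purée", "10002": "Oatmeal porridge"}
lines = format_recommendations("10001", sample_response, names)
```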

Let’s try to request recommendations for “Jeans / Boys” (#22):

[root@fedoralocal osboxes]# ./ 22
recommendations for jeans_boys
- confidence=0.3023255813953488


