Data migration, a fundamental part of legacy system modernization projects, is a challenging task that can cause the whole project to fail. One of the main reasons for time and budget overruns is the lack of a well-defined methodology to manage the complexity of data migration tasks.
The data migration process is tightly linked to the project scope, which is why there is no universal tool that fits every situation. However, some ideas, concepts, architectures, and even components can be reused to make the process smoother and more predictable.
In this article I share my own experience and my personal vision of the topic. I have been refining this approach in various projects I was involved in over the last 15 years. My teams and I have implemented data migration tools in different programming languages, from Perl to Java.
From a business process perspective, there are four sequential phases:
The Inventory phase defines:
- What data is moving,
- What data we can get rid of,
- How it can be grouped,
- What data requires special handling,
- What data requires changes,
- How volatile the data is, etc.
Since we are talking about e-commerce, you are likely going to migrate at least the following data:
The Mapping phase is focused on the following topics:
- How is data going to “fit” and work in hybris?
- How will the overall structure of the data carry over?
The following points should be considered if you deal with CMS/rich text content:
Migrating rich text content is one of the most challenging topics; in some cases, a manual migration can be a good alternative.
The Transfer phase is about how the actual bytes will be moved from the legacy systems to hybris. The rest of this article is devoted to shedding light on this phase.
The QA phase consists of two sub-phases:
- Internal technical QA
- Client review, editorial QA
You need to define how much data is going to be reviewed for compliance. It might be a representative sample or the whole data set.
Let’s take a closer look at the Transfer phase.
Data transfer phase
Data migration moves data from legacy systems into new systems. Data sources in the new system can have different structures and limitations. There are several issues that can considerably complicate this process:
- Legacy systems may have a number of heterogeneous data sources. These data sources are interconnected but were designed by different teams, at different times, using different data modeling tools, or interpreted under different semantics. In addition, they might be used differently than originally intended.
- Legacy systems may have inaccurate, incomplete, duplicate, or inconsistent data, and new systems may impose additional semantic constraints on the migrated data. Bringing the quality of the data up to the standard of the new system can be costly and time-consuming.
- Many data migration tasks, such as data profiling, discovery, validation, and cleansing, need to be executed iteratively, and specification changes happen frequently in order to repair detected problems.
From a traditional perspective, the process of data migration involves three stages: Extract, Transform and Load.
- Extract – the process of reading data from a specified source and extracting a desired subset of data. The extraction phase largely deals with the technical heterogeneity of the different sources and imports relevant data into a staging area.
- Transform – the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. The transformation phase is the heart of an ETL process. Here, syntactical and semantical heterogeneities are overcome using specialized components, often called stages:
- All data is brought into a common data model and schema;
- Data scrubbing and cleansing techniques standardize the data;
- Data sets are aggregated or combined;
- Duplicate detection algorithms remove duplicates; etc.
- Load. This phase loads the integrated, consolidated, and cleaned data from the staging area into hybris.
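To make the three stages concrete, here is a minimal sketch of the flow in Java. The record format and the two transformation stages are invented for illustration; they are not part of any real tool:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A minimal sketch of the Extract-Transform-Load flow described above.
public class EtlSketch {

    // Extract: read raw rows from a source (here, a hard-coded sample
    // standing in for a real file or database read).
    static List<String> extract() {
        List<String> rows = new ArrayList<>();
        rows.add(" shirt ;10");
        rows.add(" mug ;5");
        return rows;
    }

    // Transform: a chain of small "stages", each doing one thing
    // (trimming, normalization, etc.), applied in the configured order.
    static List<String> transform(List<String> rows, List<Function<String, String>> stages) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            for (Function<String, String> stage : stages) {
                row = stage.apply(row);
            }
            out.add(row);
        }
        return out;
    }

    // Load: in a real project this would write to the staging database;
    // here we just return the final rows.
    static List<String> load(List<String> rows) {
        return rows;
    }

    public static void main(String[] args) {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(String::trim);              // cleansing stage
        stages.add(s -> s.replace(" ;", ";")); // normalization stage
        System.out.println(load(transform(extract(), stages))); // [shirt;10, mug;5]
    }
}
```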
Off-the-shelf solutions
There are a number of off-the-shelf products. I am going to say more about one of them, Pentaho Community Edition, but not in this article. Using Pentaho with hybris is a large topic that deserves a separate article.
As for off-the-shelf products in general, the majority of them are too heavy and expensive for hybris projects. Sometimes the barriers to entry are simply too high for this market. However, for large projects, an off-the-shelf ETL tool might be the best option.
I’ll explain the details in a separate article on the blog (next week?). It is already in preparation and will be published soon.
The approach explained in this article works well for small and medium projects. If the transformation rules are complex, building the data migration tool from scratch may work much better than using off-the-shelf software.
These three modules cover the Extract, Transform, and Load phases of the process.
Each module is configurable and consists of module components:
By default, all components are executed one after another in the configured order, but individual components can be skipped. This makes the system scalable and manageable. Components are generally custom; however, they share some common code that is part of the module.
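The execution model can be sketched roughly like this. The component names and the runner API are hypothetical, invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the component-runner idea: components run in the configured
// order, and individual components can be skipped via configuration.
public class ModuleRunner {

    interface Component {
        String name();
        void run();
    }

    // Runs every configured component unless it is in the disabled set;
    // returns the names actually executed, in order.
    static List<String> run(List<Component> configured, Set<String> disabled) {
        List<String> executed = new ArrayList<>();
        for (Component c : configured) {
            if (disabled.contains(c.name())) {
                continue; // component skipped by configuration
            }
            c.run();
            executed.add(c.name());
        }
        return executed;
    }

    static Component component(String name) {
        return new Component() {
            public String name() { return name; }
            public void run() { /* real extract/transform/load work goes here */ }
        };
    }

    public static void main(String[] args) {
        List<Component> configured = List.of(
            component("categories"), component("products"), component("media"));
        // Equivalent to "-enable all -disable media" on the command line.
        System.out.println(run(configured, Set.of("media"))); // [categories, products]
    }
}
```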
A database is used as a staging area for the loaded data.
All modules support rerunnability: if the process fails halfway through the batch, the module can be run again with updated input data sets or configuration.
During the extract phase of ETL, the desired data sources are identified and the data is extracted from them.
In my solution, the data loader converts input text files (XML) into database create/insert statements (SQL). It looks like a very basic solution, but my experience shows that for small and medium projects it is good enough.
Input files. As a rule, there is a milestone in the project when the input files (data extracts) are provided to the developer according to the agreed specification (formats, data structures, etc.). It is virtually impossible to get the right data on the first try, so you need to take into consideration that this phase will be executed iteratively.
SQL statements are created in a folder called output. As a final step, they are executed before advancing to the next phase (data transformation). The purpose of splitting the data load process into “generating SQL statements” and “executing SQL statements” is twofold: it makes debugging easier and allows partial runs.
The following data formats are commonly used for the input files:
- XML files. This is the most convenient format for the data loader. XML schemas help constrain the content and prevent mistakes in the data.
- CSV files. I recommend using CSV only for very large amounts of data with a very simple structure. CSV files are good only when your data is strictly tabular and you are absolutely sure of its structure and that it won’t change.
- ZIP files. For binary content referenced from an XML: images, digital assets, documents.
If possible, try to collect all input data as deliverable packages. Files are stateless objects, and this point is very important for a proper sharing of responsibility.
Output files are SQL statements. They include three sections:
- Remove tables
- Create tables
- Insert data
In my solution, the data loader works with configurable data loader components that can be executed separately or in groups. For example, in my last project the following components were used:
Each component should do only one thing (the single responsibility principle): no data transformation, no merging or splitting. The module is basically a text format converter: it reads one text file (XML, CSV) and creates another text file (SQL). ZIPs are loaded by copying them into a folder used by the Data Transformer.
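As a rough illustration of what such a converter component emits, here is a sketch that produces the three SQL sections listed above (remove tables, create tables, insert data) from one parsed record. The table and column names are invented, and the XML parsing step (JAXB in my solution) is omitted:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a loader component: one record type in, three SQL sections out.
public class SqlGenerator {

    // rows: each row is an ordered map of column name -> string value,
    // as it would come out of a parsed XML record. All columns are TEXT
    // here for simplicity; a real loader would derive types from the schema.
    static String generate(String table, List<LinkedHashMap<String, String>> rows) {
        StringBuilder sql = new StringBuilder();
        Set<String> columns = rows.get(0).keySet();

        // Section 1: remove tables
        sql.append("DROP TABLE IF EXISTS ").append(table).append(";\n");
        // Section 2: create tables
        sql.append("CREATE TABLE ").append(table).append(" (")
           .append(String.join(" TEXT, ", columns)).append(" TEXT);\n");
        // Section 3: insert data
        for (Map<String, String> row : rows) {
            sql.append("INSERT INTO ").append(table)
               .append(" (").append(String.join(", ", columns)).append(") VALUES ('")
               .append(String.join("', '", row.values())).append("');\n");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> row = new LinkedHashMap<>();
        row.put("code", "SHIRT-01");
        row.put("name", "Blue shirt");
        System.out.print(generate("products", List.of(row)));
    }
}
```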
In my solution, the structure of the tables is based on the structure of the XMLs. If you add a field to the XML, the corresponding database field is automatically added to -createTable.sql. The Java object model used in the data loader is supposed to be regenerated/rebuilt as well (JAXB).

Usage:
./data-loader -enable categories,products,product.descriptions
./data-loader -enable all
./data-loader -enable all -disable products,categories
To execute the SQLs, I use a separate app. The full cycle looks like this:
./cleanSQLs
./data-loader -enable all
./executeSQL
After this phase, the database has been created and all the data has been loaded into it. Now it is the data transformer’s turn.
The transform step applies a set of rules to transform the data from the source to the target.
In my solution, the data transformer module
- creates additional tables and/or fields (if needed)
- adds indexes (if needed)
- extracts data from the tables and/or fields
- transforms the data according to agreed logic
- saves the transformed data into new tables and/or fields
The transformation rules are grouped so that they can be executed separately if the need arises. A very important feature of this solution is that the target fields are always new tables or fields, so you can run the transformation as many times as you want.
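The “always write to new tables or fields” rule can be sketched like this. The column names are invented, and an in-memory map stands in for the staging database:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a rerunnable transformation: the component reads the loaded
// column and writes its result into a separate target column, so the step
// can be re-run any number of times without corrupting the source data.
public class TransformSketch {

    // Each row is a mutable map of column -> value (an in-memory "table").
    static void normalizeName(List<Map<String, String>> table) {
        for (Map<String, String> row : table) {
            // Source column "name" is never modified; result goes to "name_clean".
            row.put("name_clean", row.get("name").trim().toLowerCase());
        }
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("name", "  Blue SHIRT ");
        List<Map<String, String>> table = new ArrayList<>();
        table.add(row);

        normalizeName(table);
        normalizeName(table); // re-running is safe: same input, same output
        System.out.println(row.get("name"));       // "  Blue SHIRT " (unchanged)
        System.out.println(row.get("name_clean")); // "blue shirt"
    }
}
```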
In my solution, the data transformation module also works with configurable components, which can be executed separately or in groups. For example, in my last project the following components were used:
The same single responsibility concept applies here: each component should do one specific thing and not try to do more.
Images and links
One of the challenging topics is converting embedded images and links into hybris media and page URLs.
At the data transformation phase, the new URLs of the images are not yet known because the images haven’t been uploaded to hybris yet.
In my solution, the conversion is performed in two phases. The first phase ends with data (an ImpEx export) that is used in the second phase. The following diagram is self-explanatory:
./data-transformation -enable categories,products
./data-transformation -enable all
./data-transformation -enable all -disable products,categories
Together with the previous phase, the full cycle looks like this:
./cleanSQLs
./data-loader -enable all
./executeSQL
./data-transformation -enable all
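The two-phase conversion of embedded images and links described above can be sketched as follows. The placeholder syntax, media codes, and URLs are invented for illustration:

```java
import java.util.Map;

// Sketch of the two-phase link conversion. Phase 1 rewrites legacy image
// references into stable placeholders keyed by a media code; phase 2 runs
// after the media have been imported into hybris, when the real URLs are
// known (e.g. taken from an ImpEx export), and replaces each placeholder.
public class LinkRewriter {

    // Phase 1: legacy path -> placeholder token.
    static String toPlaceholders(String html, Map<String, String> pathToCode) {
        for (Map.Entry<String, String> e : pathToCode.entrySet()) {
            html = html.replace(e.getKey(), "{{media:" + e.getValue() + "}}");
        }
        return html;
    }

    // Phase 2: placeholder token -> real hybris media URL.
    static String toUrls(String html, Map<String, String> codeToUrl) {
        for (Map.Entry<String, String> e : codeToUrl.entrySet()) {
            html = html.replace("{{media:" + e.getKey() + "}}", e.getValue());
        }
        return html;
    }

    public static void main(String[] args) {
        String legacy = "<img src=\"/old/img/shirt.png\">";
        String phase1 = toPlaceholders(legacy, Map.of("/old/img/shirt.png", "shirt-img"));
        // ... media are imported into hybris between the two phases ...
        String phase2 = toUrls(phase1, Map.of("shirt-img", "/medias/shirt.png"));
        System.out.println(phase2); // <img src="/medias/shirt.png">
    }
}
```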
After this phase, the database has been updated: the data is cleansed and enriched, duplicates are removed, etc. Now it is the ImpEx exporter’s turn.
The Exporter module creates a set of ImpEx files to load the data produced in the previous two steps into hybris.
There is no data modification at this step: the data in the output files is exactly the same as the data in the database. If any transformation is needed, it should be performed earlier.
There is a set of components (or groups of them) that can be enabled or disabled from the command line. The output files are placed into the output folder. Each component does only one thing and can be used separately from the others.
There is common code shared by all components that helps generate valid ImpEx files. Each ImpEx file has a header and a set of data sections. Each section has a section header and section data. The section header is basically an INSERT or INSERT_UPDATE statement. The section data is a CSV data block compatible with the ImpEx syntax.
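A rough sketch of such a section generator; the type and attribute names are illustrative, not taken from a real project:

```java
import java.util.List;

// Sketch of the shared "valid ImpEx" helper: each section gets a header
// line (an INSERT_UPDATE statement with the column list) followed by
// semicolon-separated data rows.
public class ImpexWriter {

    static String section(String type, List<String> columns, List<List<String>> rows) {
        StringBuilder out = new StringBuilder();
        // Section header: INSERT_UPDATE statement with the column list.
        out.append("INSERT_UPDATE ").append(type)
           .append(";").append(String.join(";", columns)).append("\n");
        // Section data: one semicolon-separated line per row.
        for (List<String> row : rows) {
            out.append(";").append(String.join(";", row)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(section("Product",
            List.of("code[unique=true]", "name[lang=en]"),
            List.of(List.of("SHIRT-01", "Blue shirt"))));
    }
}
```

Running it prints a minimal two-line ImpEx section: the INSERT_UPDATE header followed by one data row.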