Delta Detection and Incremental Updates in the Integration Processes: Reducing Unnecessary Writes to the Database


An incremental load is a form of ETL (Extract, Transform, Load) process that involves transferring only the data that has been altered since the last loading session. This approach is both more efficient and quicker than performing a full load because it minimizes the volume of data that needs to be moved and processed. 

However, implementing an incremental load necessitates mechanisms for recognizing and monitoring modifications in the source data, using methods like timestamps, flags, or versioning. Additionally, it must effectively manage issues like conflicts and duplications, for instance, when a single record undergoes multiple updates.

In this article, I focus on my experience implementing an optimized data loader integrated into the target system, which was SAP Commerce. Simply put, it handles duplicates and minimizes the number of unnecessary database writes. 

Implementing an optimized data loader for incremental updates, particularly in complex environments like SAP Commerce (formerly Hybris), requires a nuanced understanding of the data flow and an effective strategy for minimizing unnecessary writes to the database. Let’s focus on efficiently handling product data, availability, prices, order statuses, and customer information.

 

Delta detection is the cornerstone of any efficient incremental load process. By identifying only the changes since the last load, you dramatically reduce the volume of data transferred and processed. 

Delta detection can be done either by comparing entire records or via a timestamp column. Delta detection based on the last update date/timestamp is the most efficient performance-wise, but very often it is not supported by the upstream systems, especially when the information is collected from many sources and the timestamps from several systems have to be combined into one. That works only until we get a system that does not track the last updated time at all.

So, common methods include:

  • Timestamps: Tracking the last update time of records to identify new and modified data.
  • Comparative Data Analysis: comparing the incoming data set with the data already in the system, at various levels of granularity, from entire records down to specific fields within records. Additions, updates, and deletions are handled separately. 

Each method has its pros and cons, and the choice often depends on the specific requirements of the integration and the capabilities of the source and target systems.

Timestamps

Scenario: When an update occurs in the upstream Product Information System, SAP Commerce is notified and receives a timestamp indicating when the product was modified. This timestamp is then compared against the one SAP Commerce has on record for the same product. If the newly received timestamp indicates a more recent update, SAP Commerce proceeds with the import process to update the product information. Should the timestamp be the same or earlier than what’s already recorded, SAP Commerce disregards the update, bypassing the import process for that item.
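
As a minimal illustration of this check, the decision boils down to a single comparison. The two record types below are hypothetical holders for the sake of the sketch, not actual SAP Commerce classes:

    import java.time.Instant;

    public class TimestampDeltaCheck {

        // Hypothetical holders for the incoming payload and what is already stored
        record ReceivedProduct(String code, Instant sourceModifiedTime) {}
        record StoredProduct(String code, Instant lastImportedSourceTime) {}

        // Returns true when the incoming record is newer than what was already imported
        static boolean shouldImport(ReceivedProduct incoming, StoredProduct existing) {
            if (existing == null) {
                return true; // nothing imported yet for this product code
            }
            // import only when the source reports a strictly newer modification time
            return incoming.sourceModifiedTime().isAfter(existing.lastImportedSourceTime());
        }
    }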

The source system may provide more than one timestamp — for example, for different sets of attributes.

Pros:

  • Simple to implement and understand.
  • Provides a clear chronological order of updates.

Cons and Challenges:

  • The source system may not supply timestamps indicating when each object was created or last updated, and adding them may be impossible or expensive.

 

Comparative Data Analysis

Upon receiving the new or updated data set, the system first performs a comprehensive comparison with its current data. This comparison can be done at various levels of granularity, from entire records down to specific fields within records. The primary goals are to:

  • Identify New Entries: Records that are present in the incoming data but not in the current system dataset are flagged for addition.
  • Detect Updates: Records that exist in both the incoming data and the current system but have differences are flagged for update. This can involve complex logic if the system needs to identify changes at the field level.
  • Recognize Deletions: Records that are absent in the incoming data but present in the current system are flagged for deletion.

Pros:

  • No changes required in the upstream systems
  • Works for basically any object; adapting it to a new data type may require little to no changes to already-tested code

Cons and Challenges:

  • The comparative analysis and subsequent data operations can be resource-intensive
    • For performance, you need to store a hash of an item in the database and ensure that the hash is recalculated each time the item is modified
    • For performance, the “existing vs. incoming” comparison can be done in memory, which may be challenging for big datasets
  • May need comprehensive caching strategies
  • The logic for detecting changes and deciding on deletions must be robust to prevent data loss or inaccuracies
  • Deletion requires a batch load of the whole set to detect which existing items should be marked for removal
  • As the dataset grows, the system must be able to handle larger volumes of data without significant degradation in performance or accuracy

 

What SAP Commerce Provides Out-of-the-box

Technically, SAP Commerce doesn’t provide anything for avoiding unnecessary writes to the database. The closest thing is a feature called Delta Detection, but it was designed for a somewhat different task.

SAP Commerce’s delta detection mechanism acts as a kind of external metadata for existing items, containing “changed/unchanged” flags and the date/time of changes. This is useful for synchronizing two instances of SAP Commerce.

The core of SAP Commerce’s delta detection lies in its versatile ChangeDetectionService. This service allows users to identify changes by specific items or by item type. When searching by type, the service can retrieve all items that have experienced any modifications—be it creation, modification, or deletion. Once changes are detected, they can be “consumed,” which means they are marked in a way that they won’t be re-identified in subsequent checks. This ensures a clean and efficient way to handle data integrity and consistency.

For that, the platform uses an object called ItemVersionMarker to tag changes, holding a reference to the altered item along with details such as the type of change and the version timestamp. This enables precise tracking and management of changes. Another significant feature is stream awareness, which accommodates multiple data streams within the platform, allowing changes to be tracked and managed distinctly across different streams. If you consume changes for a given stream, that stream no longer detects the “old” changes, yet other streams will still be able to detect them.

However, this mechanism is useful when SAP Commerce acts as the source system that exports items and you want to avoid re-exporting items that have not changed since the previous attempt. It does not address the task of “not re-importing what has already been received in SAP Commerce from an external system.”

 

Custom Comparative Data Analysis Approach

To avoid unnecessary database writes, in my solutions, I introduced a hash attribute for all objects involved in the data exchange. It is assumed that any changes to such objects can occur exclusively in the master system, as the source of truth.

Accordingly, the hash is calculated only at the moment of import and is recorded in the database. Technically, it can also be recalculated whenever an object is changed, via the item interceptors mechanism.
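
A sketch of such a recalculation via a prepare interceptor could look like the following; the model class and the hash calculator are hypothetical project classes, and only the PrepareInterceptor contract itself comes from the SAP Commerce service layer:

    import de.hybris.platform.servicelayer.interceptor.InterceptorContext;
    import de.hybris.platform.servicelayer.interceptor.InterceptorException;
    import de.hybris.platform.servicelayer.interceptor.PrepareInterceptor;

    // Recalculates the content hash right before the model is saved.
    // MyProductModel and ContentHashCalculator are hypothetical project classes.
    public class ContentHashPrepareInterceptor implements PrepareInterceptor<MyProductModel> {

        private final ContentHashCalculator hashCalculator = new ContentHashCalculator();

        @Override
        public void onPrepare(final MyProductModel model, final InterceptorContext ctx) throws InterceptorException {
            // recompute the hash from the business attributes only, never from the hash attribute itself
            model.setContentHash(hashCalculator.calculate(model));
        }
    }

As with any interceptor, it would be registered through an interceptor mapping for the item type.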

In cases where there are a large number of objects, they are divided into groups, and a hash is additionally calculated for each group. For very large volumes, a hash can also be introduced for groups of groups. This allows the system to quickly determine whether an element needs to be inspected in detail, which saves a lot of time and memory for big datasets.
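
A group hash can simply be a hash over the sorted hashes of the group’s members, so a single comparison tells whether anything inside the group changed. A minimal sketch, with SHA-256 chosen purely as an example:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;
    import java.util.List;

    public final class GroupHashes {

        // Group hash = hash over the concatenation of its members' hashes (pre-sorted for stability)
        static String groupHash(List<String> sortedMemberHashes) {
            return sha256(String.join("|", sortedMemberHashes));
        }

        static String sha256(String input) {
            try {
                MessageDigest digest = MessageDigest.getInstance("SHA-256");
                return HexFormat.of().formatHex(digest.digest(input.getBytes(StandardCharsets.UTF_8)));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("SHA-256 is always available", e);
            }
        }
    }

If the group hashes match, none of the members need a detailed comparison; only groups whose hash differs are drilled into.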

Suppose you have 100,000 products in your database, and you’ve received 50,000 updates from an external system—all of which, according to the external system, need to be made in SAP Commerce. However, SAP Commerce only needs updates for certain attributes, and in reality, updates may only be necessary for, say, 100 products.

The standard SAP approach in this case is to perform 50,000 updates. This takes a lot of time and resources and is basically useless: the system rewrites ABC with ABC, which is especially wasteful for multi-table data structures. This is how SAP Cloud Platform Integration works, this is how Datahub works, and this is how OData integration operates. There are no built-in mechanisms that would prevent an item from being updated when no changes in attributes are detected compared to the existing values in the database. In reality, SAP Commerce may perform significantly more than 50,000 writes because related tables are involved as well.

My solution offered a practical approach to optimizing database updates in SAP systems by using hash calculations:

  • Hash Calculation for Incoming Data: Before performing an update, calculate a hash based on the attributes that define product X as it comes from the external system.
  • Hash Retrieval from Database: Extract the hash from the database, which was built using the same algorithm but based on the attributes already stored in the database for product X.
  • Decision on Update Necessity:
    1. If the hashes are different, there is a need to update the product in the database.
    2. If the hashes are the same, there is no need to update the product.
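
A minimal sketch of this decision; the attribute selection and hash storage are project-specific, and the sha256 helper from the group-hash sketch above is reused:

    import java.util.Map;
    import java.util.Objects;
    import java.util.TreeMap;

    public class HashBasedUpdateDecision {

        // A stable, canonical string of the attributes included in the hash;
        // TreeMap gives a deterministic attribute order, which the hash depends on
        static String canonicalForm(Map<String, Object> attributes) {
            return new TreeMap<>(attributes).toString();
        }

        // True if the incoming attribute set differs from what is already stored for the product
        static boolean needsUpdate(Map<String, Object> incomingAttributes, String storedHash) {
            String incomingHash = GroupHashes.sha256(canonicalForm(incomingAttributes));
            return !Objects.equals(incomingHash, storedHash);
        }
    }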

This method can significantly reduce unnecessary data processing and improve the efficiency of system operations by ensuring that only necessary updates are performed. It leverages the fundamental property of hashes: any change in data will change the hash, providing a quick mechanism to detect modifications.

This approach was also used to optimize product indexing. I call it Smart Full Index Update. Before indexing a large group of products, the system first retrieves all hashes from SOLR and calculates all hashes for the products in the indexing group. If there is no difference, we do not perform indexing at all. Retrieving the sku-hash combinations from SOLR for 100,000 products is quite fast (seconds) and leads to a very rapid index update. However, there are nuances with indexing, which will be discussed below.

The main nuance is that SOLR requires not just one, but several hashes—corresponding to the number of objects associated with a product. For example, if the price changes, the product object itself does not change, and its hash remains the same. Therefore, a separate hash is needed for the price. Additionally, separate hashes are also required for images and stock levels. 

If there is a change in price, a difference in the hashes for the price will be recorded, and ultimately the system will decide to “index.” However, if all hashes match, there is no point in recalculating the index.
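
Conceptually, the indexing decision then compares a small bundle of hashes per SKU rather than a single value. A sketch; the aspect names are illustrative and not the actual SOLR schema:

    import java.util.Map;
    import java.util.Objects;

    public class IndexingDecision {

        // One hash per aspect that can trigger re-indexing of a product document
        record ProductHashes(String productHash, String priceHash, String stockHash, String mediaHash) {}

        // Re-index a SKU only if any of its aspect hashes differs from what SOLR currently holds
        static boolean needsReindex(ProductHashes current, ProductHashes indexed) {
            return indexed == null || !Objects.equals(current, indexed);
        }

        // The whole indexing group can be skipped when every SKU's hashes match
        static boolean groupNeedsReindex(Map<String, ProductHashes> currentBySku,
                                         Map<String, ProductHashes> indexedBySku) {
            return currentBySku.entrySet().stream()
                    .anyMatch(e -> needsReindex(e.getValue(), indexedBySku.get(e.getKey())));
        }
    }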

If data transformation results in losses that prevent calculating the hash in the same way for the data in the database and the data from the incoming set, I recommend saving the pre-transformation value in the database and using it to calculate the hash. Most often, adding an additional attribute to the table does not lead to performance issues, while abandoning delta detection in the mechanism for importing large volumes of data almost certainly will.

 

Batch Import

Quite often, data from external systems do not arrive one record at a time or in parts, but rather as a whole set: for example, all products, all prices, all stock levels, or all products of a certain category. If you follow the approach described above, in addition to “item needs to be updated” and “item needs to be created,” you also get “item needs to be deleted.”

Such integrations are not supported by SAP Commerce out of the box at all and have to be implemented as custom code.

In one of my recent projects, there were several such batches for different types—products, product features, stock levels, stores. For such cases, the algorithm divides the set of elements coming from the external system into the following groups:

  • To Update
  • To Create
  • To Ignore
  • To Remove (mark for removal)

To decide which group to place an element in, you need to extract the hashes of all existing elements, calculate hashes for all incoming elements using the same algorithm, and compare them. If the hashes are equal, the element falls into the toIgnore group. If it’s absent in the database, it goes into the toCreate group. If it’s present in the database but with a different hash, it goes into the toUpdate group. If it’s present in the database but not in the incoming batch, it goes into the toRemove group.

By the way, it’s better to delete elements asynchronously. That is, elements from the toRemove group are never deleted immediately but are marked for deletion (for example, active=false or toRemove=true), and a separate process deletes them physically from the database. In my case, this separate deletion process used direct access to the database, and deletion was carried out via JDBC as DELETE FROM table WHERE active = 0. After all such operations, of course, it is necessary to clear the caches across the entire cluster. Compared to deleting elements one by one, this approach is super fast. Element-by-element deletion in our case could take about half an hour, whereas deletion via a single DELETE statement took milliseconds.
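
A sketch of such a cleanup step using Spring’s JdbcTemplate; the table and column names are examples, and in SAP Commerce they would be the actual item table and the custom flag column:

    import javax.sql.DataSource;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class MarkedItemsCleanup {

        private final JdbcTemplate jdbcTemplate;

        public MarkedItemsCleanup(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        // Physically removes everything previously marked for deletion; returns the number of deleted rows.
        // The table name is assumed to come from a whitelist, never from user input.
        public int purgeMarked(String tableName) {
            int deleted = jdbcTemplate.update("DELETE FROM " + tableName + " WHERE active = 0");
            // after a direct JDBC change, the SAP Commerce caches must be invalidated across the cluster
            return deleted;
        }
    }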

Depending on the object, updating can theoretically be boosted with direct JDBC updates too. Even in SAP Commerce, the stock service uses JDBC updates. For batch updates, it might be even more efficient, especially if big subsets of your product stock amounts share the same value (like 1, 2, 3, etc.).
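
One way to exploit those repeated values is to group SKUs by the new amount and issue one UPDATE per distinct value. A sketch along the same lines; the table and column names are illustrative:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class GroupedStockUpdate {

        record StockRow(String sku, int amount) {}

        // One UPDATE per distinct stock amount instead of one statement per SKU
        public void update(JdbcTemplate jdbcTemplate, List<StockRow> changedRows) {
            Map<Integer, List<String>> skusByAmount = changedRows.stream()
                    .collect(Collectors.groupingBy(StockRow::amount,
                            Collectors.mapping(StockRow::sku, Collectors.toList())));

            skusByAmount.forEach((amount, skus) -> {
                String placeholders = String.join(",", Collections.nCopies(skus.size(), "?"));
                Object[] args = new Object[skus.size() + 1];
                args[0] = amount;
                for (int i = 0; i < skus.size(); i++) {
                    args[i + 1] = skus.get(i);
                }
                // very large SKU lists would be chunked here to respect the database's IN-list limits
                jdbcTemplate.update("UPDATE stocklevels SET available = ? WHERE productcode IN (" + placeholders + ")", args);
            });
        }
    }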

There is an important nuance: comparing large lists may require a lot of memory. Essentially, we’re only comparing hashes, but for a million products, that’s a million hashes from the database and a million calculated hashes from the input data. Here, it’s important to consider a more memory-efficient approach.

Whether to store all attributes from the external file in memory depends on what proportion of elements typically requires updating or creating in the database. If only a small portion of elements require writing, it might be better to read the input file twice: once for calculating hashes, and a second time to load into memory the attributes of only those elements that need to be created or updated in the database.
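
A two-pass variant might look like this; the line format (key;payload) and the hash function are assumptions, and only the items that actually need a write keep their full payload in memory:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;

    public class TwoPassLoader {

        // Pass 1: stream the file, keeping only key -> hash in memory
        static Map<String, String> collectHashes(Path inputFile, Function<String, String> hashFunction) throws IOException {
            Map<String, String> hashesByKey = new HashMap<>();
            try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(";", 2); // assumed format: key;payload
                    hashesByKey.put(parts[0], hashFunction.apply(parts[1]));
                }
            }
            return hashesByKey;
        }

        // Pass 2: re-read the file and materialize only the lines whose key needs a write
        static void loadSelected(Path inputFile, Set<String> keysToWrite) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String key = line.split(";", 2)[0];
                    if (keysToWrite.contains(key)) {
                        // parse the full payload and create/update the item here
                    }
                }
            }
        }
    }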

Next, instead of pulling all elements from the database, you can pull elements from a single large group. How to determine such groups depends on the nature of the elements. For instance, for products, it could be by category. In this case, comparing lists will consume less memory.

And of course, it’s necessary to compare lists that are sorted by key, or to use a HashMap. The HashMap option is simpler to implement: essentially, the key is the hash, and the value is the item’s unique identifier. However, this approach is slower than comparing sorted lists. 

To compare two lists of items with keys and hashes and categorize them based on specific criteria, you can employ a two-pointer approach, assuming the lists are sorted based on the keys.

    import java.util.ArrayList;
    import java.util.List;

    public class DeltaCategorizer {

        // Minimal item holder: a unique key plus a content hash calculated over its attributes
        record Item(String key, String hash) {}

        // The four result buckets, each holding item keys
        record Categorized(List<String> toCreate, List<String> toUpdate,
                           List<String> toIgnore, List<String> toRemove) {}

        // Both lists must be pre-sorted by key (ascending) for the two-pointer walk to work
        static Categorized categorize(List<Item> incomingItems, List<Item> existingItems) {
            List<String> toUpdate = new ArrayList<>();
            List<String> toIgnore = new ArrayList<>();
            List<String> toCreate = new ArrayList<>();
            List<String> toRemove = new ArrayList<>();
            int incomingIndex = 0;
            int existingIndex = 0;

            while (incomingIndex < incomingItems.size() && existingIndex < existingItems.size()) {
                Item incomingItem = incomingItems.get(incomingIndex);
                Item existingItem = existingItems.get(existingIndex);

                int compareResult = incomingItem.key().compareTo(existingItem.key());
                if (compareResult < 0) {
                    // the incoming key does not exist in the database yet
                    toCreate.add(incomingItem.key());
                    incomingIndex++;
                } else if (compareResult > 0) {
                    // the existing key is absent from the incoming batch
                    toRemove.add(existingItem.key());
                    existingIndex++;
                } else {
                    // the key exists on both sides: compare content hashes
                    if (incomingItem.hash().equals(existingItem.hash())) {
                        toIgnore.add(incomingItem.key());
                    } else {
                        toUpdate.add(incomingItem.key());
                    }
                    incomingIndex++;
                    existingIndex++;
                }
            }

            // Add remaining incoming items to toCreate
            while (incomingIndex < incomingItems.size()) {
                toCreate.add(incomingItems.get(incomingIndex).key());
                incomingIndex++;
            }

            // Add remaining existing items to toRemove
            while (existingIndex < existingItems.size()) {
                toRemove.add(existingItems.get(existingIndex).key());
                existingIndex++;
            }

            return new Categorized(toCreate, toUpdate, toIgnore, toRemove);
        }
    }

The method returns four lists, each containing the keys of the items that fall into the respective category. This approach is efficient, operating in linear time relative to the combined size of both lists, provided they are sorted by key.

Implementing this approach allowed us to speed up the integration flows tenfold. Implementing it for Solr indexing helped to enhance the deployment procedure: instead of performing a slow full index (taking hours), we ran the optimized full index, which involved only the items changed since the last run. 

By the way, note that the key will almost always be a compound key incorporating more than one attribute, and your unique key definition should be revisited. For example, for an item type it’s not just the code but also, for instance, an active flag, which is reset when the item is marked for deletion. An item marked for deletion should not be included in the existing items at all, but keying only on the code would yield more than one item in the database, as inactive items (marked for deletion) accumulate alongside active ones.

There are some important points you should take into account before picking this solution for your needs:

  • Introducing a new attribute requires validating and updating the hash calculation. If you forget to include this attribute in the hash calculation formula, everything will work until you encounter an item that has changed only in this attribute. Ideally, the attribute-based hash calculation system should automatically include all new attributes. This complicates the system, especially if the attributes are referential.
    • If you extract the attributes automatically via the type system service, don’t include modifiedtime, creationtime, comments, pk, owner, sealed, itemtype, or the hash attribute itself in the hash calculation formula.
  • For calculating a hash from the object, you need to make sure that all attribute types have a proper toString().
  • Some data transformations should be inspected because they may introduce data loss during the transformation. For example, you need to handle null Booleans properly: sometimes Boolean.FALSE is equivalent to null, sometimes it is not. Basically, there are two corner cases: the incoming data doesn’t contain a Boolean/boolean attribute at all but it is present in the database as FALSE, or the boolean attribute is null in the database and the incoming data contain FALSE. Also, some double/float values can be transformed before being saved to the database (see the normalization sketch after this list).
  • Ideally, such a system should have robust tests to cover all possible cases. These tests can also be challenging to maintain if the system involves many objects with complex structures.
  • If you mark items for deletion, you will have situations when more than one item exists in the database with the combination of attributes that used to be the unique key. Items marked for deletion may accumulate faster than the cleanup job removes them, so your unique key strategy should be revisited.
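
The normalization mentioned above can be kept in one place and applied to every value before it enters the hash, so that equivalent representations do not produce different hashes. A sketch; the concrete rules (FALSE vs. null, decimal scale) must match your target data model:

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public final class HashValueNormalizer {

        // Treats a missing Boolean as FALSE, but only if that matches the target data model's semantics
        static Boolean normalize(Boolean value) {
            return value == null ? Boolean.FALSE : value;
        }

        // Rounds decimals the same way the persistence layer does (scale of 2 is just an example)
        static BigDecimal normalize(BigDecimal value) {
            return value == null ? null : value.setScale(2, RoundingMode.HALF_UP);
        }

        // Trims strings and collapses empty strings to null, assuming the database does the same
        static String normalize(String value) {
            if (value == null) {
                return null;
            }
            String trimmed = value.trim();
            return trimmed.isEmpty() ? null : trimmed;
        }
    }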
