Delta transformations in data jobs

Limited availability

This functionality is currently in limited availability. If you’re interested in trying it out, get in touch with us at Celopeers.

For a stable and optimized data pipeline, we recommend executing delta transformations in your data jobs by leveraging _CELONIS_CHANGE_DATE. This approach is designed to substantially boost your data pipeline performance, enabling the seamless implementation of an end-to-end delta pipeline.

The benefits of executing delta transformations using _CELONIS_CHANGE_DATE compared to the former concept based on staging tables include:

  • The ability to use the same approach across both data jobs and data pools.

  • No impact on the APC you're consuming, as the concept of staging tables is no longer used.

  • The approach is technology agnostic, meaning it works independently of the source system that you're extracting data from.

Before configuring and executing your delta transformations using this approach, you should understand how delta transformations work.

Delta transformations focus on updating only newly extracted or changed data instead of updating the entire data set. This approach is more efficient and performant, especially when dealing with large datasets. 

Although extractions typically operate in delta mode, incorporating only newly created or updated records from the source system into the Celonis Platform, transformations have traditionally been executed in full. This involves dropping and fully recreating tables containing manipulated data with each transformation run. For scenarios requiring low latency, this approach proves highly inefficient, resulting in long transformation times, particularly as data volume expands.

To help explain this, consider the following example scenario:

  • Data with a volume of hundreds of millions of records, representing the past three years, is loaded into Celonis, and a delta extraction picks up new records (usually just a few thousand rows) from the source system every two hours.

    • Using full transformations: On each execution, all of those hundreds of millions of records need to be processed again from scratch by the Celonis Data Storage, although only a few thousand records have actually been updated or added.

      A common example of this is the event/activity table, which gets dropped via:

      DROP TABLE IF EXISTS _CEL_AP_BSIK_ACTIVITIES;

      Then recreated via:

      CREATE TABLE _CEL_AP_BSIK_ACTIVITIES (
      	"_CASE_KEY" VARCHAR(30)
      	,"ACTIVITY_DE" VARCHAR(60)
      	,"ACTIVITY_EN" VARCHAR(60)
      	,"EVENTTIME" DATETIME
      	, ...
      )

      And finally, all the data is inserted again, e.g. via:

      INSERT INTO "_CEL_AP_BSIK_ACTIVITIES" (
          "_CASE_KEY"
          ,"ACTIVITY_DE"
          ,"EVENTTIME"
        , ...
      )
      SELECT 
          invoice."_CASE_KEY" AS "_CASE_KEY"
          ,'Vendor Creates Invoice' AS "ACTIVITY_EN"
          , CAST(invoice."date" as DATE) + CAST('00:00:01' AS TIME) AS "EVENTTIME"
        , ...
      FROM invoice
    • Using delta transformations: The existing data is persisted and the transformations only pick up the few thousand records that have changed and add those to the already transformed data set.

      With delta transformations, this drop-and-recreate logic gets replaced by a so-called merge logic that uses DELETE / INSERT and therefore processes only new and updated records (a minimal sketch follows; the concrete steps are walked through below).
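
      A minimal sketch of this merge pattern, using hypothetical table names (target, source) and a hypothetical 'lastRun' timestamp; in practice, the change-date filtering is managed for you by the Celonis Platform as described in the rest of this page:

      -- Delete the previous version of any record that was updated since the last run
      DELETE FROM target
      WHERE EXISTS (
          SELECT 1 FROM source AS NEW_DATA
          WHERE NEW_DATA._CELONIS_CHANGE_DATE > 'lastRun'
          AND target.PrimaryKey = NEW_DATA.PrimaryKey
      );

      -- Re-insert the new and updated records
      INSERT INTO target
      SELECT * FROM source
      WHERE _CELONIS_CHANGE_DATE > 'lastRun';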

While handling new records is quite straightforward, updating existing records in the activity table described above is a bit more complex. This requires knowledge of the following concepts (explained below):

  • The _CELONIS_CHANGE_DATE.

  • The upper and the lower bound.

  • The delta placeholder.

The _CELONIS_CHANGE_DATE automatically gets added by the Celonis system to every record (new as well as updated) as an additional column at the point in time the record gets written into the Celonis Data Storage. Therefore, it’s a perfect identifier of what has changed between two runs, i.e. what the delta is.

For example:

  • If you run a full extraction at “2023-01-01 00:00:01”, all records loaded into Celonis as part of that full extraction will have _CELONIS_CHANGE_DATE = “2023-01-01 00:00:01”.

  • If you then run a delta extraction a few hours later at “2023-01-01 08:00:00”, the existing records will still have the “old” _CELONIS_CHANGE_DATE (“2023-01-01 00:00:01”) while any new and updated records will have _CELONIS_CHANGE_DATE = “2023-01-01 08:00:00”.

  • You can therefore easily identify all new and updated records with a query.

    For example, using the following query:

    SELECT * FROM table WHERE _CELONIS_CHANGE_DATE > '2023-01-01 00:00:01';

With this approach, the system stores and uses the required lower and upper bound per data job and table. That means the lower and upper bound used to identify the delta records will differ between data jobs, as those are executed at different frequencies. The lower and upper bound will also differ for the same table when used in two different data jobs, again because those data jobs are executed at different frequencies and therefore need to be treated in isolation.

An advantage of the _CELONIS_CHANGE_DATE delta transformation approach is that it can be applied in cases where extractions and transformations are in different data jobs or even different data pools (e.g. a central extraction pool and use-case-specific transformation pools). That also means that extraction and transformation aren't necessarily in sync, and new data can be loaded into Celonis for a table while the same table gets referenced in a delta transformation. To ensure that the delta batch the delta transformations are executed on is consistent across all transformations within a data job, you need to understand the concept of the upper and lower bound.

Upper bound: This is equal to the maximum _CELONIS_CHANGE_DATE that gets processed within one data job execution.

The upper bound for a table is established at the moment when the first transformation of a data job that references the table enters the execution phase (Status = Running). That means we are “locking in” the delta batch for this table throughout the entire duration of a data job execution. This ensures that when the same table is referenced in a subsequent transformation within the same data job executed later, it will consistently operate on the same delta batch.
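
To illustrate the lock-in, consider the following hypothetical timeline (illustrative timestamps only):

-- 10:00:00  Data job starts; transformation 1 references table X
--           -> upper bound for X is locked at 10:00:00
-- 10:05:00  A delta extraction writes new records with _CELONIS_CHANGE_DATE = 10:05:00
-- 10:10:00  Transformation 2 references table X
--           -> it still operates on the locked upper bound of 10:00:00, so the
--              10:05:00 records are excluded and picked up by the next data job run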

Lower bound: This is equal to the minimum _CELONIS_CHANGE_DATE that gets processed within one data job execution.

The lower bound for a table is defined as the upper bound of the previous successful run. This ensures that any new data that has been loaded into the system for this table since the last execution of a delta transformation gets transformed within the next delta execution. For the first delta transformation, or on the first delta run after any full execution, the lower bound gets set to “minimum”, meaning we are not setting any lower bound but only an upper bound.

Using the upper and lower bound, we can easily identify the delta of any table as:

SELECT * FROM table WHERE _CELONIS_CHANGE_DATE > 'lowerBound' AND _CELONIS_CHANGE_DATE <= 'upperBound'

In the example above, the delta would be represented as:

SELECT * FROM table WHERE _CELONIS_CHANGE_DATE > '2023-01-01 00:00:01' AND _CELONIS_CHANGE_DATE <= '2023-01-01 08:00:00'

You need both the upper and lower bounds for the following reasons:

  • The upper bound provides the starting point for the following delta execution, helping you to run these across multiple data jobs or data pools.

  • If you only had a lower bound, the data batch you are transforming throughout a data job might change for the same table, which could potentially lead to inconsistent data or certain JOINs between tables not being applied correctly. (In theory this can happen with full transformations as well, but there it’s not as critical, as such data will be processed with the next full run anyway.)

The lower bound for all tables will only get updated if the past two executions of the data job, including all its transformations, have been executed successfully. The past two executions are preferred over just the past one to ensure referential integrity and address dependencies between tables, i.e. when there is a JOIN between them. If a record in a dependent table arrives late, the records of the previous delta batch may need to be reprocessed so that the JOIN works properly.

The delta placeholder brings all of the previous concepts together and ensures that you don’t have to worry about the _CELONIS_CHANGE_DATE or the upper and lower bounds; it’s all managed for you automatically by the Celonis Platform.

The placeholder is indicated by <$tableName> in the SQL code. Depending on the execution type (delta vs. full), the table reference gets dynamically replaced as follows:

  • In case of a full load, the placeholder <$tableName> gets automatically replaced with the following when executing the SQL query to select all records of that table:

    SELECT * FROM table
  • In case of a delta load, the placeholder <$tableName> gets automatically replaced with the following when executing the SQL query, to ensure only the latest delta batch gets transformed (see the sketch after this list):

    SELECT * FROM table WHERE _CELONIS_CHANGE_DATE > 'lowerBound' AND _CELONIS_CHANGE_DATE <= 'upperBound'
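
Putting this together: a transformation author only writes the placeholder, and the Celonis Platform substitutes the bounds at runtime. A minimal sketch, using a hypothetical target table invoice_changes:

-- What you write in the transformation:
INSERT INTO invoice_changes
SELECT * FROM <$invoice>;

-- What effectively gets executed during a delta load (bounds managed by the platform):
INSERT INTO invoice_changes
SELECT * FROM invoice
WHERE _CELONIS_CHANGE_DATE > 'lowerBound' AND _CELONIS_CHANGE_DATE <= 'upperBound';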

You cannot simply reference all tables that get joined in delta transformations via their placeholders; you need to carefully define the ones for which the delta batch, along with its placeholder, should be applied.

For example, let's choose the transformation Add Activity: Record Goods Receipt (Delta) of the SAP Accounts Payable Process Connector from the Celonis Marketplace. The transformation looks like this (some minor parts that are irrelevant for this example have been removed):

INSERT INTO "_CEL_AP_EKBE_ACTIVITIES_DELTA" (
    "_CASE_KEY"
    ,"ACTIVITY_EN"
    ,"EVENTTIME"
    ,"USER_NAME"
    ,"USER_TYPE"
    ,"_SORTING"
)
SELECT
    B._CASE_KEY AS "_CASE_KEY"
    ,'Record Goods Receipt' AS "ACTIVITY_EN"
    ,CAST(EKBE.CPUDT AS DATE) + CAST(EKBE.CPUTM AS TIME) AS "EVENTTIME"
    ,EKBE.ERNAM AS "USER_NAME"
    ,USR02.USTYP AS "USER_TYPE"
    ,150 AS "_SORTING"
FROM
    "TMP_BKPF_BSEG_DELTA" AS B
    INNER JOIN "RSEG" AS RSEG
        AND B.MM_KEY = RSEG.MANDT || RSEG.BELNR || RSEG.GJAHR
    INNER JOIN <$EKBE> AS EKBE
        RSEG.MANDT = EKBE.MANDT
        AND RSEG.EBELN = EKBE.EBELN
        AND RSEG.EBELP = EKBE.EBELP
     LEFT JOIN "USR02" AS USR02
         EKBE.MANDT = USR02.MANDT
         AND EKBE.ERNAM = USR02.BNAME
WHERE
    EKBE.VGABE IN ('1') AND EKBE.SHKZG IN ('S')

You can see that the only table referenced via a placeholder in this example is <$EKBE> (let’s ignore for a moment the temporary table TMP_BKPF_BSEG_DELTA, which has been created via a separate transformation that also only takes new records into account). Unlike <$EKBE>, all other tables are used without placeholders, following the conventional full transformation approach.

This is required to capture all the relevant data from joined tables. One obvious example is the join with the user table USR02. This table doesn’t change frequently, as new users only get onboarded to the SAP system every now and then. For the join between EKBE and USR02 to match any records, it therefore needs to be performed on the full, historic USR02 table. As the USR02 table is rarely updated, a join between <$EKBE> and <$USR02> would result in no matches most of the time, because the delta batch of USR02 would usually be empty.

The guiding principle here is straightforward: the leading table per activity (the one defining the EVENTTIME of the activity) is referenced through its placeholder so that the delta is applied, while the remaining tables are joined in full.
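
As a generic sketch of this principle (all table and column names below are hypothetical):

INSERT INTO "MY_ACTIVITY_TABLE_DELTA"
SELECT
    L."_CASE_KEY"
    ,'Some Activity' AS "ACTIVITY_EN"
    ,CAST(L."CREATION_DATE" AS DATE) AS "EVENTTIME"  -- EVENTTIME is defined by the leading table
    ,D."SOME_ATTRIBUTE"
FROM <$LEADING_TABLE> AS L        -- leading table: referenced via placeholder, delta batch applied
LEFT JOIN "DIMENSION_TABLE" AS D  -- supporting table: joined in full, no placeholder
    ON L."JOIN_KEY" = D."JOIN_KEY";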

To account for any new activities, we ultimately need to INSERT those into the existing activity tables. The most important thing to consider here is: how do we ensure we are inserting new records only once, and not inserting the same records into the event/activity table multiple times, leading to duplicates?

To account for this, we are taking a three-step approach, which should be split into three different transformation tasks for performance reasons. To illustrate how this works, we'll look at the delta transformation for the event 'Vendor Creates Invoice':

  1. Create a separate delta activity table:

    CREATE TABLE _CEL_AP_BSIK_ACTIVITIES_DELTA (
        _CASE_KEY VARCHAR(50)
        ,ACTIVITY_EN VARCHAR(200)
        ,EVENTTIME DATETIME
        ,...
        ,MANDT VARCHAR(3)
        ,VBELN VARCHAR(10)
        ,POSNR VARCHAR(6)
    )
    ORDER BY
        _CASE_KEY
        ,ACTIVITY_EN
        ,_ACTIVITY_KEY
    ;

    In this example, we are using a separate activity table to hold the delta records. This is done for performance reasons so the INSERT can be split from the MERGE.

  2. INSERT records into the delta activity table:

    INSERT INTO "_CEL_AP_BSIK_ACTIVITIES_DELTA" (
        "_CASE_KEY"
        ,"ACTIVITY_EN"
        ,"EVENTTIME"
      , ...
    )
    SELECT 
        invoice."_CASE_KEY" AS "_CASE_KEY"
        ,'Vendor Creates Invoice' AS "ACTIVITY_EN"
        , CAST(invoice."date" as DATE) + CAST('00:00:01' AS TIME) AS "EVENTTIME"
      , ...
    FROM <$invoice> AS invoice
    

    In this example, we reference the placeholder indicated as <$invoice>, which ensures that only new records of the invoice object get transformed by filtering the table using the lower and upper bounds.

  3. Merge new records of the delta activity table into the original activity table:

    SELECT ANALYZE_STATISTICS(_CEL_AP_BSIK_ACTIVITIES_DELTA);
    SELECT ANALYZE_STATISTICS(_CEL_AP_BSIK_ACTIVITIES);
    
    INSERT INTO _CEL_AP_BSIK_ACTIVITIES
    SELECT *
    FROM _CEL_AP_BSIK_ACTIVITIES_DELTA AS NEW_ACTIVITIES
    WHERE NOT EXISTS (
        SELECT 1 FROM _CEL_AP_BSIK_ACTIVITIES AS OLD_ACTIVITIES
        WHERE
            NEW_ACTIVITIES._ACTIVITY_KEY = OLD_ACTIVITIES._ACTIVITY_KEY
            AND NEW_ACTIVITIES.ACTIVITY_EN = OLD_ACTIVITIES.ACTIVITY_EN
            AND NEW_ACTIVITIES._CASE_KEY = OLD_ACTIVITIES._CASE_KEY
    );
    

    In this example, we are merging the records of the delta activity table into the existing activity table using a WHERE NOT EXISTS query. It compares the records already available in the _CEL_AP_BSIK_ACTIVITIES table from the full transformation (referenced as OLD_ACTIVITIES) with the newly transformed records in _CEL_AP_BSIK_ACTIVITIES_DELTA (referenced as NEW_ACTIVITIES), based on three fields:

    • _ACTIVITY_KEY

    • ACTIVITY_EN

    • _CASE_KEY

When you have a price change on an invoice, this would normally have two impacts:

  • A new event/activity “Change Price” gets inserted into the event/activity table as described in the first scenario above.

  • The price on the object “invoice” itself needs to be updated to reflect the latest price. The update on the raw table of the invoice is automatically accounted for by the delta extraction. However, the raw table is rarely loaded into the final data model; it is usually joined with other tables and further manipulated via transformations, and the outcome is written into a new table (which we’ll call invoice_manipulated here). Ultimately, invoice_manipulated is the relevant table that gets loaded into the data model. As the invoice_manipulated table is created via transformations, any update here also needs to be accounted for via transformations.

    To account for the update in the invoice_manipulated table, we DELETE the existing records from invoice_manipulated and INSERT the updated records again. To identify old and updated records, we are using the primary keys of the table. Let’s look at a concrete example for our invoice_manipulated table, starting with the DELETE part, which looks like this:

    DELETE FROM invoice_manipulated
    WHERE EXISTS (SELECT 1 FROM <$invoice> as NEW_DATA 
                  WHERE invoice_manipulated.PrimaryKey1 = NEW_DATA.PrimaryKey1 
                  AND invoice_manipulated.PrimaryKey2 = NEW_DATA.PrimaryKey2
                  AND invoice_manipulated.PrimaryKey3 = NEW_DATA.PrimaryKey3
                  );
    

    As with the event/activity table example above, we can see a WHERE EXISTS query. It compares the records already available in the invoice_manipulated table from the previous transformation execution with the newly transformed records coming from the new delta batch of the invoice table (indicated via its placeholder <$invoice>), and deletes records based on matching primary keys (in our case, the invoice table has three columns defining the primary key).

    Within the same transformation, we are then inserting the new and updated records as follows:

    INSERT INTO invoice_manipulated
    SELECT
        "invoice"."column1"
        ,"invoice"."column2"
        ,"invoice"."column3"
        , ...
    FROM <$invoice> AS invoice
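
    Because the DELETE removes any previous version of a record whose primary key appears in the delta batch, the subsequent INSERT can safely re-insert both new and updated records from <$invoice> without creating duplicates.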
    

When configuring your delta transformations, consider the following frequently asked questions:

  • Where should I start the implementation?

    We recommend the following:

    • Start by using our standard data integration connectors as these include ready-to-use delta transformations for your source systems.

    • Consider a relatively simple data job with a maximum of 10 transformations.

    • Focus on any transformations with poor performance, on well-known transformations, or on operational use cases.

  • Should I make the changes in my existing data jobs or create a copy of those?

    Both are possible and the decision depends on the level of customization. The following Celonis Academy course includes a guide on how to do this: Celonis Academy - Setting up a Data Pipeline.

  • Do I still need Full transformations or could I just use delta transformations and rely on the placeholder to define whether all records or only the delta batch get selected?

    Yes, Full transformations are still required, in particular for the following reason:

    • Capturing deletes: As with delta extractions (except trigger-based extractions via the Replication Cockpit), deletes won’t be captured. Therefore, we suggest triggering a Full transformation run once in a while (e.g. on the weekend).

  • Are there process connectors leveraging the Delta transformation approach available in the Celonis Marketplace already?

    We are currently working on adapting the existing Process Connectors to include Delta transformations. We have one template for SAP O2C available that leverages the described approach, which we can share on request. As the delta logic is the same across all Process Connectors, this serves as a starting point.

  • In the Delta transformation templates downloaded from the Celonis Marketplace, I can see that there are still some tables getting dropped and recreated, such as in the SAP Accounts Payable transformation Create Table: BKPF_Delta - why is that?

    This is the correct and expected behavior. In order to improve performance, certain temporary tables such as BKPF_Delta are created at the beginning of the data job every time it is run. If you look into the details, you will notice that the BKPF_UNION view that gets created only contains the delta batches of BKPF_BSAK and BKPF_BSIK (as indicated by the placeholders):

    CREATE OR REPLACE VIEW "BKPF_UNION" AS (
    SELECT * FROM <$BKPF_BSAK>
    UNION ALL
    SELECT * FROM <$BKPF_BSIK>
    );
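
    Because the view only selects the delta batches of BKPF_BSAK and BKPF_BSIK, the temporary tables built on top of it stay small, so recreating them on each run is cheap compared with a full rebuild.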