How to design the data pipeline? Replications vs Data Jobs
Replication Cockpit (RC) and Data Jobs (DJ) are two different tools that serve the same basic purpose: defining and orchestrating how data is fetched from the source system into the Celonis Platform. RC does not yet support all ETL use cases, so it has to coexist with DJs to enable a fully functional data pipeline.
The matrix below summarises the use cases for which RC is the recommended tool. In the current public version, RC supports real-time extractions and real-time transformations for both Transactional and Metadata tables. Additionally, the Replication Cockpit allows you to execute Full Extractions (called Initializations).
While there is no exact formula for distinguishing Transactional from Metadata tables, the general approach is that a table that includes a Case or Activity is Transactional and should be extracted in real time. Examples are EKKO, EKPO, CDPOS, CDHDR, and BKPF. Metadata tables, in contrast, contain relatively static information that is updated infrequently and/or is less important from an analytical perspective. These tables can be extracted via either RC or DJ. All of T.... and D.... tables fall in this category, and master data tables (for example KNB1, LFA1, MARA, MARC) can sometimes belong here as well, assuming their data is not important for operational use cases.
Table type | Step | Full | Delta
---|---|---|---
Transactional Tables | Extraction | RC | RC
Transactional Tables | Transformation | DJ | RC
Metadata Tables | Extraction | DJ or RC | DJ or RC
Metadata Tables | Transformation | DJ | DJ or RC
Data Model Load | | DJ | DJ
Ultimately, RC can be used to run continuous data replication from the source system, covering transactional tables and, if necessary, metadata tables. The Initialization functionality allows you to execute full extraction loads in the RC. Data model loads, however, can only be orchestrated via DJs.
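The recommendations in the matrix can be written down as a small lookup, which may help when documenting or reviewing a pipeline design. The sketch below is purely illustrative: `MATRIX` and `recommended_tools` are hypothetical names, not part of any Celonis API, and the values are transcribed from the matrix above.

```python
# Recommended tool(s) per (table type, step, load mode), transcribed from the matrix.
# Hypothetical helper for illustration only; not a Celonis API.
MATRIX = {
    ("transactional", "extraction", "full"): {"RC"},
    ("transactional", "extraction", "delta"): {"RC"},
    ("transactional", "transformation", "full"): {"DJ"},
    ("transactional", "transformation", "delta"): {"RC"},
    ("metadata", "extraction", "full"): {"DJ", "RC"},
    ("metadata", "extraction", "delta"): {"DJ", "RC"},
    ("metadata", "transformation", "full"): {"DJ"},
    ("metadata", "transformation", "delta"): {"DJ", "RC"},
}


def recommended_tools(step, table_type=None, mode=None):
    """Return the set of recommended tools for a pipeline step."""
    if step == "data_model_load":
        # Data model loads can only be orchestrated via Data Jobs.
        return {"DJ"}
    return MATRIX[(table_type, step, mode)]


if __name__ == "__main__":
    print(recommended_tools("transformation", "transactional", "delta"))
    print(recommended_tools("data_model_load"))
```

A set is returned rather than a single value because several cells in the matrix allow either tool.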
Note
The executions of Full Extractions (in Data Jobs) and Replications (in Replication Cockpit) should be coordinated to avoid conflicts.
We recommend using the Initialization functionality in Replication Cockpit for Full Loads instead of doing it in the Data Jobs.
When Full Loads are still executed in Data Jobs, the corresponding tables should be deactivated in the Replication Cockpit for the duration of the load. Only one Data Push Job can run per table at a time; otherwise, data can be lost.
This can be easily done by using the Replication Calendar.
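As a planning aid, the coordination rule can be checked against a list of scheduled full loads. The following is a minimal sketch under stated assumptions: `FullLoadWindow` and `tables_to_pause` are hypothetical names, the time windows are maintained by you, and nothing here calls the Replication Calendar or any other Celonis interface.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FullLoadWindow:
    """A planned Data Job full load for one table (hypothetical structure)."""
    table: str
    start: datetime
    end: datetime


def tables_to_pause(windows, replicated_tables, now):
    """Return replicated tables that have a Data Job full load running at `now`.

    These are the tables that should be deactivated in the Replication
    Cockpit so that only one push job runs per table at a time.
    """
    return sorted(
        {
            w.table
            for w in windows
            if w.start <= now < w.end and w.table in replicated_tables
        }
    )


if __name__ == "__main__":
    windows = [FullLoadWindow("EKKO", datetime(2024, 1, 1, 2), datetime(2024, 1, 1, 4))]
    print(tables_to_pause(windows, {"EKKO", "EKPO"}, datetime(2024, 1, 1, 3)))
```

The window is treated as half-open (start inclusive, end exclusive), which is an assumption of this sketch rather than documented behavior.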
On the data connection level, you can set a limit on the number of parallel extractions. This limit applies to extractions from both Data Jobs and the Replication Cockpit. Ongoing full loads can therefore delay the processing of replications.
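The effect of a shared limit can be illustrated with a toy simulation. This sketch assumes that all extractions are submitted at the same time and served first-come, first-served; the actual scheduling order in the platform is not described here, so treat the queueing policy, the function name `start_times`, and the durations as assumptions.

```python
import heapq


def start_times(jobs, limit):
    """Simulate a shared parallel-extraction limit.

    jobs:  list of (name, duration) in submission order, all submitted at t=0.
    limit: number of extractions allowed to run in parallel.
    Returns {name: start_time}, assuming a first-come, first-served queue.
    """
    slots = [0.0] * limit  # time at which each slot becomes free
    heapq.heapify(slots)
    starts = {}
    for name, duration in jobs:
        free_at = heapq.heappop(slots)  # earliest available slot
        starts[name] = free_at
        heapq.heappush(slots, free_at + duration)
    return starts


if __name__ == "__main__":
    jobs = [("DJ full EKKO", 60), ("DJ full EKPO", 90), ("RC delta BKPF", 1)]
    print(start_times(jobs, limit=2)["RC delta BKPF"])  # waits for a free slot
    print(start_times(jobs, limit=3)["RC delta BKPF"])  # starts immediately
```

With a limit of 2, the short replication has to wait until the first full load finishes; raising the limit to 3 lets it start right away.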