Skip to main content

Celonis Spark Connect

Celonis Spark Connect enables remote execution of Spark workloads on the Celonis Platform by decoupling the development environment from the backend. Use the storium-spark-connect library to run PySpark and Spark SQL from any local IDE or the Celonis ML Workbench. This architecture combines local flexibility with cloud-scale processing power.

By leveraging the Celonis Spark Connect engine, your team can build faster and scale further. Transform complex data into actionable insights within your existing development life cycle using the following capabilities:

  • Direct data pool integration: Read and write to data pools directly via Python or SQL. This bypasses the Celonis data model and Ingestion API, removing unnecessary ETL loops.

    Important

    The associated data pool must be using the ETL Engine. For more information, see Prerequisites.

  • Local development, remote execution: Use preferred IDEs like VS Code or PyCharm with local version control. Heavy computational tasks are offloaded to the Celonis Spark backend.

  • PySpark scalability: Process large datasets that traditional SQL engines cannot handle. Your data pipelines scale horizontally as your requirements grow.

Your environment must meet the following prerequisites for using Celonis Spark Connect:

  • Celonis Spark Connect can only be integrated with data pools using the ETL Engine. Before trying to integrate it, ensure your data pool is using the ETL Engine. For more information, see ETL Engine.

To begin using Celonis Spark Connect, install the client library from the Celonis private index. To do so, run the following command in your terminal:

pip install --extra-index-url=https://pypi.celonis.cloud/ storium-spark-connect

The @transform decorator is the "engine room" of the Celonis Spark Connect library. It simplifies the pipeline process by automatically managing the Spark session and handling the input and output between your local script and the Celonis data pool.

To establish a secure connection between your local environment and Celonis, you will need the following specific credentials, which are typically found in the Data Integration and Admin & Settings sections of your Celonis team:

  • CELONIS_URL: Your team’s base URL (https://your-team.celonis.cloud).

  • SCHEMA_ID: The unique ID of the Data Pool you want to read/ write data to.

  • CLIENT_ID and CLIENT_SECRET: These are generated by creating an OAuth Client in Celonis. Ensure this OAuth Client has permissions for the following scope as well as Read and Edit permissions for the respective Data Pool:

    spark_connect_permissions.png

    Important

    These credentials should be kept secure and can be managed via environment variables for production scripts.

The @transform decorator takes several arguments to define the scope of your work:

  • schema: The unique ID of the Data Pool you want to read/ write data to.

  • celonis_url: Your team’s base URL (https://your-team.celonis.cloud).

  • client_id / client_secret: Your authentication credentials.

  • source_df: The name of the table in Celonis you want to use as input.

  • output_df: The name of the table to create or update in the Data Pool.

Inside the decorated function, source_df and output_df are passed as objects that provide helper methods like .dataframe() (to get a PySpark DataFrame) or .write_dataframe() (to save results).

Example:

@transform(
    schema="your-data-poo-id",
    celonis_url="your-celonis-team-url",
    client_id="your-client-id",
    client_secret="your-client-secret",
    source_df="car_accident_dataset",
    output_df="cleaned_accidents_python",
)

You can define as many input and output tables as needed within the @transform decorator. The library uses the following naming convention to distinguish between them:

  • If the name contains source (e.g. source_df, source_orders), it uses the table as the Input table

  • If the name contains output (e.g. output_df, output_orders), it uses the table as the Output table.

Example:

@transform(
    schema="your-data-pool-id",
    source_df="table1",      # → Input
    source_orders="table2",  # → Input
    output_df="table3",      # → Output
    output_orders="table4"   # → Output
)

The schema object is a catalog explorer provided to your function. You can use it to inspect tables and columns or retrieve programmatic lists:

  • schema.show_tables(): Prints a formatted list of all tables and views in the schema.

  • schema.list_tables(): Returns a programmatic list of Spark Table objects.

  • schema.show_columns(table_name): Prints the schema, types, and nullability of a specific table.

  • schema.list_columns (table_name): Returns a list of Spark Column objects.

The @transform decorator provides input and output wrappers to interact with Celonis data. These wrappers manage the remote Spark session and handle data transfer via the following methods:

  • Reading data (Input): Use input wrappers to load data from Data Pools. These load data lazily and cache results to optimize performance. Available methods include .dataframe(), .count(), .columns(), and .show().

  • Writing data (Output): Use the write_dataframe(df, mode="...") method to save results back to Celonis. The following write modes are supported:

    • overwrite (Default): Replaces the existing table; functionally equivalent to a full load.

    • append: Adds new records to the existing table; functionally equivalent to a delta load.

Use get_spark_session directly to test logic or query data without persisting results to a table. This method provides manual control over the Celonis Spark Connect session, making it ideal for exploratory analysis and debugging.

from storium_spark_connect import get_spark_session

# Configuration
SCHEMA_ID = "your-data-pool-id"
CELONIS_URL = "your-celonis-team-url"
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"

def main():
    spark = get_spark_session(
        celonis_url=CELONIS_URL,
        schema=SCHEMA_ID,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
    )
    try:
        # Direct SQL execution against the data pool, add your SQL query here
        df = spark.sql("SELECT * FROM table LIMIT 20")
        df.show()
    finally:
        spark.stop()

Use the @transform decorator to execute standard SQL logic while leveraging the Celonis Spark Connect engine for high-performance processing. This approach allows you to perform complex data transformations and write results directly back to the Data Pool using a SQL-first workflow.

from storium_spark_connect import transform

@transform(
    schema="your-data-pool-id",
    celonis_url="your-celonis-team-url",
    client_id="your-client-id",
    client_secret="your-client-secret",    
    source_df="source_table_name",
    output_df="target_table_name",
)
def sql_transformation(schema, source_df, output_df):
    # Use Spark SQL for the transformation logic
    spark = source_df.spark
    cleaned = spark.sql("""SELECT * FROM source_table_name""")
    # Direct write-back to Celonis
    output_df.write_dataframe(cleaned)

This is the most versatile use case, utilizing the native PySpark DataFrame API for complex transformation logic. It provides maximum flexibility for programmatic data manipulation while ensuring horizontal scalability across massive datasets.

from storium_spark_connect import transform
from pyspark.sql import functions as F
from pyspark.sql import DataFrame

@transform(
    schema="your-data-pool-id",
    celonis_url="your-celonis-team-url",
    source_df="source_table_name",
    output_df="target_table_name",
)
def python_transformation(schema, source_df, output_df):
    # Convert input to a PySpark DataFrame
    df = source_df.dataframe()
    
    # Leverage PySpark's scalable API
    df = df.withColumn(
    )
    
    # Write back to Celonis Data Pool
    output_df.write_dataframe(df)
Package                            Version
---------------------------------- ---------------
aiohappyeyeballs                   2.4.4
aiohttp                            3.11.10
aiosignal                          1.2.0
annotated-doc                      0.0.4
annotated-types                    0.7.0
anyio                              4.7.0
argon2-cffi                        21.3.0
argon2-cffi-bindings               21.2.0
arro3-core                         0.6.5
arrow                              1.3.0
asttokens                          3.0.0
astunparse                         1.6.3
async-lru                          2.0.4
attrs                              24.3.0
azure-common                       1.1.28
azure-core                         1.37.0
azure-identity                     1.20.0
azure-mgmt-core                    1.6.0
azure-mgmt-web                     8.0.0
azure-storage-blob                 12.28.0
azure-storage-file-datalake        12.22.0
babel                              2.16.0
beautifulsoup4                     4.12.3
black                              24.10.0
bleach                             6.2.0
blinker                            1.7.0
boto3                              1.40.45
botocore                           1.40.45
cachetools                         5.5.1
certifi                            2025.4.26
cffi                               1.17.1
chardet                            4.0.0
charset-normalizer                 3.3.2
click                              8.1.8
cloudpickle                        3.0.0
comm                               0.2.1
contourpy                          1.3.1
cryptography                       44.0.1
cycler                             0.11.0
Cython                             3.1.5
databricks-agents                  1.9.1
databricks-sdk                     0.67.0
dataclasses-json                   0.6.7
dbus-python                        1.3.2
debugpy                            1.8.11
decorator                          5.1.1
defusedxml                         0.7.1
deltalake                          1.1.4
Deprecated                         1.2.18
distlib                            0.3.9
distro                             1.9.0
distro-info                        1.7+build1
docstring-to-markdown              0.11
executing                          1.2.0
facets-overview                    1.1.1
fastapi                            0.128.0
fastjsonschema                     2.21.1
filelock                           3.17.0
fonttools                          4.55.3
fqdn                               1.5.1
frozenlist                         1.5.0
fsspec                             2023.5.0
gitdb                              4.0.11
GitPython                          3.1.43
google-api-core                    2.28.1
google-auth                        2.47.0
google-cloud-core                  2.5.0
google-cloud-storage               3.7.0
google-crc32c                      1.8.0
google-resumable-media             2.8.0
googleapis-common-protos           1.65.0
grpcio                             1.67.0
grpcio-status                      1.67.0
h11                                0.16.0
hf-xet                             1.2.0
httpcore                           1.0.9
httplib2                           0.20.4
httpx                              0.28.1
huggingface_hub                    1.2.4
idna                               3.7
importlib_metadata                 8.5.0
iniconfig                          1.1.1
ipyflow-core                       0.0.209
ipykernel                          6.29.5
ipython                            8.30.0
ipython-genutils                   0.2.0
ipywidgets                         7.8.1
isodate                            0.7.2
isoduration                        20.11.0
jedi                               0.19.2
Jinja2                             3.1.6
jiter                              0.12.0
jmespath                           1.0.1
joblib                             1.4.2
json5                              0.9.25
jsonpatch                          1.33
jsonpointer                        3.0.0
jsonschema                         4.23.0
jsonschema-specifications          2023.7.1
jupyter_client                     8.6.3
jupyter_core                       5.7.2
jupyter-events                     0.12.0
jupyter-lsp                        2.2.5
jupyter_server                     2.15.0
jupyter_server_terminals           0.5.3
jupyterlab                         4.3.4
jupyterlab_pygments                0.3.0
jupyterlab_server                  2.27.3
jupyterlab_widgets                 1.1.11
kiwisolver                         1.4.8
langchain-core                     1.2.6
langchain-openai                   1.1.6
langsmith                          0.6.1
launchpadlib                       1.11.0
lazr.restfulclient                 0.14.6
lazr.uri                           1.0.6
litellm                            1.75.9
markdown-it-py                     2.2.0
MarkupSafe                         3.0.2
marshmallow                        3.26.2
matplotlib                         3.10.0
matplotlib-inline                  0.1.7
mccabe                             0.7.0
mdurl                              0.1.0
mistune                            3.1.2
mlflow-skinny                      3.8.1
mmh3                               5.2.0
msal                               1.34.0
msal-extensions                    1.3.1
multidict                          6.1.0
mypy-extensions                    1.0.0
nbclient                           0.10.2
nbconvert                          7.16.6
nbformat                           5.10.4
nest-asyncio                       1.6.0
nodeenv                            1.10.0
notebook                           7.3.2
notebook_shim                      0.2.4
numpy                              2.1.3
oauthlib                           3.2.2
openai                             2.14.0
opentelemetry-api                  1.39.1
opentelemetry-proto                1.39.1
opentelemetry-sdk                  1.39.1
opentelemetry-semantic-conventions 0.60b1
orjson                             3.11.5
overrides                          7.4.0
packaging                          24.2
pandas                             2.2.3
pandocfilters                      1.5.0
parso                              0.8.4
pathspec                           0.10.3
patsy                              1.0.1
pexpect                            4.8.0
pillow                             11.1.0
pip                                25.0.1
platformdirs                       4.3.7
plotly                             5.24.1
pluggy                             1.5.0
prometheus_client                  0.21.1
prompt-toolkit                     3.0.43
propcache                          0.3.1
proto-plus                         1.27.0
protobuf                           5.29.4
psutil                             5.9.0
psycopg2                           2.9.11
ptyprocess                         0.7.0
pure-eval                          0.2.2
pyarrow                            21.0.0
pyasn1                             0.4.8
pyasn1-modules                     0.2.8
pyccolo                            0.0.71
pycparser                          2.21
pydantic                           2.10.6
pydantic_core                      2.27.2
pyflakes                           3.2.0
Pygments                           2.19.1
PyGObject                          3.48.2
pyiceberg                          0.10.0
PyJWT                              2.10.1
pyodbc                             5.2.0
pyparsing                          3.2.0
pyright                            1.1.394
pyroaring                          1.0.3
pytest                             8.3.5
python-apt                         2.7.7+ubuntu5.2
python-dateutil                    2.9.0.post0
python-dotenv                      1.2.1
python-json-logger                 3.2.1
python-lsp-jsonrpc                 1.1.2
python-lsp-server                  1.12.2
pytoolconfig                       1.2.6
pytz                               2024.1
PyYAML                             6.0.2
pyzmq                              26.2.0
referencing                        0.30.2
regex                              2024.11.6
requests                           2.32.3
requests-toolbelt                  1.0.0
rfc3339-validator                  0.1.4
rfc3986-validator                  0.1.1
rich                               13.9.4
rope                               1.13.0
rpds-py                            0.22.3
rsa                                4.9.1
s3transfer                         0.14.0
scikit-learn                       1.6.1
scipy                              1.15.3
seaborn                            0.13.2
Send2Trash                         1.8.2
setuptools                         78.1.1
shellingham                        1.5.4
six                                1.17.0
smmap                              5.0.0
sniffio                            1.3.0
sortedcontainers                   2.4.0
soupsieve                          2.5
sqlparse                           0.5.5
ssh-import-id                      5.11
stack-data                         0.6.3
starlette                          0.50.0
strictyaml                         1.7.3
tenacity                           9.0.0
terminado                          0.17.1
threadpoolctl                      3.5.0
tiktoken                           0.12.0
tinycss2                           1.4.0
tokenize_rt                        6.1.0
tokenizers                         0.22.2
tomli                              2.0.1
tornado                            6.5.1
tqdm                               4.67.1
traitlets                          5.14.3
typer-slim                         0.21.1
types-python-dateutil              2.9.0.20251115
typing_extensions                  4.12.2
typing-inspect                     0.9.0
tzdata                             2024.1
ujson                              5.10.0
unattended-upgrades                0.1
uri-template                       1.3.0
urllib3                            2.3.0
uuid_utils                         0.12.0
uvicorn                            0.40.0
virtualenv                         20.29.3
wadllib                            1.3.6
wcwidth                            0.2.5
webcolors                          25.10.0
webencodings                       0.5.1
websocket-client                   1.8.0
whatthepatch                       1.0.2
wheel                              0.45.1
whenever                           0.7.3
widgetsnbextension                 3.6.6
wrapt                              1.17.0
yapf                               0.40.2
yarl                               1.18.0
zipp                               3.21.0
zstandard                          0.23.0