Connecting to Google BigQuery (extractor)

The Celonis BigQuery extractor allows you to transfer data from your Google BigQuery data warehouse to the Celonis Platform for process mining and analysis.

Prerequisites

This section details the prerequisites and background knowledge needed to use this extractor.

Before creating a connection between your database and the Celonis Platform you must decide which connection type you want to use. Except where stated in Supported database types, all databases have two basic connection types: Direct connections and Uplink connections via an on-premise extractor, as described below:

  • Direct connections: Use direct connections when you want to allow the Celonis Platform direct access to your database without additional infrastructure. This means you don't need to install, patch, or maintain on-premise extractors, which speeds up implementation, reduces complexity, and simplifies operations.

    Note

    By default, all cloud-based extractors are direct connections.

  • Uplink connections via an on-premise extractor: Use uplink connections when you don't want to or can't allow the Celonis Platform to directly access your on-premise or private cloud database. The connection between the database and Celonis is then established using an on-premise extractor that's installed within your network, ideally on a dedicated server.

    The role of the extractor is to poll and fetch job requests from the Celonis Platform and then submit the execution information to the database as an SQL query. Once the data is retrieved from the database, the extractor fetches it and sends it back to the Celonis Platform. As such, the connection between the database and the Celonis Platform is always initiated by the extractor, which continuously queries the Celonis Platform for any extractions to execute.

    Note

    To use an uplink connection, you must install an on-premise extractor in your environment. To do so, see Setting up. Additionally, if you want to use a proxy (optional), see Proxy settings for on-prem clients.

When using an uplink connection:

  • Download the latest JDBC package from the Celonis Download Portal: Updating the on-premise JDBC extractor.

  • Download the Google BigQuery Jar: Google JDBC drivers.

  • Place all jar files in a folder, and then run:

    java -Dloader.path=<insert_name_of_folder_of_jars> -jar connector-jdbc.jar serve

The next step is to modify your network settings to allow the database extractor to communicate with Google BigQuery and the Celonis Platform.

The settings here are based on the connection type you are using:

Network settings for direct connections

  • Celonis Platform → Google BigQuery (port 443, TCP): JDBC connection from the Celonis Platform to the database. The port is the one you normally use to connect to the database. The IPs of the Celonis Platform depend on the cloud cluster (which can be seen in the URL).

Network settings for uplinked connections

The following network settings apply for uplinked connections (via the on-premise extractor):

  • On-premise extractor server → Google BigQuery (port 443, TCP): JDBC connection from the on-premise extractor server to the database. The port is the one you normally use to connect to the database.

  • On-premise extractor server → Celonis Platform (port 443, TCP): HTTPS connection from the on-premise extractor server to the Celonis cloud endpoint. The IPs of the Celonis Platform depend on the cloud cluster (which can be seen in the URL).

JDBC string guidelines

This section describes the guidelines for using custom JDBC strings in extractor configurations:

  • Authentication: The Credentials fields in the extractor configuration are required and always used to authenticate the connection. Do not embed credentials directly in your JDBC string.

  • Encryption: For standard (unencrypted) extractors (examples: SAP HANA, PostgreSQL), you can enable encryption by adding encrypt=true to the JDBC string. For encrypted extractors (examples: SAP HANA encrypted, PostgreSQL encrypted), connections are established with encryption enabled (encrypt=true) by default. You do not need to include this parameter in your JDBC string.

  • Certificate validation: Do not include validateCertificate=true in your JDBC strings. Instead, use Advanced Settings > Validate Certificate > Enabled.

  • Additional properties: You can include additional properties in either the JDBC string or the Additional Properties field. Do not specify the same properties in both places.
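As an illustration, these guidelines can be checked mechanically before a string is pasted into the extractor configuration. The following Python sketch is hypothetical (the function name and the set of flagged parameters are assumptions, not part of the Celonis product):

```python
def jdbc_string_violations(jdbc: str) -> list[str]:
    """Heuristically flag guideline violations in a custom JDBC string."""
    issues = []
    lowered = jdbc.lower()
    # Credentials belong in the Credentials fields, not in the JDBC string.
    for key in ("user=", "username=", "password="):
        if key in lowered:
            issues.append(f"embedded credential parameter: {key.rstrip('=')}")
    # Certificate validation is controlled via Advanced Settings, not the string.
    if "validatecertificate=" in lowered:
        issues.append("validateCertificate set in JDBC string")
    return issues

print(jdbc_string_violations(
    "jdbc:postgresql://db:5432/erp;encrypt=true;validateCertificate=true"))
# → ['validateCertificate set in JDBC string']
```

A check like this only catches the obvious cases; duplicated properties between the JDBC string and the Additional Properties field still need to be reviewed by hand.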

This extractor supports the authentication methods described in the following sections.

Service account authentication

  1. Before configuring the connection between your Google BigQuery account and the Celonis Platform, you must create a dedicated service account for the connection. To learn how to create and manage a service account, see Google Cloud documentation – Creating a service account.

  2. After creating the service account, assign the following IAM roles to ensure Celonis can access and query your data:

    • BigQuery Job User (apply at the project level)

    • BigQuery Read Session User (apply at the project level)

    • BigQuery Data Viewer (apply at the table, dataset, or project level)

    • BigQuery Metadata Viewer (apply at the dataset or project level if Data Viewer is not granted)

    For more information about IAM roles, see Google Cloud documentation – BigQuery IAM roles and permissions.

  3. Log in to your Google Cloud Platform (GCP) account and enable the following APIs to allow Celonis to access BigQuery:

    • BigQuery API

    • BigQuery Storage API

  4. During configuration in Celonis, you will need the following values from your Google Cloud setup:

    • Service account email – The email address of the service account created for Celonis (for example, celonis-extractor@your-project-id.iam.gserviceaccount.com). This will be the value of Credentials > Service Account Authentication > Service Account Email ID.

    • Service account key file (JSON) – The private key file downloaded when you created the service account. This file is used for authentication in Celonis. This will be the value of Credentials > Service Account Authentication > Service Account Credentials (JSON).
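As a sketch, the downloaded key file can be sanity-checked before pasting it into the Service Account Credentials (JSON) field. The field names below are the standard Google service-account key-file fields; the sample values are placeholders:

```python
import json

# Standard fields in a Google service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email", "token_uri"}

def missing_key_fields(raw_json: str) -> list[str]:
    """Return the required key-file fields that are absent from the JSON."""
    return sorted(REQUIRED_FIELDS - json.loads(raw_json).keys())

sample = json.dumps({
    "type": "service_account",
    "project_id": "your-project-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...placeholder...\n-----END PRIVATE KEY-----\n",
    "client_email": "celonis-extractor@your-project-id.iam.gserviceaccount.com",
    "token_uri": "https://oauth2.googleapis.com/token",
})
print(missing_key_fields(sample))  # → []
```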

OAuth 2.0 authentication

  1. Before configuring the connection between your Google BigQuery account and the Celonis Platform, you must enable the required Google APIs and create OAuth 2.0 client credentials. This authentication method uses the Simba Google BigQuery JDBC driver to establish the connection.

  2. Log in to your Google Cloud Platform (GCP) account and enable the following APIs:

    • BigQuery API

    • BigQuery Storage API

  3. After enabling the APIs, configure OAuth 2.0 credentials for the project that hosts your BigQuery datasets:

    1. In the Google Cloud Console, navigate to APIs & Services → Credentials and click Create Credentials.

    2. Select OAuth client ID.

    3. Choose Web application as the application type.

    4. Enter a name for the OAuth client and add the following authorized redirect URI:

      https://auth.redirect.celonis.cloud/bigquery_redirect
                
    5. Save the OAuth client. Once saved, reopen it to view and copy the generated credentials.

    6. Copy both the Client ID and Client secret. You will need these values when configuring the connection in the Celonis Platform.

  4. During configuration in Celonis, you will need the following values from your Google Cloud setup:

    • Client ID – The OAuth client ID created in Google Cloud. This will be the value of Credentials > OAuth > Client ID.

    • Client secret – The OAuth client secret created in Google Cloud. This will be the value of Credentials > OAuth > Client Secret.

When extracting large tables from your Google BigQuery account, you might encounter the following error message:

"message": "Response too large to return. Consider specifying a destination table in your job configuration."
  

To resolve this, add the following JDBC connection parameters to your BigQuery connection configuration:

  • LargeResultDataset=<dataset>

  • LargeResultTable=<table>

To use these parameters, ensure the following:

  • The specified <dataset> and <table> exist in your Google Cloud project.

  • The extraction user defined in your Celonis credentials configuration has the necessary permissions to write results to the specified table. For more information, see Google Cloud – Writing query results.

For more details about JDBC driver properties, see Google Cloud – Current JDBC drivers.
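If you maintain the connection as a custom JDBC string, the two parameters are appended like any other driver property. A minimal Python sketch (the dataset and table names are placeholders):

```python
def with_large_result_params(jdbc: str, dataset: str, table: str) -> str:
    """Append the large-result destination parameters to a JDBC string."""
    return f"{jdbc};LargeResultDataset={dataset};LargeResultTable={table}"

url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project-id"
print(with_large_result_params(url, "celonis_staging", "large_results"))
# → jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project-id;LargeResultDataset=celonis_staging;LargeResultTable=large_results
```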

Configuring the Google BigQuery extractor

This section describes the basic setup of the Google BigQuery extractor. To configure the extractor:

  1. In the Celonis Platform left navigation, select Data > Data Integration.

  2. On the Data Pools screen, select the data pool you want to use for the extraction.

    Note

    If you do not have a data pool to use for this extraction, see Creating and managing data pools for instructions on how to create one.

  3. In the Data Integration section, select Connect to Data Source.

  4. In the Add Data Connection window, select Connect to Data Source.

  5. In the Connect to Data Source window, depending on your use case, select either Database – On Premise or Database – Cloud.

    Note

    Select Database – On Premise to connect to on-premise or private cloud databases.

    1. If you selected Database – On Premise, follow the on-screen instructions.

  6. In the New Database Data Connection window, fill in the following information:

    1. For Name, provide a name for this configuration.

    2. For Database Type, select Google BigQuery.

    3. For Connection Type, select either Standard or Custom JDBC Connection String.

      1. If you selected Standard:

        • For Host, enter:

          https://www.googleapis.com/bigquery/v2
        • For Port, provide the port to connect to (Default is 443).

        • For Database Name, enter the Project ID of your Google BigQuery project (not the project name).

        • (Optional) For Schema Name, enter the name of the dataset that contains the tables to extract.

        • (Optional) For Additional Properties, enter any additional connection properties required by your database or driver. Separate each with ;.

      2. If you selected Custom JDBC Connection String:

        Important

        When using JDBC strings, there are specific guidelines to follow. For more information, see the JDBC string guidelines earlier in this topic.

        • For JDBC Connection String, provide your string. Use the format:

          jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<project_id>;property1=value1;property2=value2...

          Note

          For more information on connecting to Google BigQuery with JDBC strings, see the Google BigQuery documentation.

        • Optionally, provide values for:

          • Schema Name: Enter the name of the dataset that contains the tables to extract.

          • Additional Properties: Enter any additional connection properties required by your database or driver. Separate each with ;.

    4. For Credentials, provide the username and password of the database user for this connection.

      Note

      Ensure this database user has sufficient permissions to access the data to be extracted.

    5. If desired, select Advanced Settings, and update these parameters as needed.

      Note

      The Advanced Settings > Validate Certificate parameter (Default: DISABLED) controls whether the extractor validates the server's SSL/TLS certificate:

      • Disabled: Disables certificate validation (validateCertificate=false).

      • Enabled: Enforces certificate validation (validateCertificate=true).

      • Removed: Uses the driver’s default behavior. Check the driver documentation to confirm the default.

  7. Select the Test Connection button to confirm the extractor can connect to the host system. If the test fails, adjust the data in the configuration fields as needed.

  8. Once the test connection passes, select the Save button to continue. This returns you to the Data Integration window.
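For reference, the custom JDBC connection string format shown in step 6 can also be assembled programmatically. This Python sketch is illustrative only; the project ID and properties are placeholders:

```python
def bigquery_jdbc_url(project_id: str, properties=None) -> str:
    """Build a BigQuery JDBC string in the format shown in step 6."""
    base = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId={project_id}"
    if properties:
        # Additional driver properties are separated with semicolons.
        base += ";" + ";".join(f"{k}={v}" for k, v in properties.items())
    return base

print(bigquery_jdbc_url("my-project-id", {"Timeout": "300"}))
# → jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project-id;Timeout=300
```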