Data format support
Apache Kafka is agnostic to the data payload: each message is just an array of bytes. Those bytes can carry any of the following formats:
| Data format | Supported | Schema |
|---|---|---|
| AVRO | Yes | Yes |
| PROTOBUF | Yes | Yes |
| JSON | Yes | Inferred |
| JSON with schema | Yes | Yes |
| XML | Yes | Inferred |
For AVRO and PROTOBUF, it is expected that a schema registry solution is already in place.
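As an illustration, this is roughly how a sink connector instance could be pointed at AVRO data backed by a Confluent-style schema registry. The connector name, class, topic, and registry URL below are placeholders for this sketch; the converter properties themselves are the standard Kafka Connect settings:

```json
{
  "name": "ems-sink-avro",
  "config": {
    "connector.class": "com.celonis.kafka.connect.ems.sink.EmsSinkConnector",
    "topics": "orders",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

For PROTOBUF, the same configuration applies with `io.confluent.connect.protobuf.ProtobufConverter` as the value converter.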
The JSON data format needs a bit of attention. While it is simple and human-readable, it is far from ideal for enforcing data quality and schema validation.
Schema mismatches
Every record is associated with a schema, either explicit or inferred by the connector. This record schema may differ from the schema of the target table in Celonis Platform. Only two kinds of schema evolution are allowed:
- Omitting a column that is not a primary key (in Celonis Platform, every column is nullable unless it is part of a primary key)
- Adding a new column
Any other kind of mismatch causes the batch containing that record to fail insertion into Celonis Platform, as the example below illustrates.
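For illustration, assume a target table with columns `id` (the primary key), `name`, and `age`; the field names are made up for this example. The first two records below would be accepted, the third would not:

```json
{ "id": "1", "name": "Alexandra" }
{ "id": "2", "name": "Maria", "age": 30, "city": "Berlin" }
{ "id": "3", "name": "Jon", "age": "thirty" }
```

The first record omits the non-primary-key column `age`, and the second adds a new column `city`; both are allowed evolutions. The third changes the type of `age` from an integer to a string, which is neither of the allowed evolutions, so the whole batch containing that record fails.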
Schemaless formats
For the schemaless JSON and XML data formats, no schema is available, so the connector infers one for every record coming from the Kafka source.
This is not ideal for enforcing data quality and schema validation, and it may lead to unexpected behavior.
Each Kafka message is self-contained: the information it carries does not depend on previous messages. As a result, the connector can infer the schema only at the message level. A JSON document may carry a field like "address": null, or the nullable field may be omitted from the payload entirely, for example for performance reasons. In such cases there is no way to infer the field's type correctly, and tracking types across messages is not a bulletproof solution. Consider these two messages:
{ "firstName": "Alexandra", "lastName": "Jones", "address": null }
{ "firstName": "Alexandra", "lastName": "Jones", "address": null, "age": 32 }
In some use cases, the JSON payload can contain the schema itself; as stated in the table above, support for this variant is better. Here is an example of a JSON document with an embedded schema that the Kafka Connect JsonConverter can interpret:
{ "schema": { "type": "struct", "fields": [ { "type": "int64", "optional": false, "field": "registertime" }, { "type": "string", "optional": false, "field": "userid" }, { "type": "string", "optional": false, "field": "regionid" }, { "type": "string", "optional": false, "field": "gender" } ], "optional": false, "name": "ksql.users" }, "payload": { "registertime": 1493819497170, "userid": "User_1", "regionid": "Region_5", "gender": "MALE" } }