While the actual data in an AVRO formatted Datastore is essentially identical to that in a JSON Type Datastore, the two look very different because the Schema, or structure of the data, is provided separately from the data itself. There are two benefits to using AVRO rather than typical JSON:
- AVRO formatted data is very compact. Unlike simple JSON, the AVRO data payload contains only data. The Name portion of each name/value pair is provided in a separate schema record, which eliminates both the characters in the names and the repetition of those names in every record written, making for much smaller serialized data records.
- Perhaps more important is the compatibility AVRO can provide as source Data Descriptions evolve. This compatibility, however, requires a method for managing the schemas. Several commercial products provide "Schema Registry" functionality, including Confluent and HortonWorks. Once a registry has been fully implemented, downstream "consumers" of the data are provided the schema in effect at the point in time the data was captured, including both the old and new schemas when processing CDC data, so differences may be resolved symbolically, using field names.
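The size difference described in the first point can be illustrated with a minimal Python sketch. It hand-encodes one record using the binary encoding rules from the Apache Avro specification (zigzag varints for `long`, length-prefixed UTF-8 for `string`, fields written in schema order) and compares the result with the same record serialized as JSON. The schema, field names and values are hypothetical, chosen only for illustration; they are not output from the Apply Engine, and a real application would use an Avro library rather than this hand-rolled encoder.

```python
import json

# Hypothetical schema and record, loosely modeled on a DEPT table.
# Field names and values are illustrative assumptions, not product output.
SCHEMA = {
    "type": "record",
    "name": "DEPT",
    "fields": [
        {"name": "DEPTNO", "type": "string"},
        {"name": "DEPTNAME", "type": "string"},
        {"name": "MGRNO", "type": "string"},
        {"name": "BUDGET", "type": "long"},
    ],
}
RECORD = {
    "DEPTNO": "A00",
    "DEPTNAME": "SPIFFY COMPUTER SERVICE DIV.",
    "MGRNO": "000010",
    "BUDGET": 150000,
}


def encode_long(value: int) -> bytes:
    """Encode an Avro long: zigzag first, then a base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag > 0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode an Avro string: byte length as a long, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate the encoded field values in schema order; no names are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]]) for field in schema["fields"]
    )


if __name__ == "__main__":
    json_bytes = json.dumps(RECORD).encode("utf-8")
    avro_bytes = encode_record(SCHEMA, RECORD)
    print("JSON bytes:", len(json_bytes))
    print("AVRO bytes:", len(avro_bytes))
```

Because the field names live only in the schema, the saving grows with the number of records written and with the length of the names.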
Customers starting out with simple JSON target datastores often decide to switch to AVRO for cost or performance reasons. Testing changes, however, is often simpler with JSON because it is so much easier for a human to read. The Apply Engine facilitates switching back and forth by requiring only the DATASTORE "OF" parameter setting to be changed. Given the potential for eventual migration to AVRO, Precisely highly recommends use of the Engine OPTION USE AVRO COMPATIBLE NAMES.
------------------------------------------------------------
-- DATASTORE SECTION
------------------------------------------------------------
-- SOURCE DATASTORE
DATASTORE ./DB0A.ENGINE3.DEPT.COPY
OF UTSCDC
AS CDCIN
DESCRIBED BY GROUP SOURCE_TABLES;
-- TARGET DATASTORE
DATASTORE kafka://[<hostname>[:<port_number>]] / [<kafka_topic_id>][/ | /<partition> | /key | /root_key]
OF AVRO FORMAT [CONFLUENT TOMBSTONE | CONTAINER | PLAIN]
AS TARGET
KEY IS DEPTNO, MGRNO
DESCRIBED BY GROUP SOURCE_TABLES;
Keyword | Description |
---|---|
kafka://[<hostname>[:<port_number>]] / [<kafka_topic_id>] [/ \| /<partition> \| /key \| /root_key] | URL syntax type that specifies a Kafka target. Optionally identifies a specific Kafka Broker host name, TCP/IP port, unique Kafka Topic ID and partition. Note: Various options are available for specifying the Kafka URL, including a list of Brokers specified at runtime, dynamically generated Topic IDs and source key based partitioning; see Kafka Datastores. |
OF AVRO | Specifies that the target datastore utilizes AVRO schemas, which are based on JSON but separate the data structure description from a much more compact data payload. AVRO Type Datastores also provide for schema evolution driven by changes in the source datastore Descriptions and/or custom target datastore Descriptions. For more details see AVRO Type Datastores and JSON Type Datastores below, and the OPTION: USE AVRO COMPATIBLE NAMES. |
AS <datastore_alias> | Specifies the alias name of the datastore, which may contain only dash (-) and underscore (_) separators. All references to this datastore in the Apply Engine script will be through this alias name. Though neither required nor a default, the alias CDCIN is used for source datastores by convention in most CDC Apply Engine scripts regardless of source or platform. Adopting this convention will simplify development, maintenance and diagnosis across multiple implementations and applications. CDCIN will be used in this manner throughout all product documentation. Similar conventions such as TARGET will often be used for target or output datastores, particularly when there is a single or primary "target" datastore. |
DESCRIBED BY GROUP <group_name> | Specifies the <group_name>, previously defined by a BEGIN GROUP command, that contains one or more DESCRIPTION entries. |
[FORMAT [CONFLUENT \| CONTAINER \| PLAIN]] | Specifies the method of providing the Schema. |
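As a rough illustration of what "providing the Schema" can mean on the wire, the following Python sketch frames an AVRO payload in the layout published by Confluent for Schema Registry clients: a zero "magic" byte, a 4-byte big-endian schema ID, then the AVRO-encoded data. It is an assumption that FORMAT CONFLUENT produces this layout; the byte layout is not stated in this document, and the schema ID and payload below are made-up values. A PLAIN payload, by contrast, would presumably carry no such prefix.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # Confluent wire-format marker (assumed to apply to FORMAT CONFLUENT)


def frame_confluent(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an AVRO payload with the magic byte and a big-endian 4-byte schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe_confluent(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < 5:
        raise ValueError("message too short to contain a schema ID header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, message[5:]


if __name__ == "__main__":
    # Hypothetical schema ID and payload, for illustration only.
    framed = frame_confluent(42, b"\x06A00")
    print(framed.hex())
    print(unframe_confluent(framed))
```

A consumer would use the extracted schema ID to fetch the matching schema from the registry before decoding the payload.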
Connect CDC SQData V4 supports schema evolution; today, however, that support extends only to the Replicator Engine, which automatically updates a Confluent Schema Registry on the fly when the relational source catalog changes, so that both the new metadata and the published CDC records reflect the change. Fortunately, the implementation of a data structure change by a source system is typically an event requiring careful planning and a scheduled implementation. That process also accommodates the testing and manual intervention currently required by the Apply Engine when a Source or custom Target schema changes:
- Receive notification of pending change to source description and implementation date.
- Determine if change will materially affect downstream processing. If for example a column is added to a table or a field added to the end of an IMS segment and neither will affect downstream processing, the change may be ignored.
- Modify the source Description to reflect the new source table/record layout.
- Reparse the Apply Engine script with the new source Description, taking care not to replace the current "PRODUCTION" version of the .prc file. Each time the script is parsed, the Confluent Schema Registry will be queried and a new version added if needed. In the early stages of script development, or when many parses are expected before something workable is ready, it may be desirable to comment out the OPTIONS parameter CONFLUENT REPOSITORY <registry_url>. Copies of the generated schemas will be written to the Working Directory and named "<datastore>.schema.json".
- Test the revised script.
- Schedule the "production" implementation of the revised script and Schema ID to occur at the same time as the implementation of those changes to the Source system. It will be critical to ensure that all existing captured data has been processed before implementing the new script.
- Stop the Apply Engine.
- Reparse the "Production" engine containing the revised source description.
- Start the Apply Engine and confirm that captured data for the modified Source is processed correctly and is accepted by the Target datastore.
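The step of determining whether a change will materially affect downstream processing can be supported by comparing the old and new AVRO schemas by field name. The Python sketch below is one possible helper, not part of the product: it reports added, removed and retyped fields, and flags added fields that lack a `default`, since the Avro schema resolution rules treat a reader field with no default and no counterpart in the writer's schema as an error. The two schemas are hypothetical; in practice they might be loaded from the generated "<datastore>.schema.json" files, assuming those files contain a single AVRO record schema.

```python
def field_map(schema: dict) -> dict:
    """Index a record schema's fields by name."""
    return {field["name"]: field for field in schema["fields"]}


def compare_schemas(old: dict, new: dict) -> dict:
    """Summarize field-level differences between two AVRO record schemas."""
    old_fields, new_fields = field_map(old), field_map(new)
    added = [name for name in new_fields if name not in old_fields]
    removed = [name for name in old_fields if name not in new_fields]
    retyped = [
        name
        for name in new_fields
        if name in old_fields and new_fields[name]["type"] != old_fields[name]["type"]
    ]
    # Added fields need a default if new-schema readers must still read old data.
    added_without_default = [n for n in added if "default" not in new_fields[n]]
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "added_without_default": added_without_default,
    }


if __name__ == "__main__":
    # Hypothetical before/after schemas: a LOCATION column is added to DEPT.
    old_schema = {
        "type": "record",
        "name": "DEPT",
        "fields": [
            {"name": "DEPTNO", "type": "string"},
            {"name": "MGRNO", "type": "string"},
        ],
    }
    new_schema = {
        "type": "record",
        "name": "DEPT",
        "fields": [
            {"name": "DEPTNO", "type": "string"},
            {"name": "MGRNO", "type": "string"},
            {"name": "LOCATION", "type": ["null", "string"], "default": None},
        ],
    }
    print(compare_schemas(old_schema, new_schema))
```

An empty `removed`, `retyped` and `added_without_default` result suggests, under these assumptions, that consumers resolving by field name can read both old and new records; anything else deserves closer review before the change is scheduled.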