While the very nature of Kafka and its cluster-based architecture is to accommodate large volumes of data from Producers, issues can still occur. Various factors may make a contingency plan for a Kafka cluster desirable or even a requirement. One such contingency scenario is described below.
The principal issue associated with an extended Kafka outage traces back to the source of the data, where Connect CDC SQData Capture and Publishing occur. High end-to-end throughput is achieved by carefully managing the captured data and avoiding the I/O operations required when that transient data must be "landed", in other words written to disk, before it is consumed and written to its eventual target, in this case Kafka.
When the Apply or Replicator Engine is unable to write to Kafka, the captured data must eventually be held and/or its capture slowed at the source. That can become problematic, particularly when the source itself generates a very high volume of Change Data. When an Engine stops, data cannot be published; committed units-of-work (UOWs) continue to be captured, and data ordinarily held in memory must be written to a transient storage area. Depending on the Capture, that may be a z/OS high performance LogStream or disk storage dedicated to this purpose.

Eventually the transient data area will be exhausted. When that happens, the Capture will slow and eventually stop reading the database log. This is considered an acceptable state: not normal or desirable, but manageable. The real problem arises when the source database log files are archived, moved to slower storage, or even deleted. When that happens, the Capture will be unable to continue from where it left off and data loss becomes a real possibility. Talk to Precisely about the best practices we recommend for log archiving.
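In Connect CDC SQData this backpressure handling is built into the Capture and its transient storage, but the general pattern is easy to picture in plain Kafka client terms. The sketch below is illustrative only, not the SQData implementation: a producer that tries to publish each record and, when the cluster cannot acknowledge it within a deadline, "lands" the record in a local spill file instead. The topic name, spill path, and timeout values are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Illustrative sketch of the backpressure pattern described above: publish to
 * Kafka when possible, land records to a local spill file when the cluster
 * cannot acknowledge them. NOT the Connect CDC SQData implementation.
 */
public class SpillingPublisher implements AutoCloseable {
    private static final String TOPIC = "cdc.changes";             // hypothetical topic
    private static final String SPILL_PATH = "/var/tmp/cdc.spill"; // hypothetical transient area

    private final Producer<byte[], byte[]> producer;
    private final DataOutputStream spill;

    public SpillingPublisher(String bootstrapServers) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        props.put("acks", "all");                    // a record counts as delivered only when replicated
        props.put("delivery.timeout.ms", "30000");
        this.producer = new KafkaProducer<>(props);
        this.spill = new DataOutputStream(new FileOutputStream(SPILL_PATH, true)); // append mode
    }

    /** Try to publish; if Kafka is unreachable or the send fails, land the record on disk. */
    public void publish(byte[] key, byte[] value) throws IOException {
        try {
            // Block for the acknowledgement so an outage is detected promptly.
            producer.send(new ProducerRecord<>(TOPIC, key, value)).get(10, TimeUnit.SECONDS);
        } catch (TimeoutException | ExecutionException e) {
            land(key, value);                        // cluster unavailable: spill to disk
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            land(key, value);
        }
    }

    /** Length-prefixed append so record boundaries survive for later replay. */
    private void land(byte[] key, byte[] value) throws IOException {
        spill.writeInt(key.length);
        spill.write(key);
        spill.writeInt(value.length);
        spill.write(value);
        spill.flush();
    }

    @Override
    public void close() {
        producer.close();
        try { spill.close(); } catch (IOException ignored) { }
    }
}
```

Blocking on each acknowledgement makes an outage visible immediately at the cost of throughput; a production design would send asynchronously and detect failures from callbacks, which is the kind of trade-off the Engine and Capture manage internally.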
Two approaches are available to address this scenario:

- Use SQDUtil
- Use a special Engine and Kafka utility
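Whichever tool is used, the recovery step conceptually amounts to replaying the data landed during the outage back into Kafka, in its original order, once the cluster is available again. The sketch below illustrates that replay idea only; it is not SQDUtil or the Precisely Engine/Kafka utility, and it assumes the offloaded data was written as length-prefixed key/value records to a local file. The broker address, topic, and file path are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.util.Properties;

/**
 * Illustrative sketch: replay length-prefixed records landed during an outage
 * back into Kafka in their original order. NOT SQDUtil or the Precisely
 * Engine/Kafka utility; names and paths are hypothetical.
 */
public class SpillReplayer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");      // hypothetical cluster address
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        props.put("acks", "all");
        props.put("enable.idempotence", "true");              // avoid duplicates on retry

        try (Producer<byte[], byte[]> producer = new KafkaProducer<>(props);
             DataInputStream in = new DataInputStream(new FileInputStream("/var/tmp/cdc.spill"))) {
            while (true) {
                byte[] key, value;
                try {
                    key = new byte[in.readInt()];
                    in.readFully(key);
                    value = new byte[in.readInt()];
                    in.readFully(value);
                } catch (EOFException eof) {
                    break;                                    // end of (or truncated) spill file
                }
                producer.send(new ProducerRecord<>("cdc.changes", key, value));
            }
            producer.flush();                                 // ensure every record is acknowledged
        }
    }
}
```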