Confluent 5.0.0 kafka connect hdfs sink: Cannot describe the kafka connect consumer group lag after upgrade
We upgraded from Confluent 4.0.0 to 5.0.0. After upgrading, we can no longer see the consumer group lag for the Kafka Connect HDFS sink connector.
$ /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <hostname>:9092 --list | grep scribe_log_backend
Note: This will not show information about old Zookeeper-based consumers.
connect-kconn_scribe_log_backend
$ /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <hostname>:9092 --group connect-kconn_scribe_log_backend --describe
Note: This will not show information about old Zookeeper-based consumers.
$
Were there any changes to the consumer group command in Kafka 2.0 / Confluent 5.0.0? How do I track the lag? We need to alert based on this lag.
Our brokers run Kafka version 1.1.0.
We also cannot see the Connect consumer group in Kafka Manager after the upgrade.
There is no issue with Kafka Connect itself, as the connector is able to write to HDFS.
Thanks.
1 Answer
The Kafka Connect HDFS connector no longer commits consumer offsets, so there is nothing on which to base a lag calculation.
PS. Recovery is based on the file naming convention in HDFS; the file names contain the partition and offset information.
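For illustration, this is roughly what the connector's output looks like on HDFS (the path assumes the default topics.dir of /topics and the default partitioner, and the offsets shown are made up):

$ hdfs dfs -ls /topics/scribe_log_backend/partition=0
  .../scribe_log_backend+0+0000001000+0000001999.avro

Each file name encodes <topic>+<partition>+<startOffset>+<endOffset>, so the highest end offset per partition is the last offset the connector has durably written, and it is the number you can compare against the broker's latest offset.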
I think we have to change the alerting logic to compute the lag per Kafka partition as the difference between the topic's latest offset on the brokers and the maximum offset written by the connector to HDFS for that partition (a sketch of this calculation follows this comment).
– Rupesh More
Sep 22 '18 at 21:42
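A minimal sketch of that per-partition lag calculation, assuming the default file naming shown above; the broker address, topic name, HDFS path and temp files are placeholders:

# Latest offset per partition on the brokers (output: topic:partition:offset).
/opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list <hostname>:9092 --topic scribe_log_backend --time -1 \
    | sort -t: -k2,2 > /tmp/latest_offsets

# Highest end offset per partition committed to HDFS, parsed from the file names.
hdfs dfs -ls -R /topics/scribe_log_backend | awk -F'/' '{print $NF}' \
    | awk -F'+' 'NF==4 { split($4, a, "."); if (a[1]+0 > m[$2]) m[$2] = a[1]+0 }
                 END   { for (p in m) print p ":" m[p] }' \
    | sort -t: -k1,1 > /tmp/hdfs_offsets

# Approximate lag per partition = broker latest offset - last offset written to HDFS.
join -t: -1 2 -2 1 /tmp/latest_offsets /tmp/hdfs_offsets \
    | awk -F: '{ print "partition " $1 " lag " ($3 - $4) }'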
Hi Rupesh, sorry for the delay in replying. We had a similar issue with the Confluent 4 platform: Connect HDFS tasks would fail and not recover because of active leases on WAL files in HDFS. A sure way of reproducing this is to restart the HDFS name node. We had alerting set up based on partitions created in the Hive metastore, i.e. if we don't see a new partition for the current hour within a few minutes of the hour rolling over, we alert. We haven't seen this problem on the Confluent 5.0 platform though, and have deprecated these alerts.
– Andrei Leibovski
Oct 19 '18 at 20:35
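A rough sketch of that kind of Hive metastore check, assuming hourly partitions whose names contain the current date and hour (the table name and partition naming format are assumptions and will differ per deployment):

# Alert if no Hive partition exists yet for the current hour.
EXPECTED=$(date +%Y-%m-%d-%H)    # adjust to match your partition naming
hive -e "SHOW PARTITIONS scribe_log_backend" | grep -q "$EXPECTED" \
    || echo "ALERT: no Hive partition for hour $EXPECTED yet"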
Hi @AndreiLeibovski, we are still seeing this issue with connect-hdfs 5.0.0. A restart of the datanode starts causing issues with acquiring the lease on the WAL. recoverLease doesn't help. Any suggestions for overcoming this from the Connect side?
– swamoch
Mar 1 at 9:16
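For reference, the lease-recovery attempt mentioned above typically looks like the following; the WAL path is an assumption based on the connector's default logs.dir layout:

$ hdfs debug recoverLease -path /logs/scribe_log_backend/0/log -retries 3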
Hi Andrei, thank you for your response. I think even with the latest Confluent 5.0.0 we face issues like github.com/confluentinc/kafka-connect-hdfs/issues/372 in production, where we have seen data loss for the impacted Kafka partitions: after a connector task fails, another task cannot acquire the lease on the WAL file, leading to continuous connector task failures. We alert when this connector lag crosses a threshold so that we can manually delete these WAL log files, after which the impacted partition recovers (the manual step is sketched below).
– Rupesh More
Sep 22 '18 at 21:32
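The manual workaround described above, sketched for a single affected partition; the WAL path assumes the connector's default logs.dir layout, and the Connect host and task id are placeholders, so double-check the path before deleting anything:

# Remove the stuck WAL file for the affected partition.
hdfs dfs -rm /logs/scribe_log_backend/<partition>/log
# Then restart the failed task via the Connect REST API so it recreates the WAL.
curl -X POST http://<connect-host>:8083/connectors/kconn_scribe_log_backend/tasks/<task-id>/restart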