Confluent 5.0.0 kafka connect hdfs sink: Cannot describe the kafka connect consumer group lag after upgrade




We upgraded from Confluent 4.0.0 to 5.0.0. After upgrading, we can no longer describe the Kafka Connect HDFS sink connector's consumer group lag.


$ /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <hostname>:9092 --list | grep scribe_log_backend
Note: This will not show information about old Zookeeper-based consumers.
connect-kconn_scribe_log_backend
$ /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server <hostname>:9092 --group connect-kconn_scribe_log_backend --describe
Note: This will not show information about old Zookeeper-based consumers.
$



Were there any modifications to the consumer group command in Kafka 2.0 / Confluent 5.0.0? How do I track the lag? We need to alert based on it.
Our brokers run Kafka version 1.1.0.
We also cannot see the Connect consumer group in Kafka Manager after the upgrade.
There is no issue with Kafka Connect itself, as the connector is still able to write to HDFS.
Thanks.




1 Answer



As of Confluent 5.0.0, the Kafka Connect HDFS sink connector no longer commits consumer offsets back to Kafka, so there is nothing for kafka-consumer-groups.sh to base a lag calculation on.



PS.
Recovery is based on the file naming convention in HDFS: the file names carry the Kafka partition and offset information.
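As a sketch of that convention: by default the connector writes data files named `<topic>+<partition>+<startOffset>+<endOffset>.<extension>`, so the highest committed offset per partition can be recovered by parsing file names. The regex and helper below are illustrative, not part of the connector:

```python
import re

# Assumed default file naming of the HDFS sink connector:
#   <topic>+<kafkaPartition>+<startOffset>+<endOffset>.<extension>
FILENAME_RE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)

def parse_committed_file(name):
    """Return (topic, partition, start_offset, end_offset) for a
    committed data file, or None if the name doesn't match."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    return (m.group("topic"), int(m.group("partition")),
            int(m.group("start")), int(m.group("end")))
```

Listing the connector's output directory and taking the maximum `end_offset` per partition gives the last offset known to be safely written.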






Hi Andrei, thank you for your response. I think even with the latest Confluent 5.0.0 we face issues like github.com/confluentinc/kafka-connect-hdfs/issues/372 in production, where we have seen data loss for the impacted Kafka partitions: after a connector task fails, another task cannot acquire the lease on the log file, leading to continuous connector task failures. We use this connector lag, when it crosses a threshold, to alert us so that we can manually delete these WAL log files, after which the impacted partition recovers.

– Rupesh More
Sep 22 '18 at 21:32







I think we have to change the alerting logic to compute the lag as the difference between the max offset written by the connector to HDFS per Kafka partition and the corresponding latest offset of the Kafka topic for that partition.

– Rupesh More
Sep 22 '18 at 21:42
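The comparison described in that comment could be sketched as follows. This is a hypothetical helper, not connector code; it assumes the "latest offset" is the partition's log-end offset (the offset the next record would receive), obtained separately, e.g. from the consumer API:

```python
def connector_lag(log_end_offsets, max_written_offsets):
    """Per-partition lag: log-end offset minus the highest offset the
    connector has committed to HDFS for that partition.

    log_end_offsets:     {partition: next offset to be produced}
    max_written_offsets: {partition: highest offset written to HDFS}
    A partition with no committed files counts its full log as lag.
    """
    return {
        p: end - max_written_offsets.get(p, -1) - 1
        for p, end in log_end_offsets.items()
    }
```

Feeding the result into the existing threshold alert would replace the consumer-group-based check that stopped working after the upgrade.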






Hi Rupesh, sorry for the delay in replying. We had a similar issue on the Confluent 4 platform: Connect HDFS tasks would fail and not recover because of active leases on WAL files in HDFS. A sure way of reproducing this is to restart the HDFS name node. We had alerting set up based on partitions created in the Hive metastore, i.e. if we don't see a new partition for the current hour within a few minutes of the hour rolling over, alert. We haven't seen this problem on the Confluent 5.0 platform, though, and have deprecated these alerts.

– Andrei Leibovski
Oct 19 '18 at 20:35






Hi @AndreiLeibovski, we are still seeing this issue with connect-hdfs 5.0.0. A restart of the datanode starts causing issues with acquiring the lease on the WAL. recoverLease doesn't help. Any suggestions for overcoming this from the Connect side?

– swamoch
Mar 1 at 9:16


