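Symptoms
The Kafka cluster has 3 brokers, and one of them went down. After that, some consumer groups could still consume messages, while others could not. Once the service was restarted, everything worked again.
In principle, a Kafka cluster is supposed to be highly available, so why would losing a single broker make part of the service unavailable?
A quick search suggested that the most likely fix is to increase the replica count of Kafka's internal topic __consumer_offsets.
Checking the __consumer_offsets topic
Run: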
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:1 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 12 Replicas: 12 Isr: 12
Topic: __consumer_offsets Partition: 1 Leader: 87 Replicas: 87 Isr: 87
Topic: __consumer_offsets Partition: 2 Leader: 11 Replicas: 11 Isr: 11
Topic: __consumer_offsets Partition: 3 Leader: 12 Replicas: 12 Isr: 12
....
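ReplicationFactor is 1, so the topic is a single point of failure. Kafka stores the committed offsets of each consumer group in one partition of __consumer_offsets, and the broker that leads that partition acts as the group's coordinator. If broker 12 above goes down, partitions 0 and 3 have no other replica to fail over to, and the groups mapped to those partitions can no longer fetch or commit offsets, so they stop consuming.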
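To find out which groups depend on which partition, the mapping can be reproduced outside Kafka. The sketch below assumes the broker chooses the partition as the absolute value of the Java String.hashCode() of the group id, modulo the partition count (50 here). The function names are illustrative and not part of any Kafka client, so it is worth cross-checking the result against your Kafka version.

```python
def java_string_hashcode(s: str) -> int:
    """Emulate Java's String.hashCode(): signed 32-bit, over UTF-16 code units."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    # Reinterpret the unsigned 32-bit value as a signed int
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition(group_id: str, partition_count: int = 50) -> int:
    """Assumed mapping: abs(hashCode(group_id)) % partition_count."""
    h = java_string_hashcode(group_id)
    # Assumption: Integer.MIN_VALUE has no positive counterpart and is treated as 0
    magnitude = 0 if h == -0x80000000 else abs(h)
    return magnitude % partition_count


if __name__ == "__main__":
    # Group ids taken from the application logs later in this post
    for group in ("bonus-coin", "bonus-exp"):
        print(group, "->", "__consumer_offsets-%d" % offsets_partition(group))
```

Comparing the printed partition with the Leader column of the describe output shows which broker a given group relies on.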
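Reassigning the partitions
- Create the JSON file that describes the new replica assignment by running the command below. Note that the replicas field at the end of each entry is the list of broker ids in the cluster: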
cat > increase-replication-factor.json <<EOF
{"version":1, "partitions":[
{"topic":"__consumer_offsets","partition":0,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":1,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":2,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":3,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":4,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":5,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":6,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":7,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":8,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":9,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":10,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":11,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":12,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":13,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":14,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":15,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":16,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":17,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":18,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":19,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":20,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":21,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":22,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":23,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":24,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":25,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":26,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":27,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":28,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":29,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":30,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":31,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":32,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":33,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":34,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":35,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":36,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":37,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":38,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":39,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":40,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":41,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":42,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":43,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":44,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":45,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":46,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":47,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":48,"replicas":[11,12,87]},
{"topic":"__consumer_offsets","partition":49,"replicas":[11,12,87]}]
}
EOF
- Execute the reassignment:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute
- Verify that the reassignment completed:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --verify
Result:
Status of partition reassignment:
Reassignment of partition __consumer_offsets-22 completed successfully
Reassignment of partition __consumer_offsets-30 completed successfully
Reassignment of partition __consumer_offsets-8 completed successfully
....
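The steps above can also be scripted. This is a minimal sketch, assuming the same topic, 50 partitions, and broker ids 11, 12 and 87 as in this post. build_reassignment produces the same plan as the heredoc without listing 50 entries by hand, and unfinished_partitions scans the --verify output for any partition whose line does not end with "completed successfully". Both helpers are hypothetical and not part of Kafka, and the parser assumes the output format shown above, which may differ in other Kafka versions.

```python
import json
import re

# Matches lines such as:
# "Reassignment of partition __consumer_offsets-22 completed successfully"
_VERIFY_LINE = re.compile(r"^Reassignment of partition (\S+) (.+)$")


def build_reassignment(topic, partition_count, broker_ids, rotate=False):
    """Build a reassignment plan that places every partition on all brokers."""
    ids = list(broker_ids)
    partitions = []
    for p in range(partition_count):
        shift = p % len(ids) if rotate else 0
        # With rotate=True the first (preferred) replica differs per partition
        replicas = ids[shift:] + ids[:shift]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


def unfinished_partitions(verify_output):
    """Return the partitions whose status is not 'completed successfully'."""
    pending = []
    for line in verify_output.splitlines():
        m = _VERIFY_LINE.match(line.strip())
        if m and m.group(2) != "completed successfully":
            pending.append(m.group(1))
    return pending


if __name__ == "__main__":
    plan = build_reassignment("__consumer_offsets", 50, [11, 12, 87])
    # Redirect stdout to increase-replication-factor.json to reuse the commands above
    print(json.dumps(plan, indent=1))
```

With rotate=False the result matches the JSON in this post, where broker 11 is listed first for every partition. Since the first broker in each list is the preferred leader, rotate=True is an optional variation that spreads preferred leadership across the brokers.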
Checking the topic again
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets
The change has taken effect:
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 11 Replicas: 11,12,87 Isr: 12,11,87
Topic: __consumer_offsets Partition: 1 Leader: 11 Replicas: 11,12,87 Isr: 87,11,12
Topic: __consumer_offsets Partition: 2 Leader: 11 Replicas: 11,12,87 Isr: 11,12,87
Topic: __consumer_offsets Partition: 3 Leader: 11 Replicas: 11,12,87 Isr: 12,11,87
....
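With 50 partitions it is easy to miss one that is still under-replicated. The following sketch assumes the describe output format shown above (one "Partition: N Leader: X Replicas: ... Isr: ..." line per partition). It reports every partition that has fewer replicas than expected or whose in-sync replica set does not match its replica list. The function name and the expected_rf parameter are illustrative.

```python
import re

# Matches the per-partition lines of the describe output; the summary line
# ("PartitionCount:50 ...") does not contain "Partition:" and is skipped.
_PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*\S+\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def under_replicated(describe_output, expected_rf=3):
    """Return partition numbers with too few replicas or an incomplete ISR."""
    problems = []
    for line in describe_output.splitlines():
        m = _PARTITION_LINE.search(line)
        if not m:
            continue
        partition = int(m.group(1))
        replicas = set(m.group(2).split(","))
        isr = {broker for broker in m.group(3).split(",") if broker}
        if len(replicas) < expected_rf or isr != replicas:
            problems.append(partition)
    return problems


if __name__ == "__main__":
    sample = (
        "Topic: __consumer_offsets Partition: 0 Leader: 11 "
        "Replicas: 11,12,87 Isr: 12,11,87"
    )
    print(under_replicated(sample))
```

An empty list means every listed partition has the expected number of replicas and all of them are in sync.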
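Verifying availability
Kill the Kafka process on one of the brokers, send a message from the application, and check whether the consumers can still consume it.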
2019-01-22 13:25:54.038 WARN 37504 --- [ntainer#0-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-2, groupId=bonus-coin] Connection to node -2 could not be established. Broker may not be available.
2019-01-22 13:26:34.360 INFO 37504 --- [ntainer#0-0-C-1] q.b.c.c.CounterChangeMessageCoinConsumer : coin:counter message received! topic = bonus_counter_change_xiajinlong2, key = THUMBS_UP, offset = 27, value = {"actionId":2,"actionLogId":186,"cancel":false,"countType":"THUMBS_UP","createDate":"2019-01-22","createTime":"2019-01-22T13:26:34.304","currentCancelNum":0,"currentNum":1,"eventId":2,"staticsType":"USER_COUNTER","userCounter":{"cancelNum":0,"countDate":"2019-01-22","countType":"thumbs_up","id":161,"num":1,"updateTime":"2019-01-22T13:26:34.303","userId":"xjl222"},"userId":"xjl222"}
2019-01-22 13:26:34.466 INFO 37504 --- [ntainer#0-0-C-1] o.h.h.i.QueryTranslatorFactoryInitiator : HHH000397: Using ASTQueryTranslatorFactory
2019-01-22 13:26:34.698 INFO 37504 --- [ntainer#0-0-C-1] q.b.c.c.CounterChangeMessageCoinConsumer : coin:counter message success handled!
2019-01-22 13:26:34.698 INFO 37504 --- [ntainer#0-0-C-1] q.b.c.c.CounterChangeMessageCoinConsumer : coin:counter message received! topic = bonus_counter_change_xiajinlong2, key = BE_THUMBS_UP_ED, offset = 27, value = {"actionId":2,"actionLogId":186,"cancel":false,"countType":"BE_THUMBS_UP_ED","createDate":"2019-01-22","createTime":"2019-01-22T13:26:34.343","currentCancelNum":0,"currentNum":1,"eventId":3,"staticsType":"USER_COUNTER","userCounter":{"cancelNum":0,"countDate":"2019-01-22","countType":"be_thumbs_up_ed","id":162,"num":1,"updateTime":"2019-01-22T13:26:34.343","userId":"xjl242"},"userId":"xjl242"}
2019-01-22 13:26:34.753 INFO 37504 --- [ntainer#0-0-C-1] q.b.c.c.CounterChangeMessageCoinConsumer : coin:counter message success handled!
2019-01-22 13:25:53.060 WARN 47832 --- [ntainer#0-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-2, groupId=bonus-exp] Connection to node -2 could not be established. Broker may not be available.
2019-01-22 13:26:34.359 INFO 47832 --- [ntainer#0-0-C-1] .q.b.e.c.CounterChangeMessageExpConsumer : exp:counter message received! topic = bonus_counter_change_xiajinlong2, key = THUMBS_UP, offset = 27, value = {"actionId":2,"actionLogId":186,"cancel":false,"countType":"THUMBS_UP","createDate":"2019-01-22","createTime":"2019-01-22T13:26:34.304","currentCancelNum":0,"currentNum":1,"eventId":2,"staticsType":"USER_COUNTER","userCounter":{"cancelNum":0,"countDate":"2019-01-22","countType":"thumbs_up","id":161,"num":1,"updateTime":"2019-01-22T13:26:34.303","userId":"xjl222"},"userId":"xjl222"}
2019-01-22 13:26:34.460 INFO 47832 --- [ntainer#0-0-C-1] o.h.h.i.QueryTranslatorFactoryInitiator : HHH000397: Using ASTQueryTranslatorFactory
2019-01-22 13:26:34.691 INFO 47832 --- [ntainer#0-0-C-1] .q.b.e.c.CounterChangeMessageExpConsumer : exp:counter message success handled!
2019-01-22 13:26:34.691 INFO 47832 --- [ntainer#0-0-C-1] .q.b.e.c.CounterChangeMessageExpConsumer : exp:counter message received! topic = bonus_counter_change_xiajinlong2, key = BE_THUMBS_UP_ED, offset = 27, value = {"actionId":2,"actionLogId":186,"cancel":false,"countType":"BE_THUMBS_UP_ED","createDate":"2019-01-22","createTime":"2019-01-22T13:26:34.343","currentCancelNum":0,"currentNum":1,"eventId":3,"staticsType":"USER_COUNTER","userCounter":{"cancelNum":0,"countDate":"2019-01-22","countType":"be_thumbs_up_ed","id":162,"num":1,"updateTime":"2019-01-22T13:26:34.343","userId":"xjl242"},"userId":"xjl242"}
2019-01-22 13:26:34.757 INFO 47832 --- [ntainer#0-0-C-1] .q.b.e.c.CounterChangeMessageExpConsumer : exp:counter message success handled!
The logs confirm that messages are still consumed normally while some of the Kafka brokers are down.