mdaniel
Anything in my life that uses Zookeeper or its dumbass etcd friend means I'm going to have a real bad time. I am thankful they're at least shipping their own ZK-ish but it seems to have fallen into the same trap as etcd, where membership has to be managed like the precious little pets that they are https://clickhouse.com/docs/guides/sre/keeper/clickhouse-kee...
Zookeeper is the only clustering product I’ve ever used that actively refused to start a cluster after an all-nodes stop/start.
It blows my mind that a high availability system would purposefully prevent availability as a “feature”.
Although this is oversimplifying things [0], in the face of partitions zookeeper emphasizes consistency over availability.
[0] https://martin.kleppmann.com/2015/05/11/please-stop-calling-...
The problem with that is that an all-nodes stop/start is not a partition!
A partition is when some nodes can’t reach other nodes.
Zookeeper instead has an issue where it does try to restart but the timeout (why?!) is too short, something like 30 seconds. If the majority of your nodes don’t all start within a certain time window the whole cluster stays down until someone manually intervenes.
I discovered this fun feature when keeping non-prod systems off to save money in the cloud.
It also has an impact when making certain big bang changes in production.
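To make the start-window complaint above concrete, here is a rough toy model in Python. It is only a sketch of the behavior as described in this comment, not ZooKeeper's actual leader-election code: the function names, the `window_seconds` parameter, and the 30-second figure (which is just my recollection) are all assumptions for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def forms_quorum_in_window(start_times, ensemble_size, window_seconds):
    """Return True if a strict majority of nodes started within
    `window_seconds` of each other (hypothetical model of the timeout)."""
    needed = quorum_size(ensemble_size)
    times = sorted(start_times)
    # Check every run of `needed` consecutive start times: if the first and
    # last of the run are close enough, a majority was up inside the window.
    for i in range(len(times) - needed + 1):
        if times[i + needed - 1] - times[i] <= window_seconds:
            return True
    return False


# Three-node ensemble, nodes coming up 0 s, 10 s and 120 s after power-on:
# two of three are up within 30 s, so this toy model says the cluster forms.
print(forms_quorum_in_window([0, 10, 120], ensemble_size=3, window_seconds=30))

# Same ensemble with slow, staggered cloud boots a minute apart: no majority
# ever overlaps inside the window, which is the stuck case described above.
print(forms_quorum_in_window([0, 60, 120], ensemble_size=3, window_seconds=30))
```

In this model the fix is either a longer window or retrying indefinitely, which is why a short fixed timeout feels like the wrong default for a cold start.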