Fixing a Broken etcd Cluster Member


Sometimes etcd nodes can fail and show scary errors. For example:

the member has been permanently removed from the cluster
data-dir used by this member must be removed
failed to create snapshot directory: permission denied
discovery failed: member has already been bootstrapped

These errors usually mean that the node is broken and can no longer join the cluster. In my case, a corrupted data disk caused the node to fail.

This post is based on a 3-node etcd cluster running on Oracle Linux, with TLS enabled and data stored in /mnt/etcd.
The failed node is 3.etcd.corp.net; the healthy nodes are 1.etcd.corp.net and 2.etcd.corp.net.

First, connect to a healthy etcd node and list cluster members:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list
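Since every etcdctl call in this post repeats the same endpoint and TLS flags, one option is to export them once. A minimal sketch, assuming etcdctl v3, which reads each flag from an ETCDCTL_-prefixed environment variable; the paths are the same ones used above:

```shell
# Same values as the flags above, set once for the current shell session.
export ETCDCTL_ENDPOINTS="https://1.etcd.corp.net:2379"
export ETCDCTL_CACERT="/etc/etcd/ssl/ca.crt"
export ETCDCTL_CERT="/etc/etcd/ssl/server.crt"
export ETCDCTL_KEY="/etc/etcd/ssl/server.key"
```

With these exported, a plain "etcdctl member list" should behave like the long form. Do not pass the same setting both as a flag and as a variable; pick one.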

Find the ID of the broken node, and then remove the bad node from the cluster:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member remove <id here>
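If you would rather not copy the ID by hand, it can be pulled out of the member list output. This is only a sketch: member_id_by_name is a helper name I made up, and it assumes the default comma-separated output format (ID, status, name, peer URL, client URL, learner flag).

```shell
# Hypothetical helper: reads `etcdctl member list` output on stdin and
# prints the ID (first column) of the member whose name (third column)
# matches the argument.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Usage would be piping the member list command into "member_id_by_name 3.etcd.corp.net" and passing the result to member remove. Check the printed ID by eye before removing anything.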

On the broken node, make sure etcd is stopped, then delete the old data (double-check the path!):

rm -rf /mnt/etcd/member
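Because rm -rf is unforgiving, a small guard around it can help. The sketch below is my own addition, not part of etcd: wipe_member_dir is a made-up name, and it assumes the usual data-dir layout where member/ contains wal and/or snap. On a badly damaged disk that layout may be gone, in which case the guard will refuse and you are back to the manual command.

```shell
# Hypothetical guard: delete <data-dir>/member only if it looks like an
# etcd member directory. Assumes etcd is already stopped on this node.
wipe_member_dir() {
  data_dir=$1
  if [ -z "$data_dir" ]; then
    echo "usage: wipe_member_dir <data-dir>" >&2
    return 1
  fi
  member_dir="$data_dir/member"
  if [ ! -d "$member_dir/wal" ] && [ ! -d "$member_dir/snap" ]; then
    echo "$member_dir does not look like an etcd member directory" >&2
    return 1
  fi
  rm -rf -- "$member_dir"
}
```

For the setup in this post the call would be "wipe_member_dir /mnt/etcd".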

From a healthy node, add the member again:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member add 3.etcd.corp.net \
--peer-urls="https://3.etcd.corp.net:2380"

The --learner flag can also be used here. It adds the node as a learner (non-voting) member instead of a full voting member. The node then syncs data from the leader safely, and you can promote it afterwards with etcdctl member promote.
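The learner flow could look roughly like this. It is a sketch under assumptions: learner support needs etcd 3.4 or newer, the function names and the DRY_RUN switch are mine, and the member ID passed to the promote step has to come from member list.

```shell
# Hypothetical wrapper around etcdctl with the TLS flags from this post.
# With DRY_RUN set, it prints the command instead of running it.
etcdctl_tls() {
  set -- etcdctl --endpoints=https://1.etcd.corp.net:2379 \
    --cacert=/etc/etcd/ssl/ca.crt \
    --cert=/etc/etcd/ssl/server.crt \
    --key=/etc/etcd/ssl/server.key "$@"
  if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# Step 1: add the node as a non-voting learner.
add_learner() {
  etcdctl_tls member add 3.etcd.corp.net \
    --peer-urls="https://3.etcd.corp.net:2380" --learner
}

# Step 2: after the learner has caught up with the leader, promote it.
# $1 is the learner's member ID as shown by `member list`.
promote_learner() {
  etcdctl_tls member promote "$1"
}
```

Running with DRY_RUN=1 first lets you read the exact commands before they touch the cluster. Promotion is rejected by etcd if the learner has not caught up yet, so a failed promote usually just means waiting a bit longer.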

On the failed node, set these variables (etcdctl member add prints matching values in its output):

export ETCD_NAME="3.etcd.corp.net"
export ETCD_INITIAL_CLUSTER="3.etcd.corp.net=https://3.etcd.corp.net:2380,1.etcd.corp.net=https://1.etcd.corp.net:2380,2.etcd.corp.net=https://2.etcd.corp.net:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="https://3.etcd.corp.net:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
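The ETCD_INITIAL_CLUSTER string is easy to mistype, so here is a small sketch that builds it from member names. build_initial_cluster is a made-up helper, and it assumes what is true in this post: each member's name equals its hostname and peers listen on port 2380 over HTTPS.

```shell
# Hypothetical helper: turn member names into "name=https://name:2380"
# pairs joined by commas.
build_initial_cluster() {
  out=""
  for name in "$@"; do
    # ${out:+$out,} adds a comma only when out is already non-empty.
    out="${out:+$out,}${name}=https://${name}:2380"
  done
  printf '%s\n' "$out"
}
```

For example, ETCD_INITIAL_CLUSTER could be set to the output of "build_initial_cluster 3.etcd.corp.net 1.etcd.corp.net 2.etcd.corp.net". The list must include the node being re-added as well as the healthy members.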

Then run this to join the node back to the cluster:

etcd \
--data-dir=/mnt/etcd \
--listen-peer-urls=https://10.10.23.3:2380 \
--listen-client-urls=https://10.10.23.3:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://3.etcd.corp.net:2379 \
--initial-cluster-token=etcd.cluster \
--cert-file=/etc/etcd/ssl/server.crt \
--key-file=/etc/etcd/ssl/server.key \
--trusted-ca-file=/etc/etcd/ssl/ca.crt \
--client-cert-auth=true \
--peer-cert-file=/etc/etcd/ssl/server.crt \
--peer-key-file=/etc/etcd/ssl/server.key \
--peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
--peer-client-cert-auth=true \
--auto-compaction-mode=revision \
--auto-compaction-retention=1000 \
--snapshot-count=10000 \
--heartbeat-interval=500 \
--election-timeout=5000

Verify the ownership of the data directory and correct it if needed (running etcd by hand as root can leave root-owned files behind):

chown -R etcd:etcd /mnt/etcd

Now restart the systemd service (or whatever process manager you use) so the node runs in the background normally:

systemctl restart etcd

Finally, check the cluster status from any node. All three members should show started.

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Expected Output:

84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false
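If you want to script this last check, something like the following could work. It is a sketch: all_members_started is a name I invented, and it assumes the default comma-separated member list format shown above, where the second column is the status.

```shell
# Hypothetical check: reads `etcdctl member list` output on stdin and
# succeeds only if exactly $1 members are listed and all are "started".
all_members_started() {
  awk -F', ' -v want="$1" '
    NF == 0          { next }      # skip blank lines
    $2 == "started"  { ok++ }
    $2 != "started"  { bad++ }
    END { exit !(bad == 0 && ok == want) }
  '
}
```

Pipe the member list command into "all_members_started 3" and look at the exit status. A member that was added but has not joined yet shows up as unstarted and makes the check fail.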

Hope this saves you some headache.