Sometimes etcd nodes can fail and show scary errors. For example:
the member has been permanently removed from the cluster
data-dir used by this member must be removed
failed to create snapshot directory: permission denied
discovery failed: member has already been bootstrapped
These errors usually mean that the node is broken and can no longer join the cluster. In my case, a corrupted data disk caused the node to fail.
This post is based on a 3-node etcd cluster running on Oracle Linux, with TLS enabled and data stored in /mnt/etcd.
The failed node is 3.etcd.corp.net; the healthy nodes are 1.etcd.corp.net and 2.etcd.corp.net.
First, connect to a healthy etcd node and list cluster members:
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member list
Find the ID of the broken node, and then remove the bad node from the cluster:
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member remove <id here>
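Copying the ID by hand is easy to get wrong, so the lookup can be scripted. The sketch below is only an illustration: member_id_by_name is a hypothetical helper, and it assumes the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs, learner flag), as in the expected output at the end of this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ID of the member whose name equals $1.
# Reads `etcdctl member list` output (default format) on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (same TLS flags as in the commands above):
#   etcdctl --endpoints=... member list | member_id_by_name 3.etcd.corp.net
```

It prints nothing if no member has that name, so check for an empty result before passing it to member remove.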
On the broken node, delete the old data (double-check the path!):
rm -rf /mnt/etcd/member
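Since rm -rf on a wrong or empty path is unforgiving, a small guard can help. This is a minimal sketch, not part of etcd: wipe_member_dir is a hypothetical function that refuses to run unless the given data directory actually contains a member subdirectory.

```shell
#!/usr/bin/env bash
# Hypothetical guard: delete only <data-dir>/member, and only if it exists.
wipe_member_dir() {
  local data_dir="${1:-}"
  if [ -z "$data_dir" ] || [ "$data_dir" = "/" ]; then
    echo "refusing: data dir is empty or /" >&2
    return 1
  fi
  if [ ! -d "$data_dir/member" ]; then
    echo "refusing: $data_dir/member does not exist" >&2
    return 1
  fi
  rm -rf -- "$data_dir/member"
}

# Usage on the broken node:
#   wipe_member_dir /mnt/etcd
```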
From a healthy node, add the member again:
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member add 3.etcd.corp.net \
  --peer-urls="https://3.etcd.corp.net:2380"
The --learner flag can also be used; it adds the node as a learner (non-voting) member instead of a full voting member. The node then syncs data from the leader safely, and you can promote it with etcdctl member promote <id>.
On the failed node, set these variables:
export ETCD_NAME="3.etcd.corp.net"
export ETCD_INITIAL_CLUSTER="3.etcd.corp.net=https://3.etcd.corp.net:2380,1.etcd.corp.net=https://1.etcd.corp.net:2380,2.etcd.corp.net=https://2.etcd.corp.net:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="https://3.etcd.corp.net:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
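The ETCD_INITIAL_CLUSTER value is long and easy to mistype. etcdctl member add should print these variables in its output, so copying them from there is the simplest option. As an alternative, here is a minimal sketch that builds the string; initial_cluster is a hypothetical helper, and it assumes every member's peer URL is https://<name>:2380, which holds for this cluster but not in general.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an ETCD_INITIAL_CLUSTER value from member names.
# Assumption: each member's peer URL is https://<name>:2380.
initial_cluster() {
  local out="" name
  for name in "$@"; do
    # Prepend a comma only when out already holds an entry.
    out="${out:+$out,}${name}=https://${name}:2380"
  done
  printf '%s\n' "$out"
}

# Usage:
#   export ETCD_INITIAL_CLUSTER="$(initial_cluster 3.etcd.corp.net 1.etcd.corp.net 2.etcd.corp.net)"
```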
and run this to join the node back to the cluster:
etcd \
  --data-dir=/mnt/etcd \
  --listen-peer-urls=https://10.10.23.3:2380 \
  --listen-client-urls=https://10.10.23.3:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://3.etcd.corp.net:2379 \
  --initial-cluster-token=etcd.cluster \
  --cert-file=/etc/etcd/ssl/server.crt \
  --key-file=/etc/etcd/ssl/server.key \
  --trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --client-cert-auth=true \
  --peer-cert-file=/etc/etcd/ssl/server.crt \
  --peer-key-file=/etc/etcd/ssl/server.key \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --peer-client-cert-auth=true \
  --auto-compaction-mode=revision \
  --auto-compaction-retention=1000 \
  --snapshot-count=10000 \
  --heartbeat-interval=500 \
  --election-timeout=5000
Verify permissions and correct them if needed:
chown -R etcd:etcd /mnt/etcd
Once the node has joined, stop the foreground etcd process, then restart the systemd service (or whatever process manager you use) so the node runs in the background as usual:
systemctl restart etcd
Finally, check the cluster status from any node. All three members should show started.
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member list
Expected Output:
84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false
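If you want to check this from a script rather than by eye, something like the following could work. It is a rough sketch under the same assumption about the default member list output format; all_members_started is a hypothetical helper, not an etcdctl feature.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if stdin lists exactly $1 members
# and every one of them has the status "started".
all_members_started() {
  awk -F', ' -v want="$1" '
    $2 == "started" { ok++ }
    END { exit (NR == want + 0 && ok + 0 == want + 0) ? 0 : 1 }
  '
}

# Usage (same TLS flags as in the commands above):
#   etcdctl --endpoints=... member list | all_members_started 3 && echo "cluster OK"
```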
Hope this saves you some headaches.