Fixing a Broken etcd Cluster Member

Sometimes etcd nodes can fail and show scary errors. For example:

the member has been permanently removed from the cluster
data-dir used by this member must be removed
failed to create snapshot directory: permission denied
discovery failed: member has already been bootstrapped

These errors usually mean the node is broken and can no longer join the cluster. In my case, a corrupted data disk caused the node to fail.

This post is based on a 3-node etcd cluster running on Oracle Linux, with TLS enabled and data stored in /mnt/etcd.
The failed node is 3.etcd.corp.net; the healthy nodes are 1.etcd.corp.net and 2.etcd.corp.net.

First, connect to a healthy etcd node and list cluster members:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Find the ID of the broken node in the output, then remove it from the cluster:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member remove <member-id>

On the broken node, delete the old data directory (double-check the path!):

rm -rf /mnt/etcd/member
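If disk space allows, a safer variant (my suggestion, not a required step) is to move the directory aside instead of deleting it right away:

```shell
# Keep the stale data as a backup until the node has rejoined,
# then remove it once the cluster is healthy again
mv /mnt/etcd/member /mnt/etcd/member.bak
```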

From a healthy node, add the member again:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member add 3.etcd.corp.net \
--peer-urls="https://3.etcd.corp.net:2380"

A --learner flag can be used as well; it adds the node as a learner rather than a full voting member. The learner then safely syncs data from the leader, and once it has caught up you can promote it with etcdctl member promote.
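For example (a command sketch with the same TLS flags as above; <member-id> is a placeholder for the learner's ID from member list):

```shell
# Add the node back as a learner (non-voting) member first
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member add 3.etcd.corp.net --learner \
  --peer-urls="https://3.etcd.corp.net:2380"

# Once the learner has caught up with the leader,
# promote it to a full voting member
etcdctl --endpoints=https://1.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  member promote <member-id>
```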

On the failed node, set these variables:

export ETCD_NAME="3.etcd.corp.net"
export ETCD_INITIAL_CLUSTER="3.etcd.corp.net=https://3.etcd.corp.net:2380,1.etcd.corp.net=https://1.etcd.corp.net:2380,2.etcd.corp.net=https://2.etcd.corp.net:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="https://3.etcd.corp.net:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
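Typing the ETCD_INITIAL_CLUSTER string by hand is error-prone; it can also be assembled from the node list (a sketch; member order in the string does not matter to etcd):

```shell
# Build the name=peer-url pairs for every cluster member
NODES="1.etcd.corp.net 2.etcd.corp.net 3.etcd.corp.net"
CLUSTER=""
for n in $NODES; do
  CLUSTER="${CLUSTER:+$CLUSTER,}$n=https://$n:2380"
done
export ETCD_INITIAL_CLUSTER="$CLUSTER"
echo "$ETCD_INITIAL_CLUSTER"
```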

and run this to join the node back to the cluster:

etcd \
--data-dir=/mnt/etcd \
--listen-peer-urls=https://10.10.23.3:2380 \
--listen-client-urls=https://10.10.23.3:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://3.etcd.corp.net:2379 \
--initial-cluster-token=etcd.cluster \
--cert-file=/etc/etcd/ssl/server.crt \
--key-file=/etc/etcd/ssl/server.key \
--trusted-ca-file=/etc/etcd/ssl/ca.crt \
--client-cert-auth=true \
--peer-cert-file=/etc/etcd/ssl/server.crt \
--peer-key-file=/etc/etcd/ssl/server.key \
--peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
--peer-client-cert-auth=true \
--auto-compaction-mode=revision \
--auto-compaction-retention=1000 \
--snapshot-count=10000 \
--heartbeat-interval=500 \
--election-timeout=5000

Verify permissions and correct them if needed:

chown -R etcd:etcd /mnt/etcd

Now stop the foreground etcd process and restart the systemd service (or whatever process manager you use) so the node runs in the background normally: systemctl restart etcd
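Before the final member check, it is worth confirming that every endpoint answers health probes (same TLS flags as above):

```shell
# Each endpoint should report "is healthy"
etcdctl --endpoints=https://1.etcd.corp.net:2379,https://2.etcd.corp.net:2379,https://3.etcd.corp.net:2379 \
  --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/server.crt \
  --key=/etc/etcd/ssl/server.key \
  endpoint health
```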

Finally, check the cluster status from any node. All three members should show started.

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Expected Output:

84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false
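To assert the same thing non-interactively, a small awk filter can check that every member line reports started (a sketch; the heredoc below reuses the sample output, in practice pipe the real etcdctl member list output into it):

```shell
# Exit non-zero if any member line is not in the "started" state;
# fields are comma-separated, the state is the second field
check_started() {
  awk -F', *' '$2 != "started" { bad=1 } END { exit bad }'
}

check_started <<'EOF' && echo "all members started"
84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false
EOF
```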

Hope this saves you some headache!

How to move ClickHouse data to a new partition

Before proceeding, make sure to create a complete backup of your ClickHouse data. In this post, I assume you have an additional disk without any partitions on it.

Start by creating a new partition (LVM is used below). If you have a cluster, repeat these steps on each node.

# Create partition
lsblk # get dev name
fdisk /dev/sdb # use 8e type, other settings are default
lsblk # check
pvcreate /dev/sdb1 # create a volume
pvdisplay # check volumes
vgcreate clickhouse /dev/sdb1 # create a volume group
lvcreate --name data -l 100%FREE clickhouse # create a logical volume
mkfs.ext4 /dev/clickhouse/data # make ext4 fs
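The result can be sanity-checked before touching fstab (a sketch; both tools ship with standard LVM installs):

```shell
lsblk -f /dev/sdb   # sdb1 should show as an LVM PV with the ext4 LV on top
lvs clickhouse      # the "data" LV should take all free space in the VG
```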

Add a new mount point to /etc/fstab:

# edit /etc/fstab; following best practices, use the noatime option
# reference the device by path (/dev/mapper/clickhouse-data) or by UUID
# to get the UUID, run: blkid /dev/mapper/clickhouse-data

# Example
/dev/mapper/clickhouse-data  /var/lib/clickhouse ext4  defaults,noatime     0       0
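The new entry can be validated without mounting anything, which matters here because the data has not been moved yet (a sketch; findmnt is part of util-linux):

```shell
# --verify parses /etc/fstab and reports problems without mounting
findmnt --verify
```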

If you have a cluster, identify the shard/replica and check the replication queue.

SELECT database, table, source_replica FROM system.replication_queue;

SELECT cluster, host_name, shard_num, shard_weight, replica_num FROM system.clusters ORDER BY shard_num;

On each replica in a shard, one by one:

# Stop ch server
sudo systemctl stop clickhouse-server

# prepare dirs
mv /var/lib/clickhouse /var/lib/clickhouse-tmp
mkdir /var/lib/clickhouse
chown clickhouse:clickhouse /var/lib/clickhouse

# activate the mount defined in the fstab 
mount /var/lib/clickhouse 

# copy data
cp -R /var/lib/clickhouse-tmp/* /var/lib/clickhouse/
chown -R clickhouse:clickhouse /var/lib/clickhouse

# get ch server back
sudo systemctl start clickhouse-server
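As an alternative to the cp/chown pair above, rsync (assuming it is installed; not part of the original steps) copies and preserves ownership in one pass, which is handy for large data sets:

```shell
# -a preserves ownership, permissions and timestamps, so the follow-up
# chown becomes unnecessary; trailing slashes copy directory contents
rsync -a --info=progress2 /var/lib/clickhouse-tmp/ /var/lib/clickhouse/
```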

Check the databases, tables, and ClickHouse server state (error logs are usually located at /var/log/clickhouse-server/clickhouse-server.err.log).

If everything works fine, delete the temporary directory with rm -rf /var/lib/clickhouse-tmp and check the disk space with df -h.