Fixing a Broken etcd Cluster Member

Sometimes etcd nodes can fail and show scary errors. For example:

the member has been permanently removed from the cluster
data-dir used by this member must be removed
failed to create snapshot directory: permission denied
discovery failed: member has already been bootstrapped

These errors usually mean the node is broken and cannot rejoin the cluster. In my case, a corrupted data disk caused the node to fail.

This post is based on a 3-node etcd cluster running on Oracle Linux, with TLS enabled and data stored in /mnt/etcd.
The failed node is 3.etcd.corp.net; the healthy nodes are 1.etcd.corp.net and 2.etcd.corp.net.

First, connect to a healthy etcd node and list cluster members:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list
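The failed member's ID can be filtered out of that listing. A small sketch, using hypothetical IDs in place of real etcdctl output (in practice, pipe the real output through the same awk filter):

```shell
# Hypothetical `member list` output; the IDs are made up.
member_list='84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false'

# Column 1 is the member ID, column 3 the member name.
bad_id=$(printf '%s\n' "$member_list" | awk -F', ' '$3 == "3.etcd.corp.net" {print $1}')
echo "$bad_id"
```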

Find the ID of the broken node, then remove it from the cluster:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member remove <id here>

On the broken node, delete the old data (double-check the path!):

rm -rf /mnt/etcd/member
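A safer variant is to move the stale data aside instead of deleting it, so it can be restored if the wrong path was typed. A sketch, with a temp directory standing in for /mnt/etcd:

```shell
# $ETCD_DATA stands in for /mnt/etcd in this demo.
ETCD_DATA=$(mktemp -d)
mkdir -p "$ETCD_DATA/member/snap" "$ETCD_DATA/member/wal"

# Move the stale member directory aside with a timestamp
# instead of rm -rf'ing it outright.
ts=$(date +%Y%m%d%H%M%S)
mv "$ETCD_DATA/member" "$ETCD_DATA/member.bak.$ts"

ls "$ETCD_DATA"
```

Once the rebuilt node is healthy again, the backup directory can be removed.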

From a healthy node, add the member again:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member add 3.etcd.corp.net \
--peer-urls="https://3.etcd.corp.net:2380"

A --learner flag can be used as well; it adds the node as a learner (non-voting) member instead of a full voting member. The learner then safely syncs data from the leader, and once it has caught up you can promote it with etcdctl member promote.
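The learner flow might look like this. This is a dry-run sketch: the etcdctl stub below only echoes the command it would run (remove it to execute for real), the TLS flags are omitted for brevity, and the member ID is hypothetical:

```shell
# Stub: echo the etcdctl invocation instead of executing it.
etcdctl() { echo "etcdctl $*"; }

ENDPOINTS=https://1.etcd.corp.net:2379

# 1. Add the rebuilt node as a non-voting learner.
etcdctl --endpoints=$ENDPOINTS member add 3.etcd.corp.net \
  --peer-urls=https://3.etcd.corp.net:2380 --learner

# 2. Once it has caught up with the leader, promote it to a voter.
etcdctl --endpoints=$ENDPOINTS member promote eb7647b280e36f87
```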

On the failed node, set these variables:

export ETCD_NAME="3.etcd.corp.net"
export ETCD_INITIAL_CLUSTER="3.etcd.corp.net=https://3.etcd.corp.net:2380,1.etcd.corp.net=https://1.etcd.corp.net:2380,2.etcd.corp.net=https://2.etcd.corp.net:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="https://3.etcd.corp.net:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"

and run this to join the node back to the cluster:

etcd \
--data-dir=/mnt/etcd \
--listen-peer-urls=https://10.10.23.3:2380 \
--listen-client-urls=https://10.10.23.3:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://3.etcd.corp.net:2379 \
--initial-cluster-token=etcd.cluster \
--cert-file=/etc/etcd/ssl/server.crt \
--key-file=/etc/etcd/ssl/server.key \
--trusted-ca-file=/etc/etcd/ssl/ca.crt \
--client-cert-auth=true \
--peer-cert-file=/etc/etcd/ssl/server.crt \
--peer-key-file=/etc/etcd/ssl/server.key \
--peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
--peer-client-cert-auth=true \
--auto-compaction-mode=revision \
--auto-compaction-retention=1000 \
--snapshot-count=10000 \
--heartbeat-interval=500 \
--election-timeout=5000

If etcd complains about permissions (like the snapshot-directory error above), verify the ownership of the data directory and correct it if needed:

chown -R etcd:etcd /mnt/etcd

Once the node has joined, stop the foreground process and restart the systemd service (or whatever process manager you use) so etcd runs in the background normally: systemctl restart etcd
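If no unit exists yet on the rebuilt node, a minimal sketch might look like this. The binary path and the EnvironmentFile name are assumptions; the ETCD_* exports above (without the export keyword) go into the EnvironmentFile, since etcd reads its flags from ETCD_* environment variables as well:

```ini
# /etc/systemd/system/etcd.service -- minimal sketch, paths assumed
[Unit]
Description=etcd key-value store
After=network-online.target

[Service]
User=etcd
EnvironmentFile=/etc/etcd/etcd.env
ExecStart=/usr/bin/etcd --data-dir=/mnt/etcd
Restart=on-failure

[Install]
WantedBy=multi-user.target
```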

Finally, check the cluster status from any node. All three members should show started.

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Expected output (the last column is the IsLearner flag):

84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false

Hope this saves you some headache.

Why Old Docker Clients No Longer Work In Your Pipelines

If your pipelines suddenly started failing with an error like:

Error response from daemon: client version 1.40 is too old. 
Minimum supported API version is 1.44

even though you didn't change anything, this is not a CI issue and not a random failure. A breaking change was introduced recently.

Let’s say you have a Docker-in-Docker setup in GitLab CI, for example:

image: docker:stable # gets the latest stable 
services:
  - docker:dind # pulls the latest dind version
script:
  - docker login/build/push ...

...and this has worked for years.

The problem is that docker:stable has not actually been updated in a long time. Its effective version is Docker 19.03.14.

The docker:stable, docker:test, and related “channel” tags have been deprecated since June 2020 and have not been updated since December 2020 (when Docker 20.10 was released).

At the same time, docker:dind is actively updated and may now be running Docker 29.2.0 (as of February 1, 2026).

The docker login command is executed inside the job container (docker:stable), which contains the Docker CLI. That CLI sends requests to the Docker daemon running in the docker:dind service. With this version mismatch, the request now fails. Why?

Starting with Docker Engine 29, the Docker daemon enforces a minimum supported Docker API version and drops support for older clients entirely. This is a real breaking change, and it has a significant impact on CI systems — especially GitLab CI setups using the Docker executor with docker:dind.

The daemon now requires API version v1.44 or later (Docker v25.0+).
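The gate itself is just a version comparison. A rough sketch of the rejection logic (not Docker's actual code), using the numbers from the error above:

```shell
# Versions taken from the error message above.
client=1.40
minimum=1.44

# sort -V orders version strings numerically; if the client's
# version sorts below the minimum, the daemon rejects it.
lowest=$(printf '%s\n%s\n' "$minimum" "$client" | sort -V | head -n1)
if [ "$lowest" = "$client" ] && [ "$client" != "$minimum" ]; then
  echo "client version $client is too old. Minimum supported API version is $minimum"
else
  echo "client $client accepted"
fi
```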

This would not have been an issue if best practices had been followed. GitLab documentation (and many other sources) clearly states:

You should always pin a specific version of the image, like docker:24.0.5. If you use a tag like docker:latest, you have no control over which version is used. This can cause incompatibility problems when new versions are released.

Another case illustrating why you should not use latest or any other tag that doesn’t allow you to control which version is used.
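To see which versions are actually in play in a job, you can query both sides. Another dry-run sketch: the docker stub below only echoes the commands (remove it to run them inside a real job):

```shell
# Stub: echo the docker invocation instead of executing it.
docker() { echo "docker $*"; }

# API version of the CLI in the job container:
docker version --format '{{.Client.APIVersion}}'

# API version of the daemon in the dind service:
docker version --format '{{.Server.APIVersion}}'
```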

Solution 1 – use specific versions (25+ in this case; recommended)

image: docker:29.2.0 # or docker:29.2.0-cli which is lighter
services:
  - docker:29.2.0-dind
script:
  - docker login/build/push ...

Solution 2 – set a minimum API version for the daemon:

  services:
    - name: docker:dind
      variables:
        DOCKER_MIN_API_VERSION: "1.40"

Solution 3 – set the minimum API version in daemon.json:

{
  "min-api-version": "1.40"
}

1.40 is the API version of the client shipped in docker:stable (Docker 19.03); in theory, Docker can break something again, so Solution 1 is always preferable (for me, at least).

Hope it was helpful to someone.