Fixing a Broken etcd Cluster Member

Sometimes etcd nodes can fail and show scary errors. For example:

the member has been permanently removed from the cluster
data-dir used by this member must be removed
failed to create snapshot directory: permission denied
discovery failed: member has already been bootstrapped

These errors usually mean that the node is broken and can no longer join the cluster. In my case, a corrupted data disk caused the node to fail.

This post is based on a three-node etcd cluster running on Oracle Linux, with TLS enabled and data stored in /mnt/etcd.
The failed node is 3.etcd.corp.net; the healthy nodes are 1.etcd.corp.net and 2.etcd.corp.net.

First, connect to a healthy etcd node and list cluster members:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Find the ID of the broken node, and then remove the bad node from the cluster:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member remove <id here>
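Picking the ID out of the member list output by hand is error-prone, so the lookup can be scripted. The sketch below is only an illustration: member_id_by_name is a hypothetical helper, not part of etcdctl, and it assumes the default comma-separated output of member list, where the ID is the first field and the member name is the third.

```shell
# Hypothetical helper (not an etcdctl feature): print the ID of the member
# whose name equals $1. Expects `etcdctl member list` output on stdin in the
# default comma-separated format: ID, status, name, peer URLs, client URLs, learner.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage, with the same TLS flags as in the commands above:
#   etcdctl --endpoints=https://1.etcd.corp.net:2379 ... member list \
#     | member_id_by_name 3.etcd.corp.net
```

If the helper prints nothing, the name did not match any member, so check the list manually before removing anything.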

On the broken node, stop etcd if it is still running, then delete the old data (double-check the path first!):

rm -rf /mnt/etcd/member
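Because rm -rf on a mistyped path is unforgiving, the deletion can be wrapped in a small guard. This is a minimal sketch under my own assumptions: remove_member_dir is a hypothetical function name, and the only rule it enforces is that the target must be an existing directory whose last path component is "member".

```shell
# Hypothetical guard: refuse to delete anything that is not an existing
# directory named "member" (for example /mnt/etcd/member).
remove_member_dir() {
  dir="$1"
  case "$dir" in
    */member) ;;  # looks like an etcd member directory, continue
    *) echo "refusing to remove '$dir': not a member directory" >&2; return 1 ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "'$dir' does not exist or is not a directory" >&2
    return 1
  fi
  rm -rf -- "$dir"
}

# Usage on the broken node:
#   remove_member_dir /mnt/etcd/member
```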

From a healthy node, add the member again:

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member add 3.etcd.corp.net \
--peer-urls="https://3.etcd.corp.net:2380"

The --learner flag can also be used; it adds the node as a non-voting learner member instead of a full voting member. The node then syncs data from the leader safely, and you can promote it afterwards with etcdctl member promote <id>.
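The learner workflow might look roughly like the sketch below. The two etcdctl commands are shown as comments because they need a live cluster (TLS flags omitted for brevity). member_is_learner is a hypothetical helper of mine, and it assumes the last column of the default member list output is the learner flag.

```shell
# Learner workflow (run from a healthy node):
#   etcdctl member add 3.etcd.corp.net --learner \
#     --peer-urls="https://3.etcd.corp.net:2380"
#   ... start etcd on the new node and let it catch up ...
#   etcdctl member promote <member id>

# Hypothetical helper: succeed if member $1 is still listed as a learner.
# Expects default `etcdctl member list` output on stdin; assumes the sixth
# comma-separated field is the learner flag ("true" or "false").
member_is_learner() {
  awk -F', ' -v name="$1" '
    $3 == name && $6 == "true" { found = 1 }
    END { exit !found }
  '
}
```

After a successful promotion the helper should start failing for that member, because the flag flips to false.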

On the failed node, set these variables:

export ETCD_NAME="3.etcd.corp.net"
export ETCD_INITIAL_CLUSTER="3.etcd.corp.net=https://3.etcd.corp.net:2380,1.etcd.corp.net=https://1.etcd.corp.net:2380,2.etcd.corp.net=https://2.etcd.corp.net:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="https://3.etcd.corp.net:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
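The ETCD_INITIAL_CLUSTER value is long and easy to mistype, so it can be generated instead. A minimal sketch, assuming every member follows this post's convention of name=https://name:2380; build_initial_cluster is a hypothetical helper name.

```shell
# Hypothetical helper: build the initial-cluster string from member names,
# assuming each peer URL is https://<name>:2380 as in this post.
build_initial_cluster() {
  result=""
  for name in "$@"; do
    # Prepend a comma only when result already holds an entry.
    result="${result:+$result,}${name}=https://${name}:2380"
  done
  printf '%s\n' "$result"
}

# Usage:
#   export ETCD_INITIAL_CLUSTER="$(build_initial_cluster \
#     3.etcd.corp.net 1.etcd.corp.net 2.etcd.corp.net)"
```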

and run this to join the node back to the cluster:

etcd \
--data-dir=/mnt/etcd \
--listen-peer-urls=https://10.10.23.3:2380 \
--listen-client-urls=https://10.10.23.3:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://3.etcd.corp.net:2379 \
--initial-cluster-token=etcd.cluster \
--cert-file=/etc/etcd/ssl/server.crt \
--key-file=/etc/etcd/ssl/server.key \
--trusted-ca-file=/etc/etcd/ssl/ca.crt \
--client-cert-auth=true \
--peer-cert-file=/etc/etcd/ssl/server.crt \
--peer-key-file=/etc/etcd/ssl/server.key \
--peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
--peer-client-cert-auth=true \
--auto-compaction-mode=revision \
--auto-compaction-retention=1000 \
--snapshot-count=10000 \
--heartbeat-interval=500 \
--election-timeout=5000

Verify permissions and correct them if needed:

chown -R etcd:etcd /mnt/etcd

Now stop the manually started etcd process (if it is still running) and restart the systemd service (or whatever process manager you use) so the node runs in the background normally:

systemctl restart etcd

Finally, check the cluster status from any node. All three members should show started.

etcdctl --endpoints=https://1.etcd.corp.net:2379 \
--cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/server.crt \
--key=/etc/etcd/ssl/server.key \
member list

Expected Output:

84c8bdaa4e889cbe, started, 1.etcd.corp.net, https://1.etcd.corp.net:2380, https://1.etcd.corp.net:2379, false
9aa5ae863a92585c, started, 2.etcd.corp.net, https://2.etcd.corp.net:2380, https://2.etcd.corp.net:2379, false
eb7647b280e36f87, started, 3.etcd.corp.net, https://3.etcd.corp.net:2380, https://3.etcd.corp.net:2379, false
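This final check can also be automated, for example in a monitoring script. The sketch below assumes the output format shown above; all_members_started is a hypothetical helper that succeeds only when exactly the expected number of members is listed and each one reports started.

```shell
# Hypothetical helper: succeed only if `etcdctl member list` output on stdin
# contains exactly $1 members and all of them are in the "started" state.
all_members_started() {
  awk -F', ' -v want="$1" '
    NF { total++ }                     # count non-empty lines
    $2 == "started" { started++ }      # count members that are started
    END { exit !(total == want && started == want) }
  '
}

# Usage, with the same TLS flags as above:
#   etcdctl --endpoints=https://1.etcd.corp.net:2379 ... member list \
#     | all_members_started 3 && echo "cluster looks healthy"
```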

Hope this saves you some headaches.

Why GitLab Fails with “Operation Not Permitted” on Windows Using Podman

If you run GitLab (or any application that modifies file permissions or ownership of files in volume mounts) in a container, you may see the installation fail with an error like:

chgrp: changing group of '/var/opt/gitlab/git-data/repositories': Operation not permitted

This error prevents GitLab from starting. Here’s why it happens—and the simplest way to fix it.

A local GitLab installation was required to troubleshoot and verify several production-critical queries. This setup is clearly not intended for production use and should be used only for testing and troubleshooting purposes.

Also, Windows is not officially supported, as the images have known compatibility issues with volume permissions and potentially other unknown issues (although I haven’t noticed any problems during a week of use).

On Windows, both Podman Desktop and Docker Desktop run containers by using WSL2.

The problem appears when you bind-mount a Windows directory (NTFS) into the container, for example:

E:\volumes\gitlab\data → /var/opt/gitlab

podman run --detach --hostname gitlab.example.com `
--env GITLAB_OMNIBUS_CONFIG="external_url 'http://gitlab.example.com'" `
--publish 443:443 --publish 80:80 --publish 22:22 `
--name gitlab --restart always `
--volume /e/volumes/gitlab/config:/etc/gitlab `
--volume /e/volumes/gitlab/logs:/var/log/gitlab `
--volume /e/volumes/gitlab/data:/var/opt/gitlab `
gitlab/gitlab-ce:18.5.4-ce.0

The same command works fine with Docker Desktop (E: is an external disk drive available to the Windows host).

What goes wrong

So far, we have the following flow:

  • GitLab requires real Linux filesystem permissions and ownership
  • During startup, it runs chown and chgrp on its data directories
  • Windows filesystems (NTFS) do not support Linux UID/GID ownership
  • Without the metadata mount option, WSL2 cannot translate these permission changes
  • The operation fails, and GitLab refuses to start

If both Podman and Docker are based on WSL2, why does Docker run GitLab on an E: drive without breaking a sweat? The root cause is the difference in how Docker and Podman translate file permissions.

Docker: if GitLab calls chgrp, WSL’s drvfs layer intercepts the call. It doesn’t actually change the Windows folder, but it records the “permission change” in a hidden metadata area (NTFS Extended Attributes).

Contents of /etc/wsl.conf in the Docker Desktop engine:

[automount]
root = /mnt/host
options = "metadata"
[interop]
enabled = true

When metadata is enabled as a mount option in WSL, extended attributes on Windows NT files can be added and interpreted to supply Linux file system permissions.

Podman: mounts Windows drives using the same standard WSL2 9p protocol and drvfs driver as Docker, but without the metadata mapping enabled by default. When GitLab (or your app) tries to set its required ownership, the mount simply refuses, causing the container to crash.

Here is the mount output for the E: drive from the Podman machine:

mount | grep " /mnt/e "
E:\ on /mnt/e type 9p (rw,noatime,aname=drvfs;path=E:\;uid=1000;gid=1000;symlinkroot=/mnt/,cache=5,access=client,msize=65536,trans=fd,rfd=5,wfd=5)

There is no metadata option on the mount because the wsl.conf of the Podman machine is this minimal:

[user]
default=user
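Checking a mount for the metadata option can be scripted, which helps when comparing the Docker and Podman machines. This is a rough sketch: mount_has_metadata is a hypothetical helper, and it assumes that an enabled option appears as a standalone "metadata" token in the parenthesised option list of the mount output (separated by commas or semicolons).

```shell
# Hypothetical helper: succeed if the given line of `mount` output lists
# "metadata" as a standalone option. Assumption: options are separated by
# "," or ";" inside the parentheses, as in the output above.
mount_has_metadata() {
  printf '%s\n' "$1" | grep -Eq '[(,;]metadata[,;)]'
}

# Usage inside the WSL machine:
#   mount_has_metadata "$(mount | grep ' /mnt/e ')" \
#     && echo "metadata enabled" || echo "metadata missing"
```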

Solution

The easiest solution depends on the engine:

  • Named volumes: work with both Docker and Podman, and are faster
  • Bind mount: works out of the box with Docker, but is slower
  • Custom wsl.conf plus a bind mount: needed for bind mounts with Podman, and also slower

Named volumes:

podman run --detach --hostname gitlab.example.com `
--env GITLAB_OMNIBUS_CONFIG="external_url 'http://gitlab.example.com'" `
--publish 443:443 --publish 80:80 --publish 22:22 `
--name gitlab --restart always `
--volume gitlab-config:/etc/gitlab `
--volume gitlab-logs:/var/log/gitlab `
--volume gitlab-data:/var/opt/gitlab `
gitlab/gitlab-ce:18.5.4-ce.0

The data will be stored in /var/lib/containers/storage/volumes (inside the Podman machine in this example).

The volumes can be accessed from Windows Explorer as well:

  • Docker: \\wsl$\docker-desktop\mnt\docker-desktop-disk\data\docker\volumes
  • Podman: \\wsl$\podman-machine-default\var\lib\containers\storage\volumes

Bind mounts:

docker run --detach `
  --hostname gitlab.example.com `
  --publish 443:443 --publish 80:80 --publish 22:22 `
  --name gitlab-bind-mount `
  --restart always `
  --volume /e/volumes/gitlab/config:/etc/gitlab `
  --volume /e/volumes/gitlab/logs:/var/log/gitlab `
  --volume /e/volumes/gitlab/data:/var/opt/gitlab `
  gitlab/gitlab-ce:18.5.4-ce.0

Custom wsl.conf (Podman):

[automount]
options = "metadata"

[user]
default=user

The [interop] section with enabled=true is not actually required since it is true by default. After changing the file, restart the Podman machine and try podman run again.
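Applying the custom configuration could look like the sketch below. It rests on several assumptions: write_wsl_conf is a hypothetical helper, the file lives at /etc/wsl.conf inside the Podman machine, and stopping and starting the machine is enough for WSL to re-read it. Note that the helper overwrites the target file, so back up the existing one first.

```shell
# Hypothetical helper: write the wsl.conf shown above to $1
# (default: /etc/wsl.conf). Overwrites the target file.
write_wsl_conf() {
  target="${1:-/etc/wsl.conf}"
  cat > "$target" <<'EOF'
[automount]
options = "metadata"

[user]
default=user
EOF
}

# Inside the Podman machine (for example via `podman machine ssh`), as root:
#   cp /etc/wsl.conf /etc/wsl.conf.bak
#   write_wsl_conf
# Then, from Windows, restart the machine:
#   podman machine stop
#   podman machine start
```

After the restart, the mount output for the drive should include the metadata option if the change took effect.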


Docker and Podman use different WSL default configurations. Docker tolerates emulated ownership changes by enabling the metadata option out of the box.

Podman, on the other hand, does not rely on this metadata and expects real Linux filesystem behavior. It is also daemonless and lighter than Docker—but that’s a story for another blog post.