Fixing GitLab Runner Issues on Kubernetes

Recently, I encountered and resolved several issues that were causing job failures and instability in an AWS-based GitLab Runner setup on Kubernetes. In this post, I’ll walk through the errors, their causes, and the solutions that worked for me.

Job Timeout While Waiting for Pod to Start

ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start

Adjusting poll_interval and poll_timeout helped resolve the issue:

  • poll_timeout (default: 180s) – The maximum time the runner waits before timing out while connecting to a newly created pod.
  • poll_interval (default: 3s) – How frequently the runner checks the pod’s status.

Increasing poll_timeout from 180s to 360s gave pods more time to start and eliminated these failures.
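These settings live under the [runners.kubernetes] section of the runner’s config.toml. A minimal sketch of the relevant fragment (I only raised poll_timeout; poll_interval is shown at its default):

```toml
[[runners]]
  [runners.kubernetes]
    # How long to wait for the pod to start before failing the job
    poll_timeout  = 360  # seconds; default 180
    # How often to check the pod's status while waiting
    poll_interval = 3    # seconds; default 3
```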

ErrImagePull: pull QPS exceeded

When the GitLab Runner starts multiple jobs that pull the same images (e.g., for service and build containers), it can exceed the kubelet’s default pull rate limits:

  • registryPullQPS (default: 5) – Limits the number of image pulls per second.
  • registryBurst (default: 10) – Allows temporary bursts above registryPullQPS.

Instead of modifying kubelet parameters, I resolved this issue by changing the runner’s image pull policy from always to if-not-present to prevent unnecessary pulls:

pull_policy = ["if-not-present"] # default policy for jobs
allowed_pull_policies = ["always", "if-not-present"] # lets a job request "always" from the pipeline if necessary
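With allowed_pull_policies in place, an individual job can still opt back into always from the pipeline definition via the image keyword in .gitlab-ci.yml (job and image names here are illustrative):

```yaml
build:
  image:
    name: alpine:3.21
    pull_policy: always   # overrides the runner default of if-not-present
  script:
    - echo "building with a freshly pulled image"
```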

TLS Error When Preparing Environment

ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error dialing backend: remote error: tls: internal error

GitLab Runner communicates with the Kubernetes API to create and attach to executor pods. The “error dialing backend” part means the API server could not establish a TLS connection to the kubelet on the target node; this can happen due to API slowness, network issues, or misconfigured certificates.

Setting the feature flag FF_WAIT_FOR_POD_TO_BE_REACHABLE to true helped resolve the issue by ensuring that the runner waits until the pod is fully reachable before proceeding. This can be set in the GitLab Runner configuration:

[runners.feature_flags]
  FF_WAIT_FOR_POD_TO_BE_REACHABLE = true

DNS Timeouts

dial tcp: lookup on <coredns ip>:53: read udp i/o timeout

While CoreDNS logs and network communication appeared normal, there was an unexpected spike in DNS load after launching more GitLab jobs than usual.

Scaling the CoreDNS deployment resolved the issue. Ideally, enable automatic DNS horizontal autoscaling to handle load variations (check the Kubernetes docs or your cloud provider’s specific solution; they all share the same approach – add more replicas when load increases).

kubectl scale deployments -n kube-system coredns --replicas=4

If you’ve encountered other GitLab Runner issues, share them in the comments 🙂

Cheers!

Solving Docker Hub rate limits in Kubernetes with containerd registry mirrors

When running Kubernetes workloads in AWS EKS (or any other environment), you may encounter the Docker Hub rate limit error:

429 Too Many Requests – Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading

Why are Docker Hub rate limits a problem? Docker Hub imposes strict pull rate limits:

  • Anonymous users – up to 10 pulls per hour
  • Authenticated users – up to 40 pulls per hour
  • Paid authenticated users – no pull rate limits

To check your current limit state, you first need to obtain a token:

For anonymous pulls:

TOKEN=$(curl "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)

For authenticated pulls:

TOKEN=$(curl --user 'username:password' "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)

Then make a HEAD request and inspect the response headers:

curl --head -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest

You should see the following headers (excerpt):


date: Sun, 26 Jan 2025 08:17:36 GMT
strict-transport-security: max-age=31536000
ratelimit-limit: 100;w=21600
ratelimit-remaining: 100;w=21600
docker-ratelimit-source: <hidden>
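The ratelimit-limit and ratelimit-remaining values use the format <count>;w=<window-seconds>. A small POSIX shell sketch for splitting them apart (the sample header below is illustrative, not a live response):

```shell
# Sample header line as returned by the registry (values are illustrative)
header="ratelimit-remaining: 100;w=21600"

# Strip the header name, then split the count and the window
value=${header#*: }     # "100;w=21600"
count=${value%%;*}      # "100" pulls remaining
window=${value##*w=}    # "21600"-second window (6 hours)

echo "remaining pulls: $count per $((window / 3600))h window"
```

For the sample above this prints remaining pulls: 100 per 6h window.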

Possible solutions

  • A DaemonSet that applies a predefined configuration on each of your k8s nodes
  • For AWS-based clusters, an EC2 Launch Template and its user data
  • For AWS-based clusters, AWS Systems Manager and the aws:runShellScript action
  • Updating the config manually – though in most cases cluster nodes are short-lived due to the autoscaler (you can reuse the shell script from the DaemonSet below; a containerd service restart is not required)

In this guide, we’ll use the DaemonSet approach on AWS EKS with containerd (containerd-1.7.11-1.amzn2.0.1.x86_64) and Kubernetes 1.30.

  1. Check your containerd config at /etc/containerd/config.toml and make sure it contains config_path = "/etc/containerd/certs.d"
  2. Per-registry host configuration is stored at /etc/containerd/certs.d/<registry host>/hosts.toml (e.g. /etc/containerd/certs.d/docker.io/hosts.toml for Docker Hub)
  3. The following manifest adds the required files and folders, so existing and future nodes are automatically configured with the mirror by the DaemonSet. An initContainer updates the node’s configuration, and a wait container keeps the DaemonSet pod running on the node. Adjust taints and tolerations, the priority class, or other fields to fit your requirements.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    name: containerd-registry-mirror
    cluster: clustername
    otherlabel: labelvalue
  name: containerd-registry-mirror
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: containerd-registry-mirror
  template:
    metadata:
      labels:
        name: containerd-registry-mirror
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: poolname
      priorityClassName: system-node-critical
      initContainers:
      - image: alpine:3.21
        imagePullPolicy: IfNotPresent
        name: change-hosts-file-init
        command:
          - /bin/sh
          - -c
          - |
            #!/bin/sh
            set -euo pipefail
            TARGET="/etc/containerd/certs.d/docker.io"
            cat << EOF > "$TARGET/hosts.toml"
            server = "https://registry-1.docker.io"

            [host."https://<your private registry>"]
              capabilities = ["pull", "resolve"]
            EOF
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
          requests:
            cpu: 50m
            memory: 100Mi
        volumeMounts:
        - mountPath: /etc/containerd/certs.d/docker.io/
          name: docker-mirror
      containers:
      - name: wait
        image: registry.k8s.io/pause:3.9
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 50m
            memory: 100Mi
          requests:
            cpu: 10m
            memory: 20Mi
      volumes:
      - name: docker-mirror
        hostPath:
          path: /etc/containerd/certs.d/docker.io/

4. Apply the manifest to your cluster: kubectl apply -f containerd-registry-mirror.yaml, then monitor the DaemonSet status: kubectl get daemonset containerd-registry-mirror -n kube-system

5. To double-check, SSH to a node and inspect the content of /etc/containerd/certs.d/docker.io/hosts.toml

6. If you need to set up a default mirror for ALL registries, use the path /etc/containerd/certs.d/_default/hosts.toml
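For example, a _default/hosts.toml that routes every registry through one mirror might look like this (the mirror URL is a placeholder you would replace; with no server line, containerd falls back to the original registry when no matching host directory exists):

```toml
# /etc/containerd/certs.d/_default/hosts.toml
# Fallback host configuration for any registry without its own directory
[host."https://<your mirror host>"]
  capabilities = ["pull", "resolve"]
```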

Hope this helps anyone facing the same issue.

Got questions or need help? Drop a comment or share your experience

Have a smooth containerization!