Fixing GitLab Runner Issues on Kubernetes

Recently, I encountered and resolved several issues that were causing job failures and instability in an AWS-based GitLab Runner setup on Kubernetes. In this post, I’ll walk through the errors, their causes, and the solutions that worked for me.

Job Timeout While Waiting for Pod to Start

ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start

Adjusting poll_interval and poll_timeout helped resolve the issue:

  • poll_timeout (default: 180s) – The maximum time the runner waits before timing out while connecting to a newly created pod.
  • poll_interval (default: 3s) – How frequently the runner checks the pod’s status.

Increasing poll_timeout from 180s to 360s gave pods more time to start and prevented these failures.
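For reference, both settings live in the [runners.kubernetes] section of config.toml. A minimal sketch with the values I ended up with (the poll_interval value here is illustrative):

[runners.kubernetes]
  poll_timeout = 360  # seconds; default is 180
  poll_interval = 5   # seconds; default is 3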

ErrImagePull: pull QPS exceeded

When the GitLab Runner starts multiple jobs that pull the same images (e.g., for service and build containers), it can exceed the kubelet’s default image pull rate limits (see the sketch after this list):

  • registryPullQPS (default: 5) – Limits the number of image pulls per second.
  • registryBurst (default: 10) – Allows temporary bursts above registryPullQPS.
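For completeness, these limits can be raised in the kubelet configuration itself. A sketch of the relevant KubeletConfiguration fields (the values are illustrative, and how you apply this to managed nodes depends on your provisioning setup):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registryPullQPS: 10  # image pulls per second (default: 5)
registryBurst: 20    # temporary burst ceiling (default: 10)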

Instead of modifying kubelet parameters, I resolved this issue by changing the runner’s image pull policy from always to if-not-present to prevent unnecessary pulls:

[runners.kubernetes]
  pull_policy = ["if-not-present"]                      # the default for all jobs
  allowed_pull_policies = ["always", "if-not-present"]  # jobs may still request "always" when necessary
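With allowed_pull_policies in place, an individual job can still force a fresh pull when it matters. A sketch for .gitlab-ci.yml (the job and image names are illustrative):

build:
  image:
    name: alpine:3.19
    pull_policy: always  # overrides the runner default of if-not-present
  script:
    - echo 'building with a freshly pulled image'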

TLS Error When Preparing Environment

ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error dialing backend: remote error: tls: internal error

GitLab Runner communicates securely with the Kubernetes API to create executor pods. A TLS failure like this can occur due to API server slowness, network issues, or misconfigured certificates.

Setting the feature flag FF_WAIT_FOR_POD_TO_BE_REACHABLE to true helped resolve the issue by ensuring that the runner waits until the pod is fully reachable before proceeding. This can be set in the GitLab Runner configuration:

[runners.feature_flags]
  FF_WAIT_FOR_POD_TO_BE_REACHABLE = true
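If you deploy the runner with the official gitlab-runner Helm chart, the same flag can be passed through the runners.config value; a sketch, assuming the chart’s TOML config template:

runners:
  config: |
    [[runners]]
      [runners.feature_flags]
        FF_WAIT_FOR_POD_TO_BE_REACHABLE = true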

DNS Timeouts

dial tcp: lookup on <coredns ip>:53: read udp i/o timeout

While CoreDNS logs and network communication appeared normal, there was an unexpected spike in DNS load after launching more GitLab jobs than usual.

Scaling the CoreDNS deployment resolved the issue. Ideally, enable automatic DNS horizontal autoscaling to handle load variations; the Kubernetes docs and the cloud providers’ specific solutions all share the same approach: add more replicas as load increases.

kubectl scale deployments -n kube-system coredns --replicas=4
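For the autoscaling route, the Kubernetes docs describe running cluster-proportional-autoscaler against CoreDNS. Its scaling behavior is driven by a ConfigMap like the one below (the numbers are illustrative and should be tuned per cluster):

kubectl create configmap dns-autoscaler --namespace=kube-system \
  --from-literal=linear='{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"preventSinglePointFailure":true}'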

If you’ve encountered other GitLab Runner issues, share them in the comments 🙂

Cheers!

Argo CD Login and API: received unexpected content-type or TLS handshake timeout

My Argo CD runs on AWS EKS and is exposed via a standard Kubernetes Ingress (traefik class), which in turn sits behind an AWS ELB. Additionally, the server.insecure parameter of the Argo CD server is set to “true” (configured in the argocd-cmd-params-cm ConfigMap), with TLS termination happening on the ingress side.

There are no issues with the Argo CD UI. However, I am unable to access the Argo CD API using simple curl requests or the Argo CD CLI. I keep receiving errors related to content-type and TLS handshake failures:

argocd login argocd.example.com --grpc-web --insecure --skip-test-tls
FATA[0036] rpc error: code = Unknown desc = Post "https://argocd.example.com/session.SessionService/Create": net/http: TLS handshake timeout
FATA[0045] rpc error: code = Unimplemented desc = unexpected HTTP status code received from server: 404 (Not Found); transport: received unexpected content-type "text/plain; charset=utf-8"
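The failure was not CLI-specific. A plain curl against the REST login endpoint hung the same way (a sketch; /api/v1/session is Argo CD’s session-create endpoint, credentials elided):

curl -vk https://argocd.example.com/api/v1/session \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"<password>"}'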

All requests are sent from a WSL instance (Ubuntu 22.04). Note that I have no issues accessing the API when using port forwarding or when connecting from the Windows host itself (outside WSL).

I was about to give up, but then I decided to check the MTU size.

Get-NetIPInterface -AddressFamily IPv4 | Sort-Object -Property NlMtu | Select ifIndex, InterfaceAlias, NlMtu -first 5
ifIndex InterfaceAlias            NlMtu
------- --------------            -----
     19 Ethernet 3                 1392
     20 Local Area Connection* 1   1500
      8 Ethernet (WSL)             1500

Ethernet 3 is my VPN interface, and the API is only reachable through it; its MTU is 1392.

Ethernet (WSL) is the interface WSL uses, and it is still at the default MTU of 1500, so there is an MTU mismatch: packets leaving WSL can be larger than the VPN interface can forward.

The solution is to lower the WSL interface’s MTU to match the VPN interface, 1392 in my case (the exact value may vary).
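To find the right value, probe from inside WSL with ping and the don’t-fragment flag. The payload of 1364 below is 1392 minus 28 bytes of IP and ICMP headers; lower it until the ping succeeds:

ping -c 3 -M do -s 1364 argocd.example.com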

In your WSL instance, run the following:

ifconfig                     # list interfaces; note your main one (typically eth0)
sudo ifconfig eth0 mtu 1392  # change the MTU to match the VPN interface
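On a stock Ubuntu 22.04 WSL image, ifconfig may be missing (it ships in the net-tools package); the iproute2 equivalent is:

ip link show                        # list interfaces
sudo ip link set dev eth0 mtu 1392  # same MTU change via iproute2

Note that the change does not persist across a wsl --shutdown, so you may want to script it in your shell profile.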

I hope it helps!