Fixing GitLab Runner Issues on Kubernetes

Recently, I encountered and resolved several issues that were causing job failures and instability in an AWS-based GitLab Runner setup on Kubernetes. In this post, I’ll walk through the errors, their causes, and the solutions that worked for me.

Job Timeout While Waiting for Pod to Start

ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start

Adjusting poll_interval and poll_timeout helped resolve the issue:

  • poll_timeout (default: 180s) – The maximum time the runner waits for a newly created pod to start before timing out.
  • poll_interval (default: 3s) – How frequently the runner checks the pod’s status.

Increasing poll_timeout from 180s to 360s gave pods more time to start and prevented these failures.
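
For reference, a minimal sketch of the relevant config.toml section, with poll_timeout raised as described (the poll_interval shown is just the default):

[runners.kubernetes]
  # maximum time to wait for a new pod to start, in seconds (default: 180)
  poll_timeout = 360
  # how often the runner checks the pod's status, in seconds (default: 3)
  poll_interval = 3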

ErrImagePull: pull QPS exceeded

When the GitLab Runner starts multiple jobs that require pulling the same images (e.g., for service and build containers), it can exceed the kubelet’s default image pull rate limits:

  • registryPullQPS (default: 5) – Limits the number of image pulls per second.
  • registryBurst (default: 10) – Allows temporary bursts above registryPullQPS.
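
If you do control the kubelet configuration, these limits can be raised there; a sketch of the relevant KubeletConfiguration fields (the values are illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# sustained image pulls per second (default: 5)
registryPullQPS: 10
# temporary burst ceiling above registryPullQPS (default: 10)
registryBurst: 20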

Instead of modifying kubelet parameters, I resolved this issue by changing the runner’s image pull policy from always to if-not-present to prevent unnecessary pulls:

[runners.kubernetes]
  pull_policy = ["if-not-present"] # the default policy for all jobs
  allowed_pull_policies = ["always", "if-not-present"] # lets a pipeline request always when necessary
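
With allowed_pull_policies in place, an individual job can still force a fresh pull; a sketch of how that looks in .gitlab-ci.yml, assuming a GitLab version that supports the image:pull_policy keyword (the job and image names are placeholders):

test:
  image:
    name: alpine:3.20
    # override the runner's default pull policy for this job only
    pull_policy: always
  script:
    - echo "image pulled fresh"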

TLS Error When Preparing Environment

ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error dialing backend: remote error: tls: internal error

GitLab Runner communicates securely with the Kubernetes API to create executor pods. A TLS failure can occur due to API slowness, network issues, or misconfigured certificates.

Setting the feature flag FF_WAIT_FOR_POD_TO_BE_REACHABLE to true helped resolve the issue by ensuring that the runner waits until the pod is fully reachable before proceeding. This can be set in the GitLab Runner configuration:

[runners.feature_flags]
  FF_WAIT_FOR_POD_TO_BE_REACHABLE = true
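
Alternatively, the same flag can be passed through the runner’s environment list in config.toml:

[[runners]]
  environment = ["FF_WAIT_FOR_POD_TO_BE_REACHABLE=true"]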

DNS Timeouts

dial tcp: lookup on <coredns ip>:53: read udp i/o timeout

While CoreDNS logs and network communication appeared normal, there was an unexpected spike in DNS load after launching more GitLab jobs than usual.

Scaling the CoreDNS deployment resolved the issue. Ideally, enabling automatic DNS horizontal autoscaling is preferable for handling load variations (check the Kubernetes docs or your cloud provider’s specific solution: they all share the same approach – add more replicas when load increases).

kubectl scale deployments -n kube-system coredns --replicas=4
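
For the autoscaling route, Kubernetes provides a cluster-proportional-autoscaler for DNS; once it is deployed, its scaling behaviour is driven by a ConfigMap along these lines (the parameters are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  # one replica per 16 nodes or per 256 cores, whichever yields more, never fewer than 2
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"preventSinglePointFailure":true}'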

If you’ve encountered other GitLab Runner issues, share them in the comments 🙂

Cheers!

Application Gateway: Incorrect certificate chain or order

SSL management is always a pain. We should check SSL certificates periodically or implement a solution that handles all the management tasks for us (Let’s Encrypt and cert-manager, for instance). And when there is an issue with a certificate, downtime is almost always involved, so we have to find a solution as quickly as possible. Furthermore, all websites should meet the requirements to pass tests and get a “green” mark from checkers such as Mozilla Observatory or SSL Shopper. In this post, we’ll discuss issues you may face during an SSL check: “incorrect certificate chain” or “incorrect order, contains anchor”.

Please note that my setup includes Azure Application Gateway and Azure Kubernetes Service. The following steps are general; however, your environment may require different certificate formats or signature algorithms, so check its requirements beforehand.

  • In my case, the culprit was a wrong intermediate certificate provided by GoDaddy, so I went to the GoDaddy site, opened the certificate, and copied the intermediate certificate into a file named intermediate.cer:
[Screenshot: godaddy.com > Intermediate certificate]
  • Make sure you have OpenSSL installed, then create a new PFX that contains the certificate, the private key, and the intermediate certificate:
    openssl pkcs12 -export -out appgw-cert.pfx -inkey .\pk.key -in .\ssl.crt -certfile .\intermediate.cer
  • If you only have an old PFX with a valid certificate and key, extract them first:
    openssl pkcs12 -in old.pfx -nocerts -nodes -out pk.key
    openssl pkcs12 -in old.pfx -clcerts -nokeys -out cert.crt
    openssl pkcs12 -export -out new.pfx -inkey .\pk.key -in .\cert.crt -certfile .\intermediate.cer
  • Type a password for the PFX, and then update the Azure Application Gateway if needed:
    $appGW = Get-AzApplicationGateway -Name "ApplicationGatewayName" -ResourceGroupName "ResourceGroupName"
    $password = ConvertTo-SecureString $passwordPlainString -AsPlainText -Force
    Set-AzApplicationGatewaySslCertificate -ApplicationGateway $appGW -Name "CertName" -CertificateFile "D:\certname.pfx" -Password $password
    Set-AzApplicationGateway -ApplicationGateway $appGW # apply the updated certificate to the gateway
  • Also, export the PFX certificate to your personal certificate store and make sure the correct chain is used, or use ssllabs.com to check the already-updated certificate; you can also verify the chain locally with OpenSSL, as shown after this list.
[Screenshot: ssllabs.com certificate chain]
  • …and finally, my certificate is “green”:
[Screenshot: ssllabs.com overall rating]
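
For the local check mentioned above, a minimal sketch using OpenSSL, assuming the file names from the steps in this post:

    # validate the leaf against the system roots, using the intermediate to build the chain
    openssl verify -untrusted intermediate.cer ssl.crt
    # inspect the contents (and order) of the certificates inside the new PFX; prompts for its password
    openssl pkcs12 -in appgw-cert.pfx -info -nokeys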