azure – UseIT | Roman Levchenko

I recently migrated our highly available Allure TestOps environment from Kubernetes (AWS) to an RPM-based setup running on-premises. During the migration, I ran into several configuration problems related to task executors, Redis Sentinel with TLS, authentication handling, and RabbitMQ quorum queues.

Before diving into the problems, here is the architecture used in the target environment (on-premise):

RabbitMQ cluster (3 nodes, initially using classic queues; later migrated to quorum queues)
Redis Sentinel (3 nodes, TLS enabled, internal CA)
PostgreSQL (3 nodes, Patroni, dedicated 3-node etcd cluster)
Stateless Allure TestOps (3 “replicas”; version 26.1.1, later 26.2.1.5)

Issue #1 — Spring Task Executor Initialization Failure

After the migration and startup of the new RPM-based Allure instance, the application failed during initialization with the following error:

org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'exportController'
...
Failed to instantiate [org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor]: Factory method 'taskExecutor' threw exception with message: null

The issue was related to changes in task executor logic introduced in the 26.1.1 release. The default thread pool configuration was no longer sufficient for our workload. Although existing executor settings were already present, additional pool sizing parameters became mandatory in practice.

Adding the following parameters resolved the startup issue:

			
ALLURE_TASKEXECUTOR_MAXPOOLSIZE=200
ALLURE_TASKEXECUTOR_QUEUECAPACITY=1000
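The stack trace ends with "message: null", which gives little to go on. An exception without a message is consistent with (though not proof of) the argument checks in Java's ThreadPoolExecutor constructor, which Spring's ThreadPoolTaskExecutor builds on: it throws a message-less IllegalArgumentException when, for example, the maximum pool size is smaller than the core pool size. The sketch below mirrors those checks in Python as a pre-flight test. The function name and the sample values are hypothetical, and whether this is what Allure TestOps actually tripped over is an assumption.

```python
def pool_setting_problems(core_pool_size, max_pool_size, keep_alive_seconds=60):
    """Return the reasons java.util.concurrent.ThreadPoolExecutor would reject these sizes.

    Mirrors the constructor's argument checks; an empty list means the
    combination would be accepted.
    """
    problems = []
    if core_pool_size < 0:
        problems.append("core pool size must not be negative")
    if max_pool_size <= 0:
        problems.append("max pool size must be positive")
    if max_pool_size < core_pool_size:
        problems.append("max pool size must not be smaller than core pool size")
    if keep_alive_seconds < 0:
        problems.append("keep-alive time must not be negative")
    return problems


# Hypothetical values: a core size of 50 fits under the max of 200 used above,
# while a core size of 300 would be rejected.
print(pool_setting_problems(core_pool_size=50, max_pool_size=200))
print(pool_setting_problems(core_pool_size=300, max_pool_size=200))
```

If the check reports a problem for your values, comparing the existing core pool size against the new maximum is a reasonable first step before digging further.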

Issue #2 — Redis Sentinel TLS Connection Failures

The next issue involved connecting Allure TestOps to Redis Sentinel with TLS enabled:

Caused by: io.lettuce.core.RedisConnectionException:
Cannot connect to Redis Sentinel at redis://1.redisdb.corp.net:26379

java.net.SocketException: Connection reset

The issue was caused by an incorrect Redis SSL property.

Initially configured:

SPRING_REDIS_SSL=true

However, Spring Data Redis expects a different property when Redis Sentinel is used:

SPRING_DATA_REDIS_SSL_ENABLED=true

After updating the property, the TLS connection to Redis Sentinel started working correctly.
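Spring Boot derives environment variable names from property names by replacing dots with underscores, removing dashes, and upper-casing the result. A small sketch of that rule makes it easier to check which property a given variable maps to, for example that SPRING_DATA_REDIS_SSL_ENABLED corresponds to spring.data.redis.ssl.enabled. The helper name is hypothetical, and the function covers only the basic rule, not list indices.

```python
def property_to_env(name: str) -> str:
    """Apply Spring Boot's basic property-to-environment-variable naming rule."""
    return name.replace(".", "_").replace("-", "").upper()


for prop in (
    "spring.redis.ssl",
    "spring.data.redis.ssl.enabled",
    "spring.data.redis.sentinel.password",
):
    print(f"{prop} -> {property_to_env(prop)}")
```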

Issue #3 — Redis Sentinel Authentication Problems

After fixing TLS, another Redis-related issue appeared during startup:

NOAUTH HELLO must be called with the client already authenticated. Alternatively, the HELLO <proto> AUTH <user> <pass> option can be used to authenticate the client and select the RESP protocol version at the same time.

Authentication for Redis Sentinel itself was missing.

Even though the main Redis password was configured, Sentinel authentication requires its own dedicated parameter.

Adding the following parameter resolved the issue:

SPRING_DATA_REDIS_SENTINEL_PASSWORD=pass

And the final Redis configuration looked like this:

			
# Redis
SPRING_DATA_REDIS_SENTINEL_MASTER=redis
SPRING_DATA_REDIS_SENTINEL_NODES=1.redisdb.corp.net:26379,2.redisdb.corp.net:26379,3.redisdb.corp.net:26379
SPRING_DATA_REDIS_SENTINEL_PASSWORD=pass
SPRING_DATA_REDIS_PASSWORD=pass
SPRING_DATA_REDIS_SSL_ENABLED=true
SPRING_SESSION_STORE_TYPE=REDIS
SPRING_DATA_REDIS_DATABASE=0
ALLURE_REDIS_SESSIONTTL=10d
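Since each of the Redis problems above came down to a missing or misnamed variable, a quick pre-flight check of the environment file can catch them before startup. The following is a minimal sketch, assuming a plain KEY=VALUE file; the function names and the list of required keys are my own choice based on the configuration above, not something Allure TestOps provides.

```python
REQUIRED_KEYS = (
    "SPRING_DATA_REDIS_SENTINEL_MASTER",
    "SPRING_DATA_REDIS_SENTINEL_NODES",
    "SPRING_DATA_REDIS_SENTINEL_PASSWORD",
    "SPRING_DATA_REDIS_PASSWORD",
    "SPRING_DATA_REDIS_SSL_ENABLED",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def sentinel_nodes(settings):
    """Split the node list into (host, port) pairs; each entry must be host:port."""
    nodes = []
    for entry in settings.get("SPRING_DATA_REDIS_SENTINEL_NODES", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        nodes.append((host, int(port)))
    return nodes


def missing_keys(settings):
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


sample = """
# Redis
SPRING_DATA_REDIS_SENTINEL_MASTER=redis
SPRING_DATA_REDIS_SENTINEL_NODES=1.redisdb.corp.net:26379,2.redisdb.corp.net:26379
SPRING_DATA_REDIS_PASSWORD=pass
SPRING_REDIS_SSL=true
"""
settings = parse_env(sample)
print(sentinel_nodes(settings))
print(missing_keys(settings))
```

The sample deliberately repeats the two mistakes described above (the old SSL variable and no Sentinel password), so both keys show up as missing.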

		

Issue #4 — RabbitMQ Quorum Queues Not Being Created

The last major issue appeared after upgrading to Allure TestOps 26.2.1.4.

This release introduced support for quorum queues in RabbitMQ.

The RabbitMQ cluster was already configured with:

"default_queue_type": "quorum"

However, Allure continued creating classic queues.

Even though quorum was set as the default queue type on the RabbitMQ side, Allure TestOps still required explicit quorum queue activation through application configuration.

The following parameters must be configured:

			
# Enables quorum queues for all declared queues.
ALLURE_UPLOAD_QUORUM_ENABLED=true
# Defines the number of replicas for quorum queues.
ALLURE_UPLOAD_QUORUM_INITIALGROUPSIZE=3
# Controls the maximum number of message redeliveries.
ALLURE_UPLOAD_QUORUM_DELIVERYLIMIT=5
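On the RabbitMQ side, a quorum queue is requested through queue arguments at declaration time: x-queue-type, x-quorum-initial-group-size, and x-delivery-limit. The sketch below shows those arguments and the failure tolerance that follows from the group size. That Allure TestOps translates its three settings into exactly these arguments is my assumption, and both function names are hypothetical.

```python
def quorum_queue_arguments(initial_group_size, delivery_limit):
    """Build the RabbitMQ queue arguments for a quorum queue declaration."""
    return {
        "x-queue-type": "quorum",
        "x-quorum-initial-group-size": initial_group_size,
        "x-delivery-limit": delivery_limit,
    }


def tolerated_member_failures(group_size):
    """A quorum queue stays available while a majority of its members is up."""
    majority = group_size // 2 + 1
    return group_size - majority


print(quorum_queue_arguments(initial_group_size=3, delivery_limit=5))
print(tolerated_member_failures(3))
```

With a group size of 3 on a 3-node cluster, every node holds a replica and the queue survives the loss of one node.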

Bonus Observation: RabbitMQ Messages Stuck in Ready State

I also encountered an unusual issue with RabbitMQ 4.3.1 and Allure TestOps 26.2.4, a few days after migrating to quorum queues. At this point, it is unclear whether it was caused by a product bug, RabbitMQ-specific behavior, or a configuration issue.

The environment had been operating normally with no signs of instability. However, on one occasion, approximately 4,000 messages accumulated in the Ready state and were not consumed by Allure TestOps. The logs were clean, and there were no connectivity issues or performance bottlenecks.

The issue was resolved by simply restarting the Allure TestOps application instances, after which message consumption resumed immediately and the queue was processed successfully.

Since the problem occurred only once and has not been reproduced, I was unable to determine the exact root cause. Teams running large-scale Allure TestOps deployments with RabbitMQ quorum queues may want to monitor queue consumer activity and message backlog metrics closely after upgrades.
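One way to watch for such a backlog is the RabbitMQ management HTTP API, whose /api/queues endpoint reports messages_ready and consumers for each queue. The following is a minimal sketch, assuming the management plugin is enabled; the function names, the threshold, and the queue names in the sample data are hypothetical.

```python
import base64
import json
import urllib.request


def stalled_queues(queues, ready_threshold=1000):
    """Return (name, ready, consumers) for queues whose Ready backlog reaches the threshold."""
    flagged = []
    for queue in queues:
        ready = queue.get("messages_ready", 0)
        if ready >= ready_threshold:
            flagged.append((queue.get("name"), ready, queue.get("consumers", 0)))
    return flagged


def fetch_queues(base_url, user, password):
    """Fetch the queue list from the management API using HTTP basic auth."""
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Sample data shaped like the API response (queue names are made up).
sample = [
    {"name": "example-upload", "messages_ready": 4000, "consumers": 3},
    {"name": "example-other", "messages_ready": 12, "consumers": 3},
]
print(stalled_queues(sample))
```

Note that consumers can still be attached while nothing is being consumed, so a large Ready count is worth alerting on even when the consumer count looks healthy.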

I hope these undocumented findings will save someone hours or days when migrating or upgrading Allure TestOps.

I recently encountered and resolved several issues that were causing job failures and instability in an AWS-based GitLab Runner setup on Kubernetes. In this post, I’ll walk through the errors, their causes, and the solutions that worked for me.

Job Timeout While Waiting for Pod to Start

ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start

Adjusting poll_interval and poll_timeout helped resolve the issue:

poll_timeout (default: 180s) – The maximum time the runner waits before timing out while connecting to a newly created pod.
poll_interval (default: 3s) – How frequently the runner checks the pod’s status.

Increasing poll_timeout from 180s to 360s gave the pod more time to start and prevented these failures.
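To see what the change means in terms of status checks, here is a small sketch. It assumes the runner simply divides the timeout by the interval to get the number of polls, which I have not verified against the runner's source; the function name is hypothetical.

```python
def poll_attempts(poll_timeout, poll_interval):
    """Number of status checks that fit into the timeout (both in seconds)."""
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")
    return poll_timeout // poll_interval


print(poll_attempts(180, 3))  # default timeout and interval
print(poll_attempts(360, 3))  # increased timeout, default interval
```

Under that assumption, doubling poll_timeout while keeping poll_interval at 3s doubles the number of checks before the job fails.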

ErrImagePull: pull QPS exceeded

When the GitLab Runner starts multiple jobs that require pulling the same images (e.g., for services and builders), it can exceed the kubelet’s default pull rate limits:

registryPullQPS (default: 5) – Limits the number of image pulls per second.
registryBurst (default: 10) – Allows temporary bursts above registryPullQPS.

Instead of modifying kubelet parameters, I resolved this issue by changing the runner’s image pull policy from always to if-not-present to prevent unnecessary pulls:

[runners.kubernetes]
  pull_policy = ["if-not-present"] # default policy for jobs
  allowed_pull_policies = ["always", "if-not-present"] # lets a pipeline request always when necessary
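The two kubelet limits behave like a token bucket: the bucket holds up to registryBurst tokens and refills at registryPullQPS tokens per second, and a pull that finds the bucket empty is rejected. The sketch below simulates that with the default values; it is a simplified model with hypothetical class and method names, not the kubelet's code.

```python
class TokenBucket:
    """Simplified token bucket: `burst` capacity, refilled at `qps` tokens per second."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def try_accept(self, now):
        """Take one token at time `now` (seconds); return False if the bucket is empty."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 15 pulls arriving at the same instant against the defaults (QPS 5, burst 10).
bucket = TokenBucket(qps=5, burst=10)
results = [bucket.try_accept(0.0) for _ in range(15)]
print(f"accepted: {results.count(True)}, rejected: {results.count(False)}")
```

This is why a wave of jobs starting together can fail even on a healthy registry, and why avoiding the pulls altogether with if-not-present is effective.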

TLS Error When Preparing Environment

ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error dialing backend: remote error: tls: internal error

GitLab Runner communicates securely with the Kubernetes API to create executor pods. A TLS failure can occur due to API slowness, network issues, or misconfigured certificates.

Setting the feature flag FF_WAIT_FOR_POD_TO_BE_REACHABLE to true helped resolve the issue by ensuring that the runner waits until the pod is fully reachable before proceeding. This can be set in the GitLab Runner configuration:

[runners.feature_flags]
  FF_WAIT_FOR_POD_TO_BE_REACHABLE = true

DNS Timeouts

dial tcp: lookup on <coredns ip>:53: read udp i/o timeout

While CoreDNS logs and network communication appeared normal, there was an unexpected spike in DNS load after launching more GitLab jobs than usual.

Scaling the CoreDNS deployment manually, as in the command below, resolved the issue. Ideally, DNS horizontal autoscaling should be enabled to handle load variations (see the Kubernetes docs or your cloud provider’s solution; they all share the same approach of adding replicas as load increases).

kubectl scale deployments -n kube-system coredns --replicas=4
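For the autoscaling route, the Kubernetes documentation describes the cluster-proportional-autoscaler, whose linear mode sizes the DNS deployment from the number of nodes and cores in the cluster rather than from query load directly. The sketch below reproduces that formula; the default values of 256 cores and 16 nodes per replica are taken from the documentation's example and should be treated as assumptions for your cluster, the function name is hypothetical, and the preventSinglePointFailure option is not modeled.

```python
import math


def linear_replicas(cores, nodes, cores_per_replica=256, nodes_per_replica=16,
                    min_replicas=1, max_replicas=None):
    """Replica count in linear mode: the larger of the per-core and per-node ratios."""
    replicas = max(
        math.ceil(cores / cores_per_replica),
        math.ceil(nodes / nodes_per_replica),
    )
    replicas = max(replicas, min_replicas)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return replicas


# Hypothetical cluster sizes.
print(linear_replicas(cores=128, nodes=40))
print(linear_replicas(cores=512, nodes=64))
```

Because runner pods drive node count up on an autoscaled cluster, this ratio tends to follow CI load indirectly; lowering nodes_per_replica makes CoreDNS scale out sooner.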

If you have encountered other GitLab Runner issues, share them in the comments 🙂

Cheers!


How to fix Allure TestOps configuration errors (PostgreSQL, Redis, Spring, RabbitMQ)
