Simple ClickHouse backup agent

Learn how to back up and restore ClickHouse data

In today’s data-driven world, managing and protecting your data is paramount. We have already learned how to back up PostgreSQL here, and now it’s ClickHouse’s turn: an open-source columnar database management system renowned for its high performance and scalability.

I will explain a Dockerized cron job designed to facilitate the backup of ClickHouse databases, whether they are located on a single host or within a clustered environment with shards and replicas.

The agent is based on the Alpine Docker image and uses the clickhouse-backup tool, which has the ability to work as a REST API service. Logrotate has been added to manage the log files produced by the backup agent.

The agent does the following:

  • creates scheduled FULL or DIFF backups (POST to /backup/create)
  • checks “create backup” action status before every upload (GET to /backup/status)
  • uploads each backup to a remote storage (POST to /backup/upload/)
  • checks and waits until upload operation finishes (GET to /backup/actions)
  • manages log file with API responses and errors
  • generates customized output to standard container logs
  • if a backup is not uploaded to remote storage, it’s marked as failed and will not be used as the last backup for subsequent DIFF backups
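The endpoint flow above can be sketched as a dry run that only prints the requests the agent would issue. The server name, port, and backup name are placeholders; the real script also passes `-u "$API_USERNAME:$API_PASSWORD"` and parses the JSON replies with jq.

```shell
#!/bin/sh
# Print the request sequence for one FULL backup cycle (dry run, nothing is called).
SERVER="clickhouse-backup"; PORT=7171; NAME="full-backup"

print_cycle() {
    echo "POST http://$SERVER:$PORT/backup/create?name=$NAME"
    echo "GET  http://$SERVER:$PORT/backup/status"
    echo "POST http://$SERVER:$PORT/backup/upload/$NAME"
    echo "GET  http://$SERVER:$PORT/backup/actions"
}

print_cycle
```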

Important: according to the clickhouse-backup official FAQ, “incremental backups calculate increment only during executing upload or create_remote command or similar REST API request”. In other words, DIFF and FULL local backups are actually the same (clickhouse-backup list local). Clickhouse-backup creates local backups first before uploading them to remote storage.

If you list remote backups using the command (clickhouse-backup list remote), you will notice the distinction between the two backup types. This is why the agent only issues a warning when you attempt to create a DIFF backup for the first time without any prior FULL backup.

Default settings:

  • DIFF backups: every hour from Sunday through Friday, plus every hour from 00:00 through 20:00 on Saturday
  • FULL backups: every Saturday at 8:30 PM
  • Logs are rotated and compressed weekly, with 14 rotations kept before removal
  • Clickhouse-backup API basic authentication is enabled (rlAPIuser)
  • Clickhouse server authentication is enabled (rlbackup)
  • Remote storage is ftp with authentication enabled
  • Backups to keep local: 6
  • Backups to keep remote: 336
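The default schedule above maps onto cron roughly like this (a sketch; the real entries ship in ./agent/cronfile, and the API address and credentials here are placeholders):

```shell
# DIFF backups: hourly, Sunday through Friday (0 = Sunday)
0 * * * 0-5 /backup/config/clickhouse-backup.sh create_diff_backup ch-backup-api APIUSER APIPASSWORD
# DIFF backups: hourly from 00:00 through 20:00 on Saturday
0 0-20 * * 6 /backup/config/clickhouse-backup.sh create_diff_backup ch-backup-api APIUSER APIPASSWORD
# FULL backup: Saturday at 20:30
30 20 * * 6 /backup/config/clickhouse-backup.sh create_full_backup ch-backup-api APIUSER APIPASSWORD
```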

Content

  • docker-compose.yml – describes the environment to test the agent locally.
    It includes the following services:
    – clickhouse server (clickhouse-server:23.8-alpine)
    – clickhouse-backup (altinity/clickhouse-backup:2.4.0)
    – our clickhouse-backup-agent (ch-backup-agent)
    – ftpd_server (stilliard/pure-ftpd)
  • ./clickhouse/clickhouse-backup-config.yml – clickhouse-backup config file
  • ./agent/Dockerfile – backup agent’s docker image
  • ./agent/ch-backup-logrotate.conf – logrotate config file
  • ./agent/clickhouse-backup.sh – script to define backup and upload steps
  • ./agent/cronfile – cron job backup and logrotate tasks
  • .github/workflows/docker-image.yml – simple GitHub Action to build the agent’s Docker image on every Dockerfile change

Possible use cases

  • As a resource for learning Docker, Docker Compose, Bash, cron, and logrotate.
  • As a source of scripts, cron job tasks, or Docker files.
    Just grab them and you are all set.
  • As a sample of pairing ClickHouse-Backup and ClickHouse Server.

How to use

  • check out logrotate and cron settings in the agent folder
  • verify the Dockerfile in the agent folder (if docker is being used)
  • adjust the clickhouse-backup settings if necessary (./clickhouse/clickhouse-backup-config.yml); at a minimum, change the credentials, ClickHouse host, and remote storage
  • the clickhouse-backup API container or standalone service should have access to the /var/clickhouse/ folders to create backups successfully. In the case of a container, see docker-compose.yml. If your clickhouse-backup API is a Linux service, run the service on the first replica of each shard, and then update the cronfile accordingly.
  • copy cron and script files to a remote host, and then make a test run
  • if you are using Docker, check the docker-compose.yml file and remove any unnecessary services (such as clickhouse and ftp). Afterward, run docker-compose up -d --build to start the containers
  • use docker logs or docker compose logs to check service logs.
    Log files are also located under the /var/log/clickhouse-backup/ folder

Since the ch-backup-agent image already comes with the clickhouse-backup application pre-installed, you have the option to run both the API and scripts within the agent’s container. To achieve this, you should create a volume mapping for the ./clickhouse/clickhouse-backup-config.yml file and then start the clickhouse-backup API server as a service.
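A hedged compose excerpt of that setup (the service name and mount target are assumptions; /etc/clickhouse-backup/config.yml is the tool’s default config path):

```yaml
services:
  ch-backup-agent:
    volumes:
      # map the shared clickhouse-backup config into the agent's container
      - ./clickhouse/clickhouse-backup-config.yml:/etc/clickhouse-backup/config.yml
    # start the REST API in the background, then keep cron in the foreground
    command: sh -c "clickhouse-backup server & exec crond -f"
```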

Some excerpts

Function to create full backup:

function create_full_backup {

    # CREATE
    curl -u "$API_USERNAME:$API_PASSWORD" -s -X POST \
          http://"${SERVER_ADDRESS}":7171/backup/create?name="${FULL_BACKUP_NAME}" | jq . >> "$BACKUP_LOG"
    BACKUP_NAME="${FULL_BACKUP_NAME}"
    check_backup_status

    # UPLOAD
    check_upload_status
}

Function to check upload status:

function check_upload_status {

        # Catch In Progress status
        while [ "$(check_upload_req | jq -r .status )" == "in progress" ]; do
            echo -e "\n${GREEN}[INFO]${OFF} ${BOLD} Upload of the ${BACKUP_NAME} is still in progress...${OFF}"
            sleep 1m
        done

        # Catch Error status
        if [ "$(check_upload_req | jq -r .status )" == "error" ]; then
            echo -e "${RED}::::[ERROR]${OFF} ${BOLD} Couldn't upload the backup ${BACKUP_NAME}:${OFF}"
            {
            printf '\n%s\n' "UPLOAD ERROR:"
            check_upload_req | jq -r .error
            printf '%s\n' "-------------"
            } | tee -a "$BACKUP_LOG"
            return 1

        # Catch Success status
        elif [ "$(check_upload_req | jq -r .status)" == "success" ]; then
            echo -e "\n${GREEN}[INFO]${OFF} ${BOLD} The ${BACKUP_NAME} is now the last since it's just been uploaded successfully${OFF}"
            touch "${LAST_BACKUP_NAME}"
            echo -n "${BACKUP_NAME}" > "${LAST_BACKUP_NAME}"
        fi
}
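For DIFF backups, the upload request additionally references the previous backup so the increment can be calculated server-side. A minimal sketch of how such an upload URL could be formed (the diff-from-remote parameter name is an assumption; check your clickhouse-backup version’s API docs):

```shell
#!/bin/sh
# Build the upload URL, appending diff-from-remote only when a previous backup exists.
build_upload_url() {
    server="$1"; name="$2"; prev="$3"
    if [ -n "$prev" ]; then
        echo "http://${server}:7171/backup/upload/${name}?diff-from-remote=${prev}"
    else
        echo "http://${server}:7171/backup/upload/${name}"
    fi
}

build_upload_url clickhouse-backup diff-backup full-backup
```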

Useful commands

While I have tested every command below, I strongly encourage you to learn about clickhouse-backup and its FAQ before proceeding with any restore operations. Your setup may have unique configurations and requirements!

Create a FULL backup manually

/backup/config/clickhouse-backup.sh create_full_backup clickhouse-backup-API-address APIUSER APIPASSWORD

In case you have a cluster, repeat the command with a different API address for each node

Create a DIFF backup manually

/backup/config/clickhouse-backup.sh create_diff_backup clickhouse-backup-API-address APIUSER APIPASSWORD

In case you have a cluster, repeat the command with a different API address for each node

Delete ALL REMOTE backups one-liner (if you have an old ClickHouse version):

for rembackup in $(clickhouse-backup list remote | tail -n +2 | awk '{print $1}'); do 
     clickhouse-backup delete remote "$rembackup"
done

Delete ALL REMOTE backups one-liner:

for rembackup in $(clickhouse-backup list remote | awk '{print $1}'); do
    clickhouse-backup delete remote "$rembackup"
done

Delete ALL LOCAL backups one-liner:

for localbackup in $(clickhouse-backup list local | awk '{print $1}'); do
    clickhouse-backup delete local "$localbackup"
done

Add tail -n +2 if you are running an older ClickHouse version
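The two variants differ only in the tail -n +2 header skip. A dry run with canned list output shows what the loop actually feeds to delete (the backup names are made up):

```shell
#!/bin/sh
# Mimic `clickhouse-backup list remote` output; older versions print a header line.
fake_list() {
    printf '%s\n' 'NAME   CREATED      SIZE' \
                  'full-1 01/01/2024  1.00GiB' \
                  'diff-1 01/01/2024 10.00MiB'
}

# Older versions: strip the header with tail -n +2 before extracting column 1
for rembackup in $(fake_list | tail -n +2 | awk '{print $1}'); do
    echo "clickhouse-backup delete remote $rembackup"
done
```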

Restore a single table’s data:

clickhouse-backup restore <backup name> --data --table db.table_name_shard

If you’re getting “warn <table name> skipped cause system.replicas entry already exists and replication in progress from another replica logger=clickhouse”, try running the previous command with CLICKHOUSE_CHECK_REPLICAS_BEFORE_ATTACH=0

If you need to restore the schema, use the --schema parameter together with --rm to drop existing schema objects (clickhouse-backup restore --rm <backup name> --schema), or delete the replica metadata in ZooKeeper beforehand.

Result

ClickHouse Backup agent, first run. Noticed the warning? It’s not critical, just FYI.
ClickHouse Backup agent, errors and warnings. A sample output with error, warning, and info messages.
ClickHouse Backup agent, log files. This is how the log file appears: errors and the requests’ output, nothing else.
ClickHouse Backup agent, DIFF backup create action. The DIFF backup was created successfully, so the last backup name was updated as well.

Playing with Kubernetes running on Azure (AKS)

Heptio (its founders co-created Kubernetes) polled about 400 IT decision makers from different sectors and company sizes to find out whether they use Kubernetes, understand obstacles, and gauge overall experience. About 60% of respondents use Kubernetes today, and 65% expect to be using the technology in the next year. More surprisingly, about 77% of companies with more than 1,000 developers that run Kubernetes use it in production.

Furthermore, VMware has recently announced Project Pacific, which completely rebuilds the vSphere architecture. Kubernetes is going to be a control plane in future vSphere versions. Sounds amazing, doesn’t it?

The supervisor cluster is a Kubernetes cluster that uses ESXi hosts instead of Linux nodes

I hope you have warmed up and taken an interest in something you may not have been familiar with: containers and Kubernetes. I recommend reading about Kubernetes and Docker concepts before we get started.

We’re going to do some tasks in Azure Kubernetes Service (the managed Kubernetes service in Azure) to help you dive into Kubernetes and get hands-on experience with related services such as Container Registry, AKS, Terraform, and Docker.

Notes:
  • This GitHub repo includes everything covered in this blog post
  • This lab uses a simple custom ASP.NET Core web application that we will deploy and then publish by using Kubernetes (K8s)
  • A Docker multi-stage image build packs up the application
  • Azure Container Registry stores the Docker image
  • Terraform automates the deployment of Azure Kubernetes Service and Azure Container Registry. The scripts are stored in a different repo
  • Azure Kubernetes Service provides a managed Kubernetes master node in the cloud with the ability to scale up worker nodes. AKS will host our Pods (roughly speaking, a Pod represents one or more containers/processes running on a K8s cluster)
  • Azure CLI, PowerShell, Docker, Terraform, and kubectl (the command-line interface for running commands against Kubernetes clusters) are the main tools for completing the tasks. Make sure you have them installed on your machine, or use Azure Cloud Shell instead.
  • Create a Service Principal beforehand (Contributor role)
  • Azure DevOps is used for CI/CD (optional)

Deploy Kubernetes cluster in Azure

Although we can use the Azure Portal, CLI, or PowerShell to deploy new Azure services, the Infrastructure as Code (IaC) approach is more sustainable. We will use Terraform, so check out my repo and the comments inside. Terraform creates the Azure Container Registry:

#Get RG (create a new one if necessary by using "resource azurerm..")
data "azurerm_resource_group" "Rg" {
  name = "kubRg"
}

............

#Create a container registry
resource "azurerm_container_registry" "cr" {
  name                = "cr${random_string.randomName.result}"
  resource_group_name = "${data.azurerm_resource_group.Rg.name}"
  location            = "${data.azurerm_resource_group.Rg.location}"
  admin_enabled       = true
  sku                 = "Basic"
  # Only for classic SKU (deprecated)
  # storage_account_id  = "${azurerm_storage_account.storacc.id}" (Classic)
}
..............

The configuration also sets up the Kubernetes cluster with the Azure Container Network Interface (CNI), which lets you access Pods directly because every Pod gets an IP from an Azure subnet rather than using kubenet. At the end of the configuration file, Terraform enables K8s RBAC (disabled by default in Azure), which we’ll use later when creating service accounts.

resource "azurerm_kubernetes_cluster" "k8sClu" {
  name                = "rlk8sclu-${random_string.randomName.result}"
  location            = "${data.azurerm_resource_group.Rg.location}"
  resource_group_name = "${data.azurerm_resource_group.Rg.name}"
  dns_prefix          = "${var.dnsPrefix}"

  .......

  network_profile {
    network_plugin = "azure"
  }

  role_based_access_control {
    enabled = true
  }
  .........

Apply the configuration, and then check the output (in my case, the resources had already been deployed, so there was nothing to add). Note the ACR and AKS resource names (/managedClusters/…; registries/…).

For an additional deployment check, open up Azure Cloud Shell or the Azure CLI and type the following to open the Kubernetes dashboard:

#Get Azure AKS Credentials
az login
az aks get-credentials --resource-group kubRg --name rlk8sclu-l3y5

#Open K8S dashboard
kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard
az aks browse --resource-group kubRg --name rlk8sclu-l3y5

#View kubectl config (optional, %HOMEPATH%/.kube/config)
kubectl config view

If your browser opens a new page, there are most likely no issues with the deployment. Let’s jump into the second task.

TIP: get addresses of the master and services by running kubectl cluster-info

Make Docker image with the application

Let’s create a Docker image with the application, and then push the image to the Azure Container Registry. The Dockerfile is located at the root of the aspnetapp folder (check out my repo) and describes the multi-stage image build process. There is also a .dockerignore file to define folders that must be excluded from the image.

Run the Docker CLI and build the image (docker build -t aspnetapp <dir>).

Push the image to Azure Container Registry:

az acr login --name acrName
docker login acrFQDN
docker tag aspnetapp acrFQDN/aspnetapp
docker push acrFQDN/aspnetapp

TIP: get the attributes of the image by running az acr repository show -n crl3y5 --image aspnetapp:latest

Make two fully isolated namespaces within the Kubernetes cluster

Once the Docker image is uploaded to ACR, we are ready to proceed with the Kubernetes tasks. When you need to change something in K8s, you can either use kubectl to define operations sequentially or use manifest (YAML) files that describe multiple requests to the K8s API server in declarative form.

If you look at my repo, you can see two folders, ns1 and ns2, that store the YAML manifest files for the respective namespaces. We’ll use those files in conjunction with kubectl to make changes on the AKS cluster. Because the manifest files are almost identical, only the manifests for NS1 are shown.

#Apply the manifest (example)
kubectl apply -f pathToTheManifestFile

Create a new namespace:

#Create a namespace 1
apiVersion: v1
kind: Namespace
metadata:
  name: ns1
  labels:
    author: rlevchenko

To deny ingress traffic from Pods running in other namespaces:

#NS1
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: deny-from-other-namespaces
  namespace: ns1
spec:
  podSelector:
    matchLabels: {}
  ingress:
    - from:
        - podSelector: {}

TIP: use kubectl get namespaces to list namespaces and kubectl get networkpolicy -n ns1 to get the policy

Configure anti-affinity for PODs

To make sure that a group of Pods (the labelSelector section) in the cluster runs on particular nodes, we need to configure affinity/anti-affinity rules. This anti-affinity “rule” ensures that Pods with the app=aspcore label do not co-locate on a single node.

.....
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - aspcore
              topologyKey: "kubernetes.io/hostname"
......

TIP: use kubectl get pods -o wide -n ns1 to get info about assigned nodes, and then kubectl get nodes --show-labels to check node labels

Configure network policy to deny egress traffic from PODs (except DNS requests)

This task shows how you can filter network traffic from Pods in a namespace. All Pods with the app=aspcore label in the first namespace can only make DNS requests (outbound); everything else is denied.

#Deny all traffic (except DNS) from PODs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: ns1
spec:
  podSelector:
    matchLabels:
      app: aspcore
  policyTypes:
    - Egress
  egress:
    # allow DNS TCP/UDP 53 ports
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    - to:
        - namespaceSelector: {}

TIP: get list of the network policies kubectl get networkpolicy -n ns1

Create a service account with read permission on PODs in the first namespace

A service account provides an identity for processes that run in a Pod. This excerpt of the manifest file describes a service account read-sa-ns that has read-only permissions on Pods in the NS1 namespace (see the rules section/verbs). Also note that this relies on RBAC, which we enabled while applying the Terraform configuration.

#New SA - ns level
apiVersion: v1
kind: ServiceAccount
metadata:
  name: read-sa-ns
---
#New Role - ns level
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-only-ns
  namespace: ns1
rules:
  - apiGroups: ["", "extensions", "apps"]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
#Binding the role to the sa -NS1
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-ns1-pods
  namespace: ns1
subjects:
  - kind: ServiceAccount
    name: read-sa-ns
    apiGroup: ""
    namespace: default
roleRef:
  kind: Role
  name: read-only-ns
  apiGroup: rbac.authorization.k8s.io

TIP: get roles in the NS1 namespace with kubectl get role -n ns1, and then check the service accounts in the K8s cluster with kubectl get serviceaccounts --all-namespaces

Set CPU and RAM limits for each pod

If a container is created in the ns1 namespace and does not specify its own values for memory request and memory limit, the container is given a default memory request of 128 MiB and a default memory limit of 256 MiB. In addition, you can define limits at the Pod level.

#Define mem-cpu limits
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-cpu-limit-range
  namespace: ns1
spec:
  limits:
  - default:
      memory: 256Mi
      cpu: "400m"
    defaultRequest:
      memory: 128Mi
      cpu: "200m"
    type: Container

TIP: check the limits by running kubectl describe pod podname -n ns1

Configure PODs scalability based on CPU metric

Kubernetes allows you to automatically scale Pods based on CPU/RAM metrics (horizontal pod autoscaler). If average CPU utilization is equal to or greater than 70%, K8s deploys additional replicas (see the spec stanza, maxReplicas).

#Scale pods automatically (cpu metric)
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: aspcore-load-cpu
  namespace: ns1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: asp-deployment1
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 70

TIP: to list the NS limits kubectl describe namespace ns1
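Note that the autoscaling/v2beta1 API used above has since been removed from Kubernetes; on current clusters the same autoscaler would be expressed against the autoscaling/v2 API, roughly like this (a sketch; verify the field names against your cluster version):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aspcore-load-cpu
  namespace: ns1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: asp-deployment1
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```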

Publish the application

Now it’s time to publish the application running in a container. First, create a Pod that uses the Docker image we’ve already pushed to the Azure Container Registry. A single Pod with the latest image will be created in the ns1 namespace. Check the labels (one of the most important things in K8s 🙂), the pod name, and the number of replicas.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: asp-deployment1
  namespace: ns1
  labels:
    app: web
    release: stable
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aspcore
  template:
    metadata:
      labels:
        app: aspcore
    spec:
      containers:
        - name: web-aspcore
          image: crl3y5.azurecr.io/aspnetapp:latest

TIP: use kubectl get pods -n ns1 -o wide to check the pod state in the ns1

If the Pod’s status is Running, you can publish it via a LoadBalancer service:

#Publish the deployment through the Service

apiVersion: v1
kind: Service
metadata:
  name: demo-service1
  namespace: ns1
spec:
  selector:
    app: aspcore
  type: LoadBalancer
  ports:
    - name: name-of-the-port
      port: 80
      targetPort: 80

Then check the deployment status, and get public ip of the service:

#Get deployments in the NS1
kubectl get deployments -n ns1

#Get Service's Public IP
kubectl get service -n ns1 -o jsonpath='{.items[].status.loadBalancer.ingress[0].ip}'

Open up the browser and navigate to http://publicip/api/values to verify that the application is published and works.

What’s next?

Complete the following homework tasks to boost your skills:

  • Make a test POD with a static volume (use Azure storage)
  • Make sure that PODs are running under non-root account
  • Create a service account with read permission on all PODs in the cluster
  • Add context details about service accounts to your configuration file (kubeconfig), and then verify service accounts permissions
  • Configure PODs scalability based on RAM or network metrics
  • Check yourself – the answers are at my repo
  • Create CI/CD Azure DevOps pipelines to automate docker image build and environment deployment (use terraform/kubectl/docker)

Thanks for reading, stars and commits!