
Graceful Termination of Celery Workers

August 15, 2024


We were happily using celery for our distributed task needs when we realized that on new deployments our Docker containers were shut down with all the grace of a bull in a china shop. Tasks left unfinished, forcing time- and resource-intensive retries, inconsistencies in the database - it was like someone pulled the plug mid-task, and our poor celery workers never saw it coming.

In case you need a refresher: celery is a widely used distributed task queue for Python applications that lets you offload time-consuming work to background workers. And as it turns out, when scaling and managing celery workers in Docker Swarm or Docker Compose, it’s essential to handle container termination gracefully.

To address this issue, we need to ensure that celery worker containers in Docker Swarm or Docker Compose are terminated gracefully, allowing them to finish their tasks before shutting down.

Three Ways to Shut Down a Celery Worker Container Gracefully

We’ll get right to making sure our celery workers are shut down gracefully, and there are a few ways to do this actually.

1. Use Celery as the Main Executing Command

celery, by default, already handles the SIGTERM signal gracefully!

The thing is, we have to take care not to bury it under layers of shell scripts or supervisord - that’s like putting noise-canceling headphones on your workers: the SIGTERM never reaches the celery process. So one way is to make celery the main executing command of the container; the worker will then automatically finish its running tasks upon receiving the SIGTERM signal.

docker-compose.yaml:

services:
  celery-worker:
    image: myapp:latest
    command: ["celery", "--max-tasks-per-child=10", "--queues=my_important_queue", "-n", "important_queue@%n", "--concurrency=1"]

We tried this, but in our setup, due to the use of Docker Swarm, we couldn’t just stop the container - we had to scale the service down first, because if we simply stopped it our dear Swarm leader would just start it up again (restart: on-failure). So we had to go with a different approach. Let’s explore further.
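For reference, scaling a Swarm worker service down to zero looks roughly like this - the stack and service names are placeholders for whatever your deployment uses:

# Ask Swarm to run zero replicas of the worker service instead of stopping its container
docker service scale mystack_celery-worker=0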

2. Add Code to a Celery Task to Handle the Shutdown

As mentioned above, celery handles SIGTERM by default, so another option is to implement a custom shutdown handler that performs actions such as… waiting and not dying until the SIGKILL from the orchestrator comes in.

from celery.signals import worker_shutdown

@worker_shutdown.connect
def on_worker_shutdown(**kwargs):
    # Perform custom shutdown actions, such as resource cleanup or notifications
    pass
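If you wanted the worker to stick around until the orchestrator’s SIGKILL arrives, a minimal sketch could look like the following - the blocking sleep is an assumption on my part, and the 360 seconds simply mirror the sleep we use in the third approach below:

import time

from celery.signals import worker_shutdown

@worker_shutdown.connect
def wait_for_sigkill(**kwargs):
    # Assumption: by the time this fires the worker is already shutting down,
    # so we just linger until the orchestrator's SIGKILL arrives instead of
    # exiting early and getting restarted.
    time.sleep(360)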

I’m not the biggest pythonista, but I’m sure you can do some magic with this. You could even set up a centralized shutdown manager that listens to all workers and reports back how many tasks are already waiting for the SIGKILL.

Since I don’t have a code example for this, I’ll leave it to you to implement this in your setup.

As you may have guessed, we didn’t go with this approach either because we didn’t want to add that much complexity to our setup! But there’s one other way to handle this at least somewhat gracefully.

3. Use Signal Trapping

Since our container start script performs additional tasks and needs to stick around after celery exits, we couldn’t use exec. But we could use signal trapping: we capture the SIGTERM signal that Docker Swarm sends on a scale-down event and forward it to the celery child process. It’s like teaching your script to be a polite party host, making sure everyone knows when it’s time to leave.

For this to work, it’s important to configure Docker’s stop_grace_period in your docker-compose.yaml to give the worker container enough time to finish its tasks before being terminated (the default is 10s).

This is also why, after the celery worker has shut down gracefully, we sleep for 6 minutes to ensure that the container doesn’t exit before the SIGKILL comes in. Otherwise it would be restarted by the orchestrator.

start.sh:

#!/usr/bin/env bash

# How to handle the SIGTERM signal
function terminate_handler() {
    echo "Caught SIGTERM signal!"

    echo "Terminating main Celery process with PID $1"
    kill -TERM "$1"
    echo "Graceful shutdown initiated. Sleeping after process termination."
}

# Start the Celery worker in the background
celery worker --max-tasks-per-child=10 --queues=my_important_queue -n important_queue@%n --concurrency=1 &
# Store the PID of the celery worker
celery_pid=$!

# Register the terminate_handler function to be called on SIGTERM
trap "terminate_handler '${celery_pid}'" SIGTERM

# Wait for the celery process to exit
wait $celery_pid
# Store the exit code of the celery process
error_code=$?

# A return code of 143 means the process was terminated by a SIGTERM signal, this is A-OK
echo "celery worker exited with $error_code - sleeping for 360 seconds"
sleep 360

Caveat: We know that 95% of our (critical, inconsistency-generating) workers finish their tasks within 5 minutes, and we’re okay with an occasionally killed unfinished worker for now. So we set the grace period to 5 minutes. If you have long-running tasks, you might need to adjust this value.

docker-compose.yaml:

services:
  celery-worker:
    image: myapp:latest
    command: ["./start.sh"]
    stop_grace_period: 5m

Adaptability

These principles work for Docker Compose, Docker Swarm and equally well for Kubernetes. Because let’s face it, no matter where your containers live, they deserve a dignified exit strategy.
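If your containers live on Kubernetes, the pod-level terminationGracePeriodSeconds field plays the role of stop_grace_period. Here’s a minimal sketch, reusing the placeholder image and start script from the examples above:

apiVersion: v1
kind: Pod
metadata:
  name: celery-worker
spec:
  # Same role as stop_grace_period: 5m - the time between SIGTERM and SIGKILL
  terminationGracePeriodSeconds: 300
  containers:
    - name: celery-worker
      image: myapp:latest
      command: ["./start.sh"]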

Remember, a happy celery worker is a productive worker. And a gracefully terminated worker?

Well, that’s just good manners.
