Graceful Termination of Celery Workers
We were happily using Celery for our distributed task needs when we realized that on new deployments our Docker containers were shut down with all the grace of a bull in a china shop. Tasks were left unfinished, forcing (time and resource) intensive retries and leaving inconsistencies in the database - it was like someone pulled the plug mid-task, and our poor Celery workers never saw it coming.
In case you need a refresher, Celery is a widely used distributed task queue system for Python applications that lets you offload time-consuming tasks to background workers. And it turns out that when scaling and managing Celery workers in Docker Swarm or Docker Compose, it’s essential to handle container termination gracefully.
To address this issue, we need to ensure that Celery worker containers in Docker Swarm or Docker Compose are terminated gracefully, allowing them to finish their tasks before shutting down.
Three Ways to Shut Down a Celery Worker Container Gracefully
We’ll get right to making sure our Celery workers are shut down gracefully, and there are actually a few ways to do this.
1. Use Celery as the Main Executing Command
Celery, by default, already handles the SIGTERM signal gracefully! The thing is, we have to take care not to bury it under layers of shell scripts or supervisord - it’s like putting noise-canceling headphones on your workers. So one way is to make Celery the main executing command of the container; the worker will then automatically complete its running tasks upon receiving the SIGTERM signal.
docker-compose.yaml:
services:
  celery-worker:
    image: myapp:latest
    command: ["celery", "worker", "--max-tasks-per-child=10", "--queues=my_important_queue", "-n", "important_queue@%n", "--concurrency=1"]
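If you do need a small wrapper script (for environment setup, say), handing control over with exec keeps Celery as the container’s main process, so it still receives the SIGTERM directly. A minimal sketch, with a hypothetical entrypoint.sh and the same worker options as above:
#!/usr/bin/env bash
# hypothetical entrypoint.sh: do lightweight setup here, then replace the shell
# with the Celery worker via exec so SIGTERM reaches it directly as PID 1
exec celery worker --max-tasks-per-child=10 --queues=my_important_queue -n important_queue@%n --concurrency=1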
We tried this, but in our setup, due to the use of Docker Swarm, we couldn’t just stop the container - we had to scale it down first, because if we just stopped it, our dear Swarm leader would start it up again (restart: on-failure). So we had to go with a different approach. Let’s explore further.
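For reference, scaling the service down (instead of stopping the container) is what keeps the Swarm leader from spinning the worker right back up. A rough sketch, with a hypothetical service name:
# Scale the worker service to zero replicas instead of stopping the container,
# so the Swarm leader doesn't restart it ("mystack_celery-worker" is a
# placeholder - check `docker service ls` for your actual service name)
docker service scale mystack_celery-worker=0

# After the deployment, scale it back up
docker service scale mystack_celery-worker=1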
2. Add Code to a Celery Task to Handle the Shutdown
As said above, Celery handles SIGTERM by default, so another option is to implement a custom shutdown handler to perform actions, such as… waiting and not dying until the SIGKILL from the orchestrator comes in.
from celery.signals import worker_shutdown

@worker_shutdown.connect
def on_worker_shutdown(**kwargs):
    # Perform custom shutdown actions, such as resource cleanup or notifications
    pass
I’m not the biggest pythonista, but I’m sure you can do some magic with this. You could even set up a centralized shutdown manager that listens to all workers and reports back how many tasks are already waiting for the SIGKILL.
Since I don’t have a code example for this, I’ll leave it to you to implement this in your setup.
As you may have guessed, we didn’t go with this approach either because we didn’t want to add that much complexity to our setup! But there’s one other way to handle this at least somewhat gracefully.
3. Use Signal Trapping
Since our container start script performs additional tasks and needs to wait after Celery exits, we couldn’t use exec. But we could use signal trapping: we capture the SIGTERM signal originating from Docker Swarm on a scale-down event and forward it to the Celery child process. It’s like teaching your script to be a polite party host, making sure everyone knows when it’s time to leave.
Though for this it’s important to configure the Docker stop_grace_period in your docker-compose.yaml to give the worker container enough time to finish its tasks before being terminated (the default is 10s).
This is why, after gracefully shutting down the Celery worker, we sleep for 6 minutes: to ensure that the container doesn’t exit before the SIGKILL comes in. Otherwise it would be restarted by the orchestrator.
start.sh:
#!/usr/bin/env bash

# How to handle the SIGTERM signal: forward it to the main Celery process
function terminate_handler() {
    echo "Caught SIGTERM signal!"
    echo "Terminating main Celery process with PID $1"
    kill -TERM "$1"
    echo "Graceful shutdown initiated. Sleeping after process termination."
}

# Start the Celery worker in the background
celery worker --max-tasks-per-child=10 --queues=my_important_queue -n important_queue@%n --concurrency=1 &

# Store the PID of the Celery worker
celery_pid=$!

# Register the terminate_handler function to be called on SIGTERM
trap "terminate_handler '${celery_pid}'" SIGTERM

# Wait for the Celery process to exit; a trapped SIGTERM interrupts this wait
# while the worker finishes its running tasks in the background
wait "$celery_pid"

# Store the exit code reported by wait
error_code=$?

# A return code of 143 (128 + 15) means we were hit by a SIGTERM signal, this is A-OK
echo "celery worker exited with $error_code - sleeping for 360 seconds"

# Keep the container alive until the orchestrator's SIGKILL arrives
sleep 360
Caveat: We know that 95% of our (critical, inconsistency-generating) workers finish their tasks within 5 minutes, and we’re okay with an occasionally killed unfinished worker for now. So we set the grace period to 5 minutes. If you have long-running tasks, you might need to adjust this value.
docker-compose.yaml:
services:
  celery-worker:
    image: myapp:latest
    command: ["./start.sh"]
    stop_grace_period: 5m
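To sanity-check the whole setup, you can trigger a shutdown yourself and watch for the log lines from start.sh. A rough sketch using Docker Compose, which honors stop_grace_period before escalating to SIGKILL:
# Stop only the worker service; Compose waits up to stop_grace_period (5m here)
# before sending SIGKILL
docker compose stop celery-worker

# The worker's logs should show "Caught SIGTERM signal!" followed by the
# "Graceful shutdown initiated..." and sleep messages from start.sh
docker compose logs celery-worker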
Adaptability
These principles work for Docker Compose, Docker Swarm and equally well for Kubernetes. Because let’s face it, no matter where your containers live, they deserve a dignified exit strategy.
Remember, a happy Celery worker is a productive worker. And a gracefully terminated worker? Well, that’s just good manners.