Running Caddy in a distributed setup (i.e. on more than one machine) is a nightmare when combined with Let’s Encrypt ACME challenges. Caddy A starts a challenge, the validation request ends up at Caddy B, and this can ping-pong for days before any certificate is issued. Even then only a single instance gets it, so the other instance keeps trying for even longer. The more instances you have, the worse the problem gets.

See the discussions in https://github.com/psviderski/uncloud/issues/371 and https://github.com/psviderski/uncloud/issues/31#issuecomment-4484915485b.

This can be fixed inside Uncloud, but it may be more elegant to not change any code (directly) and instead use Caddy’s extensibility: launch some kind of (distributed) storage inside the cluster that Caddy then uses to store ACME challenges and certificates.
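To make the failure mode concrete, here is a toy model in Python of why shared storage helps. It is only a sketch: `CaddyInstance`, `validate`, the domain and the token are made-up names for illustration, the round-robin routing is an assumption about how validation requests get spread over the machines, and none of this is how Caddy is actually implemented.

```python
import itertools


class CaddyInstance:
    """Toy stand-in for one Caddy node (hypothetical, not Caddy's real code)."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # dict holding pending challenge tokens

    def start_challenge(self, domain, token):
        # The node that starts the order records the token it has to serve.
        self.storage[domain] = token

    def answer_challenge(self, domain):
        # A validation request may land on any node; it can only answer
        # if the token is present in the storage it can see.
        return self.storage.get(domain)


def validate(instances, domain, token, attempts=4):
    """Start a challenge on the first node, then round-robin the
    validation requests over all nodes (assumed routing)."""
    instances[0].start_challenge(domain, token)
    rr = itertools.cycle(instances)
    return [next(rr).answer_challenge(domain) == token for _ in range(attempts)]


# Local storage: every node has its own dict, so only node "a" knows the token.
local = [CaddyInstance("a", {}), CaddyInstance("b", {})]
local_results = validate(local, "example.com", "tok123")

# Shared storage (the role Redis plays in this post): one dict for all nodes.
shared_store = {}
shared = [CaddyInstance("a", shared_store), CaddyInstance("b", shared_store)]
shared_results = validate(shared, "example.com", "tok123")

print(local_results)   # every request that hits node "b" fails
print(shared_results)  # every node can answer
```

With per-node storage, every request routed to the node that did not start the challenge fails. With one shared store, it no longer matters where the request lands.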

… Hence, Caddy plus a Redis storage backend (I am not particularly enamoured with Redis, but if it works, it works). This means deploying Caddy and pointing it at Redis; in Uncloud’s case:

{
    storage redis {
        host unredis.internal
    }
}

This goes in the global options block of the Caddyfile, where unredis is the name of our deployment. The service definition is almost trivial:

services:
  unredis:
    image: redis:8.4-alpine
    command: redis-server --save 20 1 --loglevel warning
    volumes:
      - redis_data:/data

volumes:
  redis_data:

And the Caddy image needs to be built with that storage module enabled:

FROM caddy:2.11.3-builder AS builder
RUN xcaddy build \
    --with github.com/mholt/caddy-l4 \
    --with github.com/pberkel/caddy-storage-redis

FROM caddy:2.11.3
COPY --from=builder /usr/bin/caddy /usr/bin/caddy

Together this delivers almost instant issuance of TLS certificates, which beats waiting for days.