Provisioning Services

There is a large gap between folks who run and provision their own machines and folks who run Kubernetes and have GitOps at their disposal. In the latter case you can just put some YAML in Git and apply it in Kubernetes. Folks running their own machines on the other hand… are not so lucky.

Several “solutions” have existed for years under the umbrella term “provisioning systems”: CFEngine (one of the first), Puppet, Ansible and more. These work, but are (too?) powerful in that they allow actions and edits on the system being provisioned. This creates (hidden) problems when you’re asking yourself questions like: “Is the system’s state correct and in sync?”; “Can we easily roll back to a previously working version?”; “What content should file /path/to/file actually have?”.

Somehow this gap was never addressed, maybe because folks just flocked to Kubernetes-like environments? Anyway I need it addressed (I think), so I’m trying to outline what features and design it needs here.

A couple of properties I want for a provisioning system:

metrics, so rollouts/updates can be tracked;
diff detection, so we know state doesn’t reconcile;
out of band rollbacks;
no client side processing;
canarying.

That leaves a couple of things out of scope, such as the aforementioned performing of actions on a machine, which may mean this “new thing” is not 100% applicable for getting a machine up and running. OTOH using a customized OS image might be enough anyway.

Canarying is also not dealt with in this design doc, but may be shoehorned into it?

Various alternatives have been considered, but I’m not aware of any open source product with the features and simplicity described above.

Architecture

The main design can be summed up as “use Git everywhere”. The server side is a publicly accessible, (read only) Git repository where the (generated) files live in subdirectories for each server process. These files are used as-is; no further processing of them is allowed.

The client essentially does a Git pull of this repository every so often and reloads (if needed) the server process.

On the client 3 operations are available per configured server process:

forward: roll over to the new release;
freeze: keep using the current version;
rollback: roll back to a previous version.

These are all implemented as git commands, except freeze which comes down to: “don’t run git pull”.
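As a rough sketch of how the three operations could map onto git invocations: the function name `git_commands`, its parameters, and the exact flags (`pull --ff-only`, `reset --hard`) are my assumptions, not something this design prescribes. The function only builds the command lines; it does not run them.

```python
from typing import List, Optional


def git_commands(operation: str, checkout: str, branch: str = "main",
                 target: Optional[str] = None) -> List[List[str]]:
    """Return the git command lines for one client operation (hypothetical mapping)."""
    git = ["git", "-C", checkout]
    if operation == "forward":
        # Roll over to the new release: fast-forward to the upstream branch.
        return [git + ["pull", "--ff-only", "origin", branch]]
    if operation == "freeze":
        # Keep using the current version: simply do not pull.
        return []
    if operation == "rollback":
        if target is None:
            raise ValueError("rollback needs a commit to roll back to")
        # Move the working tree back to a previous commit.
        return [git + ["reset", "--hard", target]]
    raise ValueError(f"unknown operation: {operation}")
```

A rolled-back checkout is simply behind its origin, so a later forward can fast-forward it again without extra bookkeeping.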

The client may also have a web interface that shows the current status of each server and allows for (remotely executed) rollbacks, but this is TBD; a nice *-ctl like tool to poke it would be nice as well. Finding all clients is implemented by giving this tool access to the Git repository where the config lives. Using this tool allows for out-of-band rollbacks.

The client exports the hash of its checkout, which can be matched against the current hash of the main repo. Exporting this allows tracking of the rollout and diff detection.
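To make the hash matching concrete, here is a minimal sketch. It assumes the upstream hash is taken from the output of `git ls-remote` (one `<hash><TAB><ref>` pair per line) and that the exported client hashes have already been collected into a dictionary; how they are collected is left open. The names `upstream_hash` and `out_of_sync` are hypothetical.

```python
from typing import Dict, List


def upstream_hash(ls_remote_output: str, branch: str = "main") -> str:
    """Pick the commit hash of a branch out of `git ls-remote` output."""
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        sha, _, ref = line.partition("\t")
        if ref == wanted:
            return sha
    raise LookupError(f"branch {branch!r} not found upstream")


def out_of_sync(exported: Dict[str, str], upstream: str) -> List[str]:
    """Return the clients whose exported hash differs from the upstream hash."""
    return sorted(host for host, sha in exported.items() if sha != upstream)
```

An empty result means the rollout has reached every client; a client that stays in the list is either frozen, rolled back, or failing to pull.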

The Server

There is no server, there is only a Git repository, backed by GitLab/GitHub, or just a repo.

The Client

The client performs a sparse checkout of the Git repository with only the subdirectories it needs. ~~There is no need to have a lot of history so a `--depth 3` is used.~~ Then it keeps this repository up to date. Each client will clone the repo with its full history, so every machine has the same history and rollbacks to previous commits work everywhere. The unbounded growth is concerning however.
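A sparse checkout with full history could be set up along these lines. This is a sketch under assumptions: it relies on `git sparse-checkout` (available in Git 2.25 and later), the function name and parameters are hypothetical, and it only returns the command lines rather than executing them.

```python
from typing import List, Sequence


def sparse_checkout_commands(upstream: str, checkout: str, branch: str,
                             paths: Sequence[str]) -> List[List[str]]:
    """Build the git commands for a sparse checkout that keeps full history."""
    git = ["git", "-C", checkout]
    return [
        # Clone everything (no --depth), but do not populate the working tree yet.
        ["git", "clone", "--no-checkout", "--branch", branch, upstream, checkout],
        # Restrict the working tree to the subdirectories this client needs.
        git + ["sparse-checkout", "set", *paths],
        # Now populate the working tree for the chosen branch.
        git + ["checkout", branch],
    ]
```

Because no `--depth` is passed, every client ends up with the same commits, which is what makes a rollback to an arbitrary earlier commit possible.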

Then for each server process a couple of bind mounts are set up that put the Git repository in the correct place so the server process can see the files. Performing bind mounts requires root access.

Upon changes an action can be performed; this action is currently limited to a systemd action.

On initial start a package can be installed as well.

Design

When sketching out a configuration design, we get to the following. It needs to specify where the Git repository lives, what package to install, the directories and mount points we need, and what action (if any) we want:

[[services]]
upstream = "https://github.com/miekg/blah-origin"
machine = "grafana.atoom.net"
branch = "main"
service = "grafana-server"
user = "grafana"
package = "grafana"
action = "restart"
mount = "/tmp/grafana"
dirs = [
    { local = "/etc/grafana", link = "grafana/etc" },
    { local = "/var/lib/grafana/dashboards", link = "grafana/dashboards" }
]

With this the client will:

install the package if not already there;
use grafana/etc and grafana/dashboards to do a sparse checkout of the Git repo (pointed to by upstream) into the checkout directory, under mount;
bind mount /etc/grafana and /var/lib/grafana/dashboards to the directories in the Git repo;
start a routine to keep the Git repo in sync with its origin.

So this means:

The repo is checked out in /tmp/grafana, so after the (sparse) checkout that looks like:

/tmp/grafana/grafana/etc
/tmp/grafana/grafana/dashboards

And then the following bind mounts are done:

/etc/grafana -> /tmp/grafana/grafana/etc
/var/lib/grafana/dashboards -> /tmp/grafana/grafana/dashboards
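The bind mounts follow mechanically from the mount and dirs settings in the configuration. A small sketch, with a hypothetical function name, that only builds the `mount --bind` command lines (running them needs root):

```python
import posixpath
from typing import Dict, List, Sequence


def bind_mount_commands(mount: str, dirs: Sequence[Dict[str, str]]) -> List[List[str]]:
    """Build one `mount --bind <source> <target>` command per configured directory."""
    commands = []
    for entry in dirs:
        # The source is the directory inside the checkout, the target is
        # the path where the server process expects its files.
        source = posixpath.join(mount, entry["link"])
        commands.append(["mount", "--bind", source, entry["local"]])
    return commands
```

With mount set to /tmp/grafana as in the configuration, the first command binds /tmp/grafana/grafana/etc onto /etc/grafana.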

If there are changes, the configured action is run, here systemctl restart grafana-server.
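One way to decide whether there are changes is to compare the checkout's commit hash before and after the pull. The sketch below assumes exactly that, and it assumes the set of permitted systemd actions is reload and restart; both the function name and that set are hypothetical.

```python
from typing import List, Optional

# Assumption: only these systemd verbs are accepted as an action.
ALLOWED_ACTIONS = {"reload", "restart"}


def action_command(old_hash: str, new_hash: str,
                   action: str, service: str) -> Optional[List[str]]:
    """Return the systemctl command to run, or None when nothing changed."""
    if not action or old_hash == new_hash:
        # No action configured, or the pull did not move the checkout.
        return None
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["systemctl", action, service]
```

Comparing whole-repo hashes means a commit that only touches another service's subdirectory also triggers the action; limiting this to the checked-out paths would be a refinement.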

Notes

The sparse checkout is an optimization detail and this design does not depend on it. How to bootstrap this is TBD.

Don’t know yet if indexing by hostname is sane, maybe also allow IP addresses to help bootstrap?

Things like /etc/passwd already do not fit the model here, because for that to work I need to bind mount /etc, which means all files in that directory need to exist in the remote Git repository. I.e. having it link to a (random?) subdirectory and mounting there should work. Does this need a config option? In Linux you can bind mount single files, so this might actually just work.