switch-fix.nix

May 6, 2025

tl;dr: switch-fix.nix lets you arm an automatic rollback to the current generation / profile before a nixos-rebuild [switch | boot]. Unless you run cancel-rollback from a terminal within a set amount of time, the machine rolls back.

Have you ever been messing with the network config of a nixos install on a machine you don't have physical access to, and started sweating bullets wondering whether it was going to come back up after a nixos-rebuild? (There's another blog post in here about why bare-metal at a colo datacenter is way better than a cloud provider, hopefully I'll get to that in the future.) (Also up ahead should be how I do remote bare-metal installs with a custom iso via tailscale / headscale and safe ACLs.) This comes up when I manage an on-prem bare-metal install for a client, particularly one that has some more involved networking, for example a vm host with bridged networking for vm guests.

switch-fix.nix is a set of commands to handle this situation. You run set-rollback before you do your nixos-rebuild switch or nixos-rebuild boot (probably the latter, I like to make sure it comes back up after a full reboot). This stores a link to the target of /run/current-system in a rollback directory. A systemd service checks for this rollback directory and link on sysinit.target (instead of multi-user.target, because maybe it never gets there with whatever garbage I made of the new config). If the link is found, a configurable timer starts counting down. Unless cancel-rollback is called before the timer is up, the systemd service sets the stored profile via switch-to-configuration boot and restarts the system.
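To make the mechanism concrete, here's a minimal Python sketch of the same logic. This is not the module's actual code. The marker path, the function names, and the polling loop are all assumptions for illustration, and the reboot command in particular is a guess at what "restarts the system" looks like.

```python
import os
import time
from pathlib import Path

# Hypothetical location of the stored link; the real module may use another path.
MARKER = Path("/var/lib/switch-fix/rollback/system")


def set_rollback(marker: Path = MARKER,
                 current_system: Path = Path("/run/current-system")) -> Path:
    """Remember the generation that is running right now."""
    target = Path(os.path.realpath(current_system))  # resolve the symlink
    marker.parent.mkdir(parents=True, exist_ok=True)
    if os.path.lexists(marker):
        marker.unlink()
    marker.symlink_to(target)
    return target


def cancel_rollback(marker: Path = MARKER) -> None:
    """Disarm the rollback by removing the stored link."""
    if os.path.lexists(marker):
        marker.unlink()


def wait_for_cancel(marker: Path, timeout_s: float, poll_s: float = 1.0) -> str:
    """Return "none" if nothing is armed, "cancelled" if the link
    disappears before the deadline, and "rollback" otherwise."""
    if not os.path.lexists(marker):
        return "none"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not os.path.lexists(marker):
            return "cancelled"
        time.sleep(poll_s)
    return "rollback" if os.path.lexists(marker) else "cancelled"


def rollback_commands(marker: Path = MARKER) -> list:
    """Commands a service could run once the timer expires (assumed)."""
    stored = Path(os.path.realpath(marker))
    return [
        [str(stored / "bin" / "switch-to-configuration"), "boot"],
        ["systemctl", "reboot"],  # assumption: any reboot mechanism would do
    ]
```

The important property is that doing nothing is the safe path: if the new config breaks networking and I can't get in to cancel, the timer runs out and the old generation comes back.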

The way I usually use it: I first nix build the target profile on my build machine, then nix copy the resulting closure to the target machine. I ssh into the target machine and run set-rollback, then run nixos-rebuild --target-host [target-machine] boot from the build machine. I manually trigger a reboot, then ping the target till it comes back up. Once it's up, I ssh [target-machine] cancel-rollback, then check to see if it succeeded or broke.
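Spelled out as a command sequence, that workflow looks roughly like the sketch below. It only builds and prints the commands rather than running them. The host name, the flake-style installable, the ./result path, and the way the reboot is triggered are assumptions for illustration; only set-rollback, cancel-rollback, and the nixos-rebuild invocation come from the steps above.

```python
from __future__ import annotations

import shlex


def deploy_commands(host: str, installable: str) -> list[list[str]]:
    """Ordered commands for one guarded deploy, run from the build machine."""
    return [
        # Build the target profile locally (the installable is an assumption).
        ["nix", "build", installable],
        # Copy the resulting closure to the target machine.
        ["nix", "copy", "--to", f"ssh://{host}", "./result"],
        # Arm the rollback BEFORE touching the boot configuration.
        ["ssh", host, "set-rollback"],
        ["nixos-rebuild", "--target-host", host, "boot"],
        # Assumption: any way of rebooting the target works here.
        ["ssh", host, "reboot"],
        # ...ping the host until it answers, then disarm the rollback.
        ["ssh", host, "cancel-rollback"],
    ]


if __name__ == "__main__":
    # "myhost" and the attribute path are placeholders, not real names.
    steps = deploy_commands(
        "myhost",
        ".#nixosConfigurations.myhost.config.system.build.toplevel",
    )
    for argv in steps:
        print(shlex.join(argv))
```

The one ordering constraint that matters is that set-rollback runs before nixos-rebuild, and cancel-rollback only runs once the machine has come back up and is reachable.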

If that sounds a bit too manual... yeah, probably. I haven't gotten around to bundling this into a proper program, and rely a bit on really good tmux / fzf integration plus ridiculous zsh command history.

Before this, I used deploy-rs. At the time, at least, I couldn't get it to do rollbacks on boot, only on switch, and sometimes the latter would still be wonky. (This is also why the module is named switch-fix. I also found that I had to do a manual nix-env call after a deploy to make sure things stuck; this is probably fixed by now.) Maybe that project has gotten more features and stability since then, so it's probably worth a look.

Looking at the code for the first time in some years, I realize there could be many improvements: maybe an actual systemd timer instead of sleep, and wall is sometimes flaky, etc. But it seems to get the job done. When I have time I'll do some more testing to make sure the functionality is all there, but this is something I use on a pretty regular basis.

Good luck!
