
The Great Nix Flake Check

let’s start with some context: I’m currently writing unflake, which is essentially a userspace reimplementation of nix flakes. to do that, I need some understanding of what flakes are: I want my implementation to be compatible with the upstream one. unfortunately, flakes don’t have a specification, are barely documented, and upstream doesn’t really know how they work either (which is a big part of the reason why they’re still unstable).

so, naturally, I downloaded every flake I could get my hands on to run some tests and find potential incompatibilities.

while I was primarily interested in incompatibilities in unflake, I also ran every test with both CppNix and Lix, so I found some differences there too.

this post is pretty long and is composed of several parts:

for reference, here’re the versions of software I tested:

for every test, I removed flake.lock and ran either nix flake lock or unflake to generate a new lockfile. then I tried to evaluate (via nix-instantiate --eval) some well-known outputs to see if anything broke.

results summary

first, to give a sense of scale: during the run, I ran tests on 7615 different flakes spread across 5380 repos, and evaluated 43697 flake outputs.

here’s a basic summary of test results:

a graph visualizing overall summary of test results.
native resolver on both CppNix and Lix has over 65% full successes,
while unflake has about 55%. most of the failed inputs are marked as “bad”,
while some are partial failures. all implementations have similar amounts of cases where
“some outputs failed”, but “all outputs failed” is noticeably higher for unflake.
there’re also some cases where there were no matching outputs, shown separately.
this is the smallest category at about just a couple percent.

note that there’re two kinds of failure: for some flakes evaluation failed at the top level, so the test runner wasn’t able to collect output names, while for other flakes only some of the outputs failed. as expected, CppNix’s native resolver was the most compatible with the flakes in the wild, at ~70% full successes (out of flakes with at least one tested output), with Lix’s native resolver being a close second at ~68% full successes. these numbers might seem low, but note that a) a lot of the flakes in the sample were test data of some kind, and b) flakes might depend on external resources which are not available anymore. some failures are also attributable to deficiencies in my testing setup, see below for more info on that.

unflake performed worse, with ~57–59% full successes. thankfully, most of the failures are attributable to specific missed features, discussed below. finding these missing features was the goal of this experiment, and in this regard it was wildly successful. fixing these is on my roadmap now, and I’ll probably do at least a partial test rerun after that.

it’s also interesting to compare how many inputs there were in locked flakes. the native resolver and unflake have different approaches to locking: the native resolver respects lockfiles found in dependency repos, while unflake automatically unifies every (transitive) dependency with the same specification. therefore, I expected to see less duplication in unflake locks. however,

a histogram of unflake dep count / native dep count. it shows a huge spike on 1, reaching above 70%, but otherwise the right side of the graph (where unflake loses) looks slightly higher than the left side, with 1.2 being the second highest value at about 8%

I found that for most repos there was on average more duplication with unflake. how come? the reason is that unflake doesn’t (yet) respect input overrides. many repos have different “versions” of nixpkgs somewhere in their dependency graph (e.g. nixos-unstable, nixpkgs-unstable, master and specific nixos-* versions). unflake faithfully considers each of these to be distinct, which creates duplication. there would be even more duplication with the native resolver, of course, had these inputs not been overridden via .inputs.nixpkgs.follows = "nixpkgs" or similar. after implementing this feature, unflake should be consistently better than the native resolver.
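for reference, the override pattern mentioned above looks like this in a flake.nix (the dependency name here is made up for illustration):

```nix
{
  inputs.nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
  inputs.some-dep.url = "github:example/some-dep"; # hypothetical dependency
  # force some-dep's transitive nixpkgs to resolve to our nixpkgs:
  inputs.some-dep.inputs.nixpkgs.follows = "nixpkgs";
}
```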

interestingly, the average number of dependencies was still higher for the native resolver (~5.3 vs ~5).

(note that unflake already provides dedup rules to deal with this issue, but they were not used in this test)

future work

there’re a few more things I’m planning to do with this project. first, of course, after fixing the problems I found in unflake, I’ll re-use this infrastructure to check if they were truly fixed. there’s also some more space for data analysis and improving the test suite here:

  1. the test suite, in principle, records locking and evaluation times. they’re not analyzed in this post, because that requires a “clean” run where 100% of requests hit cache. otherwise, tests that were run later have an advantage, because they have a higher cache hit rate.

  2. every produced lock file is recorded, which makes it possible to search for inconsistencies between CppNix and Lix. this should be totally possible with the data I have, I just didn’t get around to doing that.

  3. the text representation of every evaluated output is hashed, and the hash is recorded. at a cursory glance, this data seems useless, because all the hashes turned out to be different. this is somewhat weird, and changing what is hashed (or even recording raw output texts) might give more data for future analysis.

  4. nix-instantiate --eval (used to evaluate outputs) does non-strict evaluation by default. it might be worthwhile to try using --strict to possibly catch more failures. this might drastically slow things down though.

  5. I failed to achieve high CPU utilization even with good cache hit rates, and I’m not sure why. fixing that might make test runs much faster, which would allow for more iteration.

conclusions and call to action

while unflake’s results in these tests were less than ideal, I’m now even more sure that externally implementing flakes is feasible. the causes for most incompatibilities are now known, and fixing them doesn’t seem impossible.

there’s one thing that hinders unflake and native implementations alike, though: no one knows what flakes are. documentation is lacking, and knowledge either spreads by word of mouth or is obtained by reading the source code. if you’re interested, you can read the detailed compatibility analysis below, and maybe learn about some fun flakes features you haven’t seen before (I sure did!).

I propose, to anyone who implements flakes (but especially CppNix team):

  1. adopt an infrastructure for similar large-scale test runs. in the absence of documentation, real-world usage is our only gauge of what flakes are. if we are to stabilize flakes, this seems to be a useful tool to define what we are stabilizing. Lix has a project to run its parser on various open-source Nix code, which is a great start, but we need to test more.

  2. pause implementing new features in flakes until we document existing ones. a feature freeze will give us time to write documentation, consider design decisions, and run tests.

  3. write a specification for flakes and associated APIs (like flakerefs and fetchTree). I’m ready to work on this project with all the knowledge I have accumulated.

I strongly believe that having a fixed, documented understanding of what flakes are will allow us both to stabilize native implementations and to write new ones, with different features, APIs and tradeoffs. on the other hand, if we don’t take action, the definition of flakes will continue to drift. that is a problem even if we only consider a single implementation: without a specification, different versions of the same implementation will become less and less compatible with each other. this is already happening: there’re flake.nix files in the wild that will only work on a subset of CppNix versions.

with a common understanding of APIs and guarantees, we’ll be free to innovate on the implementations. flakes provide a useful dependency specification format, but we don’t have to carry all the associated weirdness forever.

Carthago delenda est.

compatibility issues

(note that a lot of the numbers in these sections are calculated semi-manually, so there might be minor inaccuracies. that shouldn’t affect the overall picture though)

as my primary goal was to find incompatibilities between various flake implementations, let’s talk about that first. turns out, there’re a lot of weird and cursed things people do with flakes.

where unflake fails

this was the original goal for this research. as unflake’s dependency injection code is fully independent of the upstream code, it was to be expected that my implementation would miss some features. it is a goal for unflake to be fully compatible with all the real-world flakes, but we’re not quite there yet.

missing attributes

this was, by far, the largest problem, with 1131 tests failing because of it. when flakes are passed as inputs to other flakes (or to themselves as self), they have some additional attributes set. some of them are documented, but in addition to that there’re also inputs, outputs, sourceInfo, and _type = "flake". the version of unflake used in this run didn’t support inputs, outputs, _type and lastModifiedDate (but the overwhelming majority of failures were because of inputs). it also doesn’t set sourceInfo for the root package, because it didn’t actually fetch it.
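to illustrate, here’s roughly the shape a flake input takes when passed to outputs — the attribute values below are invented for illustration, and the exact set varies by input type:

```nix
{
  _type = "flake";
  inputs = { /* the input's own locked inputs */ };
  outputs = { /* whatever the input's outputs function returned */ };
  sourceInfo = {
    lastModified = 1700000000;      # illustrative value
    lastModifiedDate = "20231114221320";
    narHash = "sha256-...";
  };
  outPath = "/nix/store/...-source";
  # plus the flake's own outputs, merged in at the top level
}
```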

interestingly, there’re some libraries that actually check for _type = "flake": notably, flake-parts does so. I am not really sure what the reason for that is; the commit message only mentions that making this not an error would “cause problems”.

this problem is now largely fixed, with the one remaining missing feature being the sourceInfo and outPath attributes on self.

relative path inputs

multiple flakes located in the same repo can use relative paths to refer to each other, like path:../... unflake doesn’t currently support that (when I was first writing unflake, I was unaware that this was even an option). it was quite a common problem, with 106 total test failures, so fixing it is a priority (tracked in #54).
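as an example, with two flakes living at a/flake.nix and b/flake.nix in the same repo (a hypothetical layout), a could declare:

```nix
{
  # in a/flake.nix: refer to the sibling flake by relative path
  inputs.b.url = "path:../b";
}
```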

input overrides

this was a pretty stupid bug. unflake doesn’t currently support input overrides (the thing where you write inputs.foo.inputs.bar = ...). it is planned, but since unflake does dependency unification anyway, it shouldn’t have been a huge problem. unfortunately, in code like

{
  type = "github";
  owner = "meow";
  repo = "meow";
  inputs.nixpkgs.follows = "nixpkgs";
}

inputs was considered to be a part of the flake reference, and the reference was rejected as invalid. this caused 93 failed tests. this bug is now fixed.

file or tarball?

pop quiz: for a string-style flakeref https://example.com/meow, what is its type? a) file; b) tarball; c) impossible to know and a sin to ask.

turns out flakeref parsing for http urls & friends depends on whether the input is a flake or not. I cannot properly express how wild that is. not only is this, as far as I’m aware, totally undocumented, it also undermines flakerefs as a format by making them impossible to parse.

as far as I am aware, the full logic is as follows:

  1. if the input is a flake, the type will be tarball;
  2. else, if the url ends with a known archive extension, the type will be tarball;
  3. else the type will be file.
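in code terms, the decision looks roughly like this — a sketch reconstructed from the observed behavior above; the archive extension list here is illustrative, not the exact upstream one:

```python
# extensions treated as archives -- an illustrative subset, not the upstream list
ARCHIVE_EXTENSIONS = (".zip", ".tar", ".tgz", ".tar.gz", ".tar.xz", ".tar.bz2")

def http_flakeref_type(url: str, is_flake: bool) -> str:
    """Guess the input type for an http(s) string-style flakeref."""
    if is_flake:
        # flake inputs fetched over http(s) are always treated as tarballs
        return "tarball"
    if url.endswith(ARCHIVE_EXTENSIONS):
        return "tarball"
    return "file"
```

note how the same URL parses differently depending on a flag that isn’t part of the URL at all: http_flakeref_type("https://example.com/meow", True) gives "tarball", while the same URL with False gives "file".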

unflake uses builtins.parseFlakeRef for parsing flake refs, which always behaves as-if the input is a flake. this caused problems for 50 tests that used flake = false type inputs with this syntax. fixing this is tracked in #55.

follows empty

did you know you can specify

inputs.meow.follows = "";

and it will pass self instead of this input? I didn’t. apparently this trick is used to make overridable inputs without actually downloading any extra dependencies. 26 tests failed because of it.

implementing this is tracked in #56.

implicit inputs

there’s this feature in flakes that allows you to omit input specification for flakes that are in flake registry. you could do this:

{
  inputs = {};
  outputs = { nixpkgs, ... }: {};
}

or this:

flake-parts.flake = false;

or this:

nixpkgs = {};

(all real examples).

at the time of the run, unflake only supported the first of these patterns. I’d like to argue that allowing this is a misfeature, and public flakes shouldn’t rely on registries regardless, but, unfortunately, a lot of high-profile flakes seem to disagree. this led to 47 test failures, so I caved in and implemented it.

actually, while we’re at it, a bunch of repos had an implicit input for nixpkgs or some other popular flake. this is usually a mistake and can lead to local paths leaking into lockfiles or the resolver picking up system nixpkgs when it was meant to be locked. I consider it to be a deficiency of flakes, which is unfortunately unsolvable in unflake if we want to retain compatibility.

.outPath on inputs.self

while unflake diligently sets .outPath for every flake input, it doesn’t do so for the root project, as it doesn’t cause your code to be copied to the store. this is a relatively small problem, because you can mostly work around it in your own code: a project using unflake would just need to replace its own usage of inputs.self (or, equivalently, toString self, or "${self}") with ./.. that is, unless one of your dependencies manages your inputs for you, like numtide/blueprint or hercules-ci/flake-parts. such a dependency cannot find your source code using relative paths, so it has to rely on inputs.self.outPath. in total that caused 89 test failures, most of them originating from either blueprint or flake-parts.

supporting this is tracked in #58. in the meantime, you can still pass ./. as the root path explicitly as a workaround.

inputs with both ref and rev

did you know you can override ref or rev for registry inputs?

inputs.nixpkgs.url = "nixpkgs/bae1bd10c9c57b2cf517953ab70060a828ee6f";

unflake supports this, but there was a bug in the implementation: when the resulting input has type github (or gitlab, or sourcehut), we need to ensure that only one of ref and rev is set, otherwise the evaluator hits an assertion and crashes with SIGABRT. this caused 23 test failures.
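the fix amounts to a small normalization pass over the locked input attributes. a sketch, assuming the fix keeps the exact rev and drops the symbolic ref — that’s my reading of the requirement, not a quote of the actual code:

```python
# fetcher types where setting both ref and rev trips the evaluator assertion
EXCLUSIVE_REF_REV_TYPES = {"github", "gitlab", "sourcehut"}

def normalize_input(attrs: dict) -> dict:
    """Ensure at most one of ref/rev is set for fetchers that require it."""
    if attrs.get("type") in EXCLUSIVE_REF_REV_TYPES and "rev" in attrs and "ref" in attrs:
        # the rev pins an exact commit, so the symbolic ref is redundant
        attrs = {k: v for k, v in attrs.items() if k != "ref"}
    return attrs
```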

this bug was fixed in #59 and I sent a fix to Lix to produce a proper error instead of crashing. CppNix also has a fix now.

smaller stuff

where unflake (kinda) wins

that also happened a couple of times! in some cases unflake was able to resolve dependencies and produce a working lockfile while the native resolver (on both CppNix and Lix) wasn’t. none of these are really important though; they’re just funny accidents, included for completeness’ sake.

where Lix fails

fetcher busy

this technically only reproduces when using unflake, but it’s caused by a Lix bug, so I’m grouping it here. if you run a lot of fetchers concurrently, sometimes some of them fail to lock an SQLite database and crash with “fetcher busy”. I seem to remember this being fixed in CppNix, but it still affects Lix, which caused 76 test failures. this is a known issue, and a potential fix is on the way.

deprecated features

not really a bug. Lix deprecated and removed some features, so flakes that were using them failed. this caused 332 test failures for using URL literals and 1 failure for using CR line endings.

paths in flake input urls

in newer versions of CppNix you can use path literals in flake inputs, like so:

{
  inputs.foo.url = ./foo;
}

this was implemented after Lix was forked, so it’s not (yet?) available in Lix. this caused 6 test failures, all in a single repo.

I’ve created a Lix issue to track this.

weird directory paths

a couple repos had really weird subdirectory paths, with spaces and exclamation marks. Lix refused to lock them with an “invalid URL” error. 29 tests failed because of that, but it appears to be fixed in latest Lix.

OpenSSL error while fetching nixpkgs

39 tests failed with an OpenSSL error while trying to fetch nixpkgs via git (so git+https://github.com/nixos/nixpkgs, not github:nixos/nixpkgs). I wasn’t able to reproduce this, but every single failed test was running on Lix. I will try to hunt down this bug and report it if I can get it down to a reproducible form.

tarball query params

flake refs like https://example.com/?v=1 are valid in CppNix, but not in Lix. that’s because Lix always considers the ?v= part to be a flake input attribute, not a part of the URL. to express this input in Lix, you should write

inputs.foo = {
  type = "tarball";
  url = "https://example.com/?v=1";
};

honestly, I think Lix is right here, as it’s the only solution that allows adding more flake input attributes while preserving backwards compatibility. it also seems to be intentional. nevertheless, this caused 5 test failures.

attribute already defined

code like this:

{
  a = { b.c = 1; };
  a = { b.d = 2; };
}

fails in the version of Lix I used, but doesn’t fail on the latest version. this caused 2 test failures.

missing features

urlencoded ref

one repo used %2B in the name of a ref, relying on the fact that the Nix github fetcher works over HTTP. this is very cursed, and “broken” in Lix.

where CppNix fails

where my testing setup fails

methodology

what to check

unflake (and, really, flakes in general) consists of two parts: a dependency resolver that produces a lockfile and a runtime that does dependency injection.

to test the resolver we just ask it to produce a lockfile (by running unflake or nix flake lock). we explicitly remove flake.lock beforehand, so the resolver needs to start fresh.

testing the runtime is a bit more involved though: ideally, we’d like to build every flake output, but that would take forever. instead, we evaluate every output. actually, we can’t really evaluate every output, because we don’t know how to find them (flakes don’t really have a fixed output schema), so I evaluated some flakes from my test set to scrape a list of interesting outputs. in total, I tried to evaluate these attributes (each only for the x86_64-linux system):

note that the last two are not “standard” and nix flake show will complain about them. I consider it to be moderately funny that they’re more common than some of the standard ones, like templates.

the full list of outputs I found is attached, although note that I only collected this for a subset of all repositories.

downloading every flake

well, that seems simple: search github for path:flake.nix, sort the results by some criteria (star count? freshness?), take top-n and fetch them. right?

wrong! github provides no way to do so. the path: query is only available in code searches, but code searches provide no way to sort the results. additionally, no matter how you query, there’s a hard cap on the number of results you can actually see. (at the time of writing this, github only shows 140 results for path:flake.nix in the GUI; when I was first researching this, it showed thousands of results, but you could only actually access a few pages).

this leaves us with third-party indices:

testing environment

I used a hetzner CPX62 server to run the tests. each test (so each pair of (nix implementation, flakes implementation) for each flake) was executed in a separate podman container (although with a shared cache, see below).

I tried using a much beefier 32-core CCX53 server, but it didn’t do me much good. CPU utilization during the test was Not Great while load average was deep in the red.

a screenshot of grafana host exporter dashboard. it shows pressure being really low,
with maximum of 4.1% for irq pressure, CPU utilization at 36.9%, while system load is inexplicably 128.8%.
RAM usage is insignificant at 13.8%

ultimately, I’m still not sure what the bottleneck was. my theory is that it was the sheer number of processes running at the same time, or maybe the number of syscalls done in every process. I tried using scx_bpfland, which didn’t really do anything. if anyone has any other theories, please reach out, because I’m puzzled. I still have prometheus node exporter data if that helps.

caching

locking a flake makes an immense number of requests to github. it’s not unheard of to be rate-limited for trying to do this too much, and I was trying to lock a flake ~30k times. I tried providing an access token and ignoring the problem, but I was hitting “soft rate-limits” and my test jobs were timing out. if this were to work, I needed a cache.

I tried messing around with squid, but it was not really built for that and I kept getting cache misses. after a couple days, I gave up on trying to find a ready-made solution and wrote sona poki. it acts as a forward HTTPS proxy: every time it gets a request, it tries to serve it from the cache first. to do that, it also acts as a CA, issuing certs on the fly. thankfully, curl (which nix uses to make requests) supports HTTPS_PROXY env var, so it was pretty easy to configure.
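pointing a client at such a proxy is just environment variables; roughly like this (the values are illustrative, and which CA variable each tool honors — SSL_CERT_FILE vs NIX_SSL_CERT_FILE — can vary):

```shell
# route all HTTPS traffic through the caching proxy
export HTTPS_PROXY=http://127.0.0.1:1337
# trust the proxy's on-the-fly CA (path is illustrative)
export SSL_CERT_FILE="$PWD/scratch/ca.pem"
export NIX_SSL_CERT_FILE="$PWD/scratch/ca.pem"
```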

my implementation of HTTP was rather rudimentary. for instance, the cache key is (method, url, body), which completely ignores all the headers. this backfired once when a content-encoding: gzip version of the flake registry got cached (unflake was unprepared to handle that), but after replacing this particular cache entry, everything went more or less smoothly. it’s possible that there’re some flakes that broke because of that, but overall I think it’s an okay tradeoff.
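as a sketch, a header-ignoring cache key like the one described could be computed as follows (hypothetical code, not the actual sona poki implementation):

```python
import hashlib

def cache_key(method: str, url: str, body: bytes = b"") -> str:
    """Derive a cache key from (method, url, body), ignoring all headers."""
    h = hashlib.sha256()
    for part in (method.encode(), url.encode(), body):
        # length-prefix each part so ("GET", "a", b"b") != ("GET", "ab", b"")
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()
```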

in the end, the SQLite db serving as a cache was taking up 79 GiB of disk space and contained 20539 entries spread across 101 unique domains.

overall, the cache hit rate was an impressive 96%, and the cache was serving about 30 requests per minute on average.

running tests yourself

if you just want to analyze the data I’ve got from my run, ping me on some kind of social media (probably not IRC, I rarely log in there) or by email and I’ll send you the tarball.

all the code I used to run GNFC is available on Codeberg under the CC0 license. it is, unfortunately, a huge mess, as it’s mostly a bunch of single-use scripts I wrote to massage the output data. the process of analyzing the results was still largely manual, and I’ll write something better if I’m ever to do this again. feel free to reach out if you need any explanations.

that being said, you may consider check.sh to be the entrypoint here. it takes two arguments: the path to the repo (relative to $FLAKES_DIR, see below) and the path to the flake.nix file inside the repo. it then tries to run the four tests for this repo (one for every combination of implementations) in a container called gnfc-nix (I used the same container I use for unflake tests), while forwarding port 1337 inside for caching. you should run sona poki on this port on your host. it also expects to find the unflake.py used for testing in the root directory of the repo.

you’ll need to set three environment variables to run the tests:

you will also need a self-signed CA cert for caching. the scripts expect to find it in scratch/ca.pem in the root directory of the repo.

when the tests are done, the results will be available in your $OUT_DIR. each test gets its own directory at

$OUT_DIR/$repo/{nix,lix}/{nix,unflake}/$path/

where $path is the directory where flake.nix is located (or empty, if it was in the root of the repo).

inside this directory there’re the following files:

classify_errors.py provides a basic classifier for test results, using regular expressions on stderr. error_classes.json contains the regexes I used, and reviewed_paths.json is a list of paths that are assigned the category misc after a manual review.
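the classification itself can be sketched as a first-match-wins scan over the rules — an approximation of what classify_errors.py does, not its actual code:

```python
import re

def classify(stderr: str, error_classes: dict) -> str:
    """Return the first category whose regex matches stderr, else "unknown"."""
    for category, pattern in error_classes.items():
        if re.search(pattern, stderr):
            return category
    return "unknown"
```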

out_stats.sh uses this to provide Prometheus metrics on test results.

interactive_classificator.py is a useful script for populating error_classes.json: it picks some repo from scratch/unknown.tsv and lets you add a rule that classifies it, then applies that rule to the remaining unknown repos.