Merge master into nix-next#1717
Open
github-actions[bot] wants to merge 142 commits into
Open
Conversation
Update `harmonia`; deduplicate various cargo output hashes
Prepare for reuse by pulling process-management and polling logic out of `QueueRunnerBuildOne` into standalone modules: - `ProcessGroup.pm`: reusable labelled process group with `spawn`, `pump_logs`, and ordered teardown. - `wait_for_builds` in `QueueRunnerContext`: polling loop that checks the queue runner REST API until all build IDs are inactive. - `start_queue_runner` and `start_nix_daemon` in `QueueRunnerContext`: extracted from `QueueRunnerBuildOne` so future test helpers can start queue infrastructure without duplicating the setup. `QueueRunnerBuildOne` is rewritten to use these shared helpers. The queue-runner harness is still registered manually because `start_queue_runner` returns pre-bound sockets and a separate daemon harness that don't fit `ProcessGroup::spawn`'s simple model.
hydra-tests: Extract `ProcessGroup` and shared test helpers
Apply [semantic line breaks](https://sembr.org/) (one sentence per line) across all markdown files in the manual and repository. This makes diffs easier to review since changing one sentence no longer reflows an entire paragraph. Also add `CONTRIBUTING.md` documenting the convention, and clean up various pandoc artifacts along the way: escaped quotes (`\'`), escaped underscores (`\_`), broken multi-line list items, and non-breaking spaces.
Reformat markdown to use semantic line breaks
Forgot to do this when bumping harmonia. The commented stuff will return when we merge `nix-next`.
`buildproducts.path` can contain a sub-path below a store output (e.g. `/nix/store/…-nix-manual-2.31.4/share/doc/nix/manual`), but `OwnedBuildProduct::parse_paths` was calling `StoreDir::parse` which only accepts bare store paths. Any path with a `/` after the store path name failed with `invalid store path / symbol`, breaking the cached-build fast path for those builds. Move `RelativeStorePath` from the queue-runner into a new `store-path-utils` crate so the DB crate can use it directly. Change `OwnedBuildProduct`'s default type parameter from `StorePath` to `RelativeStorePath` and have `parse_path()` call `RelativeStorePath::from_path`, which correctly splits the full path into a `StorePath` base and a relative suffix.
When `create_step` encounters an existing step (`is_new=false`), it now re-checks whether the step's outputs have appeared in the local nix store since the step was first created. Without this check, builds whose outputs became available between poll cycles (e.g. via concurrent builds, substitution, or external upload) would get stuck in an infinite re-load loop: the DB says `finished=0`, the step already exists in memory so `create_build` inserts the build but never routes it to `handle_cached_build`. The fix adds two checks before returning `Valid(step)`: 1. If the step is already marked finished, return `None` immediately. 2. Query missing outputs from the local store; if all are present, mark the step finished, insert into `finished_drvs`, and return `None` so `create_build` routes to `handle_cached_build`.
Nix writes `Compression: bzip2` in narinfo files (see `src/libstore/nar-info.cc`), but hydra's parser only accepted `bz2` (the file extension, not the compression name). This caused errors like `invalid value for Compression: bzip2` when reading old narinfo entries from `cache.nixos.org`, making the queue runner treat those store paths as missing and rebuild them. Fix `FromStr` to accept `bzip2`, and fix `as_str` to write `bzip2` (matching nix). The file extension method `ext()` correctly uses `.nar.bz2` and is unchanged. Examples of affected narinfo files: - https://cache.nixos.org/2crhfrb8pqy61d32jaajpw33jjmny6xz.narinfo - https://cache.nixos.org/9fggf2bx5p29da89jrbmdh2fcd68m1mx.narinfo - https://cache.nixos.org/0xs59grdbgz7z2ci2ff4yfwb6jha1qaa.narinfo
The architecture doc was sitting outside `src/`, presumably because it was mostly notes and not yet cleaned up enough to be part of the user-facing docs. We will do that clean up next, but before we do that, move the file and expose it, so the subsequent diff is readable and not confused by the move.
- Organize components by deployment topology: master machine vs builder machines vs destination store, reflecting which components share a Nix store and which don't - Add `hydra-builder`, `hydra-notify`, `hydra-eval-jobset` — components that weren't documented before - Update `hydra-queue-runner` language from C++ to Rust - Add source layout and build system overview - Add database schema section with pointers to the SQL and migration docs - Update FAQ: both Nix and Hydra now have experimental dynamic derivations - Fix typos (`hydra-build-producs`, `obsolute`)
Add updated architecture section to the manual
Pick up latest changes from nix-community/harmonia.
Both `hydra-builder` and `hydra-queue-runner` had near-identical `build.rs` files that compiled the same `.proto` file and generated the same version constant. This consolidates that into a single `hydra-proto` crate with feature flags: - `client` — generates tonic client stubs (used by `hydra-builder`) - `server` — generates tonic server traits (used by `hydra-queue-runner`) - `db` — `From` impls bridging proto enums to `db::models` types Proto ↔ native type conversions are also consolidated here: - `ProtoStorePath` moved from `shared::proto`, now depends on `harmonia-store-core` directly rather than `nix-utils` - `Pressure`/`PressureState` — both components now use the generated proto types directly, eliminating duplicate structs and field-by-field conversion boilerplate - `PresignedUploadOpts`, `ConfigUpdate` — queue-runner now uses the proto types directly instead of maintaining trivial local wrappers - `BuildResultState` — restructured as a nested enum wrapping `hydra_proto::BuildResultState` with extra `Aborted`/`Cancelled` variants for queue-runner-internal states - `RelativeStorePath` — new proto message type making the store path vs sub-path split explicit on the wire; `From`/`TryFrom` conversions in `hydra-proto` bridge to `store_path_utils::RelativeStorePath` - `BuildProduct.path` changed from opaque `string` to structured `RelativeStorePath` message Also: - Groups all local workspace crates under `[workspace.dependencies]` with `workspace = true` references instead of relative paths - Updates the architecture doc with descriptions of each shared crate
Extract hydra-proto crate for shared `gRPC` code generation
Introduces `store.proto` (`nix.store.v1` package) containing types that are generic to any Nix store, not specific to Hydra: - `StorePath`, `StorePaths`, `RelativeStorePath` - `ValidPathInfo` — store path with NAR hash/size, references, deriver, CA - `NarInfo` — wraps `ValidPathInfo` with cache-specific fields (url, compression, file hash/size) `streaming.proto` (`runner.v1`) imports these and uses them for `BuildMessage.drv`, `BuildProduct.path`, `PresignedUploadComplete`, etc. `PresignedUploadComplete` is restructured to contain a `NarInfo` message instead of flattening all fields, making the layering explicit.
Split `.proto` into generic Nix store types and Hydra-specific types
The `binary-cache` crate had its own `NarInfo` struct that duplicated
what harmonia now provides as `harmonia_store_nar_info::NarInfo`. This
replaces it with the upstream type, which also brings in
`harmonia-store-path-info` for `ValidPathInfo`/`UnkeyedValidPathInfo`.
The local struct was flat; the harmonia type is nested:
`NarInfo { path, info: UnkeyedNarInfo { info: UnkeyedValidPathInfo, url, compression, ... } }`.
All construction and field access sites are updated accordingly.
`From<harmonia_store_nar_info::NarInfo>` and `TryFrom<proto::NarInfo>`
conversions are added in the proto crate so both components can convert
between native and wire representations without manual field-by-field
marshalling. Round-trip unit tests verify the conversion is lossless.
Also adds `store_dir` field to proto `ValidPathInfo` to match the
harmonia type. The queue-runner now asserts the store dir matches
rather than silently overriding it.
Replace local `NarInfo` with `harmonia-store-nar-info`
Consumers now import `StorePath`, `StoreDir`, `OutputName`, etc. directly from `harmonia-store-core` rather than through `nix-utils` re-exports. This makes the actual dependency on harmonia explicit at each use site. `nix-utils` retains its own local functionality (`parse_store_path`, `BaseStore`, `LocalStore`, `realise_drv`, etc.) but no longer acts as a forwarding layer for harmonia types.
Remove harmonia re-exports from `nix-utils`
Replace stringly-typed fields in `store.proto` with proper message types: - `Hash` — algorithm enum + raw digest bytes - `Signature` — key name + signature string - `ContentAddress` — method enum + hash `UnkeyedValidPathInfo.nar_hash` is now `Hash` (not string), `signatures` is `repeated Signature` (not `repeated string`), `ca` is `optional ContentAddress` (not `optional string`). `UnkeyedNarInfo.file_hash`/`file_size` renamed to `download_hash`/`download_size` with proper `Hash` type. All `From`/`TryFrom` impls use `&` for harmonia→proto direction (no need to consume the source) and by-value for proto→harmonia (proto types are consumed). Round-trip tests cover every layer.
Use structured proto types for `Hash`, `Signature`, and `ContentAddress`
Move each test from the monolithic `nixos-tests.nix` into its own file under `nixos-tests/`, with shared VM configuration extracted to `common.nix`. No functional changes to existing tests.
Split NixOS tests into separate files
Remove `nix-utils` C++ FFI, use harmonia daemon protocol
The nix daemons (one for the queue-runner, one for the builder) were started with bare `IPC::Run` harnesses whose stdout/stderr were never flushed to test output. When a daemon misbehaved (e.g. connection reset), there was no log to diagnose it. To fix this have `start_nix_daemon` register the daemon in the caller's `ProcessGroup`. The `ProcessGroup` machinery we made is exactly for this purpose, and we should have had this function use it. In particular, now that it is part of the process group, `pump_logs` will pick it up. Also, there is no need for the separate `$daemon_harness` return value from `start_queue_runner` — callers no longer need to manually `kill_kill` daemon harnesses since `$pg->stop` handles everything — so that return value is removed too. Finally, factor out the newly improved `start_nix_daemon` into its own module, since it should not be more strongly associated with either the queue-runner or the daemon. The queue-runner daemon is labelled `queue-runner daemon` and the builder daemon `builder daemon`, making it easy to tell them apart in test output. Hopefully this helps us debug the spurious issues we're seeing.
hydra-tests: pump nix daemon logs via `ProcessGroup`
build(deps): bump treefmt-nix from `790751f` to `db94781`
The builder was creating a new `ConnectionPool` in every `process_build` call. Under concurrent load each pool opened its own set of daemon connections, so the total could reach `num_builds × (CPU_cores + 1)` — all hitting the same nix daemon and its SQLite database. CI logs (now visible thanks to the daemon log pumping) showed `SQLite database is busy` errors from the builder daemon, which surfaced as `Connection reset by peer` on the builder side. Fix by creating the pool once in `State::new` and storing it as `self.pool`.
Ship build logs to queue-runner via gRPC Capture log messages from the daemon's `build_paths_with_results` `ResultLog` by polling it as a Stream alongside the Future, then zstd-compress the collected output and ship via the existing `BuildLog` gRPC method so the queue-runner can write it to `/nix/var/log/nix/drvs/`. Co-authored-by: amaanq <amaanq@users.noreply.github.com>
hydra-builder: share a single `ConnectionPool` across builds
Use daemon protocol rather than shelling out to build derivation
The old C++ queue-runner in `build-remote.cc` resolved input derivations before sending a build to the remote: `sendInputs()` iterated `inputDrvs`, looked up each output path via `DerivationOutput::path()`, and stuffed them into `inputSrcs` on a `BasicDerivation`. It then copied just `inputSrcs` to the remote and called `buildDerivation` with the in-memory `BasicDerivation` — the `.drv` file was never sent. The Rust rewrite lost this: the builder would call `fetch_drv_requisites` to discover and import the full drv closure from the queue runner's store, requiring the `.drv` file and all its references to be accessible. While this has some advantages — notably, trusted-user mode is needed for the force-resolved derivations (in the input-addressed case) but not the original ones — it also has major disadvantages: - Sends unnecessary data to builders. - Allows builders to (erroneously) rebuild dependencies that should already be built, rather than giving them no choice but to fail if they cannot download the pre-built outputs. We were hitting the second problem in practice. Restore the original design. Resolve derivation inputs on the queue-runner side using `try_resolve_force` (which resolves `Built` input references to concrete store paths for all derivation types), encode the result as protobuf, and send it to the builder as `resolved_drv` (a `nix.store.derivation.v1.Basic`) in the `BuildMessage`. The builder decodes the `BasicDerivation` and calls `build_derivation` via the harmonia daemon protocol, passing it in-memory — no `.drv` file needs to exist in the builder's store. `FetchRequisites` now takes `StorePaths` (the resolved `inputSrcs`) rather than the drv path, so only the actual input closure is fetched. This also improves on the old C++ code for CA floating derivations: `DerivationOutput::path()` returns `nullopt` for those, so the old code couldn't resolve their inputs. `try_resolve_force` properly resolves them via recorded realisations in the DB. For CA floating derivations that resolve to a different drv path, the existing two-phase build dance is preserved. Derivations with `Deferred` outputs (input-addressed depending on a CA derivation) also need their output paths filled in after `try_resolve_force`, then routed through the two-phase dance like CAFloating to detect if resolution changed the drv path. Without this, `$out` is empty and builds fail. Also introduces `store/derivation.proto` with protobuf definitions for `BasicDerivation`, `DerivationOutput`, and `ContentAddressMethodAlgorithm`, along with bidirectional conversion impls between proto and harmonia types. Co-authored-by: Amaan Qureshi <git@amaanq.com>
This adds color-eyre and tracing-error as workspace dependencies. hydra-tracing installs `tracing_error::ErrorLayer` in the subscriber, and calls `color_eyre::install()` from `init()` so both daemons get the report renderer and panic hook.
main now returns color_eyre::Result, so a fatal BuilderError (or a typed error like TracingInitError) is rendered by the color-eyre handler the scaffold installs, with source chain and location. State::new and the rest of the state layer are still on anyhow, so into_report bridges those residual sites by carrying the chain as the report message until they are typed.
flake: bump nixpkgs to 26.05
Send force-resolved `BasicDerivation` to builder via protobuf
hydra-builder: typed errors + eyre/tracing-error reporting
Assisted-by: GNU sed 4.9
queue-runner: replace anyhow with eyre and typed errors
chore: cleanup clippy lints
`!$project eq ""` is actually interpreted as `(!$project) eq ""`, certainly not what was intended here.
Fix precedence issue in Hydra API
It is unclear how bad the problem is (e.g. are we actually building derivations twice, or just failing to mark one things cached?). So let's just skip this part of the cache for now.
Skip the part of the early-cutoff test that fails intermittently
The `StorePath` field of a narinfo holds the full path (`/nix/store/<hash>-<name>`); cache.nixos.org serves it that way and `harmonia_store_nar_info`, used for our own uploads, writes it that way. Our hand-rolled `parse_narinfo` fed the value into `StorePath::from_str`, which only accepts the base name, so every spec-conforming narinfo failed to parse. The queue runner therefore could not read back narinfos from its own binary cache: `has_narinfo` failed for every path, uploads were repeated endlessly and `copy_paths` errors failed the affected builds. Observed on a 120-build nixos-small evaluation against an S3 cache populated by the runner itself: 2067 `invalid value for StorePath` errors and all 120 builds failing with status 2. harmonia has since grown its own `parse_narinfo_txt` (mirroring nix's `nar-info.cc`), so delegate to it instead of maintaining our own. It takes the store directory explicitly, so carry it on `S3CacheConfig`, read from the `store` query parameter of the cache URL (the nix S3 convention, defaulting to `/nix/store`); both the upload and download paths now use `self.cfg.store_dir` rather than threading it in. Besides reading `StorePath` as a full path, this also fixes two ways our parser diverged from nix: `Deriver: unknown-deriver` and an optional `Compression` field.
binary-cache: read narinfos with harmonia's parser
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated PR to keep
nix-nextin sync withmaster.