docs(readme): canonicalize benchmark blog link to uffs.io #482
Open
githubrobbi wants to merge 20 commits into
…lta)
Phase 0 of the two-tier index project. The CSR indexes (trigram / children / ext) are immutable read-optimized layouts, so "incremental maintenance" is an LSM/Lucene-segment redesign: immutable base CSR + mutable delta overlay + tombstones, queried as base ∪ delta minus tombstones, with the existing full rebuild demoted to an occasional compaction step. This turns apply from O(total records) into O(changed).
The doc specifies:
- architecture + per-op semantics
- the search-path integration choke points (trigram_search / children_of / records_with_ext)
- phased delivery (trigram-first for the ~80% win)
- the mandatory oracle test (base+delta must be observationally identical to a full rebuild, and byte-identical after compaction)
- a baseline + timing-regression gate
- the removable IDXDELTA dev-instrumentation convention (build-id, per-apply / per-search timing) mirroring USNFIX
- the WIN dev test-script (idx-delta-verify.rs)
- a tracking table
Junior-dev-executable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y timing
Scaffolding for the incremental-index-maintenance work (design: docs/architecture/incremental-index-maintenance.md): measure first, build later. All dev-only, marked IDXDELTA, removable in Phase 5.
Which-build stamp: a uffs-daemon build.rs emits UFFS_GIT_SHA (short commit + -dirty); startup logs `IDXDELTA build active version=… git=…`. The WIN test-script fails fast if the running daemon lacks it, closing the stale-binary trap we hit during USN testing.
Fine-grained per-apply timing (each meaningful step, not just the rebuild):
- whole-body CLONE (shard.rs: the Arc-swap copies the entire index, the big cost the rebuild timing alone misses and the one base+delta shrinks most)
- per-change LOOP (the O(changed) mutation, timed apart)
- REBUILD (children / paths / trigram / ext, each separately)
Logged in whole microseconds (`*_us`, integers): uffs-core denies float arithmetic, so this respects that policy (and keeps sub-ms loop precision) rather than allow-ing around it; the WIN script renders ms.
Refactor: the rebuild + IDXDELTA timing + batch-summary log move to a new compact_loader/rebuild.rs submodule (cohesive O(n) step; keeps compact_loader.rs under the 800-LOC policy; houses the temp timing for Phase-5 removal). No behaviour change.
scripts/windows/idx-delta-verify.rs: the WIN rig (mirrors usn-verify.rs). Confirms the build, drives escalating create bursts + a rename/delete smoke, extracts the IDXDELTA-TIMING lines, writes _run/baseline.txt for regression detection in later phases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
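The integer-microsecond timing convention described above could look roughly like the following. This is a minimal sketch, not the project's code: `ApplyTiming`, `as_whole_us`, `timed`, and the exact layout of the log line are assumptions made for illustration; only the `*_us` suffix and the `IDXDELTA-TIMING` marker come from the commit message.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-apply timing record; only the `*_us` naming is from the source.
#[derive(Debug, Default)]
struct ApplyTiming {
    clone_us: u64,
    loop_us: u64,
    rebuild_us: u64,
}

/// Convert a duration to whole microseconds using integer math only (no floats).
fn as_whole_us(d: Duration) -> u64 {
    u64::try_from(d.as_micros()).unwrap_or(u64::MAX)
}

/// Run `f`, store its elapsed time in `slot`, and return its result.
fn timed<T>(slot: &mut u64, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot = as_whole_us(start.elapsed());
    out
}

/// Render one log line; the field layout here is an assumption.
fn timing_line(t: &ApplyTiming) -> String {
    format!(
        "IDXDELTA-TIMING clone_us={} loop_us={} rebuild_us={}",
        t.clone_us, t.loop_us, t.rebuild_us
    )
}
```

Keeping the values as integers means a consumer such as the WIN script can divide by 1000 at render time without the core crate ever touching float arithmetic.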
Phase-0 baseline (build 629966b, live C: = 3.89M records) overturned the doc's original assumption that trigram was the ~80% win. Measured per-apply:
  compute_path_lengths    623ms       <- #1, bigger than trigram
  trigram rebuild         378ms
  whole-body clone        166ms       <- hidden by rebuild-only timing
  ext / loop / children   84/62/54ms
  FULL APPLY              ~1367ms     (not the ~600ms guessed)
Re-sequenced §4 phases by measured cost (biggest lever first):
1. incremental compute_path_lengths (per-record + renamed-subtree Δ; NOT a base+delta overlay); full §5.5 junior-dev guide added
2. trigram delta
3. Arc-share the clone
4. ext+children delta
5. unify + re-tune interval
6. remove IDXDELTA dev helpers
Adds the captured numbers as docs/architecture/baselines/incremental-index-2026-06-26.json (the §8 regression reference) and marks the done Phase-0 items (build stamp, timing, WIN rig) in §11.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-apply O(total-records) compute_path_lengths BFS (623ms, the #1 cost in the measured baseline) with an O(changed) per-record update for normal USN poll batches.
- compact.rs: PathChange{idx, subtree} + update_path_lengths_incremental + path_len_from_parent + shift_subtree_path_len (iterative DFS over the children CSR, propagating a directory-rename's length delta to the whole subtree, clamped to u16).
- apply_create / apply_rename thread &mut Vec<PathChange>; create/file-rename push a single O(1) record, directory-rename pushes subtree:true.
- rebuild.rs: rebuild children CSR FIRST (so the subtree walk sees current adjacency), then gate incremental-vs-full path update on a 50k batch threshold; cold loads (empty change set) still take the full BFS.
- Oracle gate (compact_loader_path_oracle_tests.rs): the incremental path_len must be byte-identical to a from-scratch compute_path_lengths rebuild across a batch of dir-rename + create + file-rename. Passes.
IDXDELTA-TIMING now reports paths_us for the incremental path so the WIN rig can confirm the 623ms -> ~0 win against the committed baseline.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
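To make the subtree propagation concrete, here is a minimal sketch of an iterative DFS over a children CSR that shifts every path length by a rename delta and clamps to u16. The `ChildrenCsr` layout, the function signature, and the choice to shift the renamed directory itself together with its descendants are assumptions for illustration; the real shift_subtree_path_len may differ.

```rust
/// Hypothetical CSR adjacency: the children of node `p` are
/// `children[offsets[p]..offsets[p + 1]]`.
struct ChildrenCsr {
    offsets: Vec<usize>,
    children: Vec<u32>,
}

impl ChildrenCsr {
    fn children_of(&self, parent: u32) -> &[u32] {
        let p = parent as usize;
        &self.children[self.offsets[p]..self.offsets[p + 1]]
    }
}

/// Shift the path length of `root` and all of its descendants by `delta`,
/// clamping each result into the u16 range. Uses an explicit stack, so deep
/// directory trees cannot overflow the call stack.
fn shift_subtree_path_len(path_len: &mut [u16], csr: &ChildrenCsr, root: u32, delta: i32) {
    let mut stack = vec![root];
    while let Some(idx) = stack.pop() {
        let shifted = i32::from(path_len[idx as usize]) + delta;
        // After the clamp the value is guaranteed to fit in u16.
        path_len[idx as usize] = shifted.clamp(0, i32::from(u16::MAX)) as u16;
        stack.extend_from_slice(csr.children_of(idx));
    }
}
```

The walk touches only the renamed subtree, which is why the children CSR has to be current before it runs.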
Incremental compute_path_lengths landed (9806bc3); path-len oracle gate is green. Phase 2 (trigram delta) is next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the mutable delta-overlay type that the trigram / ext / children
base+delta search will read through (incremental-index-maintenance
§5.1). compact/delta.rs:
- IndexDelta { trigram, ext, children: FxHashMap<_, Vec<u32>>,
tombstones: FxHashSet<u32>, touched_records }.
- add_record (sorted+deduped binary-search insert across all three
posting maps; root u32::MAX parent adds no child posting),
tombstone (idempotent), is_tombstoned, len/is_empty (compaction
trigger), and the per-key postings accessors.
- The sorted/deduped posting invariant is what makes the eventual
base∪delta merge a linear pass.
Unit-tested (sorted/dedup insert, root sentinel, idempotent tombstone,
rename-as-two-touches). The base∪delta sorted-merge primitive itself
lands in the Phase-2 commit wired directly into trigram_search, so it
is never dead scaffolding. No DriveCompactIndex field yet — that is
added in Phase 2 where each of the ~20 construction sites is touched
once, with the change that gives the field meaning.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
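The sorted, deduplicated posting invariant can be sketched as a binary-search insert. This is an illustration under assumptions, not the contents of compact/delta.rs: it uses std's HashMap/HashSet in place of the Fx variants, models only the trigram map, and the type and method names (`IndexDeltaSketch`, `add_trigram`, `postings`) are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, reduced stand-in for the delta overlay (trigram map only).
#[derive(Default)]
struct IndexDeltaSketch {
    trigram: HashMap<u32, Vec<u32>>,
    tombstones: HashSet<u32>,
}

/// Insert `idx` so that `posting` stays sorted and free of duplicates.
fn insert_sorted_dedup(posting: &mut Vec<u32>, idx: u32) {
    // Ok(_) means the value is already present: nothing to do.
    if let Err(pos) = posting.binary_search(&idx) {
        posting.insert(pos, idx);
    }
}

impl IndexDeltaSketch {
    fn add_trigram(&mut self, key: u32, idx: u32) {
        insert_sorted_dedup(self.trigram.entry(key).or_default(), idx);
    }

    /// Idempotent: tombstoning the same record twice has no further effect.
    fn tombstone(&mut self, idx: u32) {
        self.tombstones.insert(idx);
    }

    fn is_tombstoned(&self, idx: u32) -> bool {
        self.tombstones.contains(&idx)
    }

    /// Posting list for `key`, or an empty slice if the key was never touched.
    fn postings(&self, key: u32) -> &[u32] {
        match self.trigram.get(&key) {
            Some(v) => v.as_slice(),
            None => &[],
        }
    }
}
```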
…se 2 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The first WIN run exposed a 0.5 s-per-apply regression on small batches: two applies (8 and 1 changes, loop_us=0) hit paths_us≈507 ms while every create/rename batch was 1-18 µs. Those were delete-only batches: a delete tombstones its record and pushes no PathChange, so path_changes is empty, and the gate wrongly fell back to the full O(total) compute_path_lengths BFS.
apply_usn_patch is never the cold-load path (build_compact_index does the cold BFS directly), so an empty path_changes during apply means "no record's path_len changed" → the correct work is none. A delete never shifts any surviving record's path_len. Drop the is_empty() arm; the only apply-time full-recompute fallback is now a >50k pathological batch. update_path_lengths_incremental is already a no-op over an empty slice.
Oracle: add delete_only_batch_leaves_path_lengths_correct_without_full_recompute. The shared assert now compares LIVE records only: a tombstoned record's path_len is meaningless (incremental leaves it stale, a full BFS recomputes it as a root); that divergence is correct and excluded.
Expected effect: mean paths drops from 145 ms to sub-ms; full_apply ~800 ms -> ~640 ms (trigram ~390 ms now dominant -> Phase 2).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s HEAD
Kills the stale-binary trap for good. The rig now, before anything else:
- BIN SYNC: resolves the release dir cargo *actually* uses via `cargo metadata` target_directory (honours CARGO_TARGET_DIR / .cargo/*.toml build.target-dir; override with UFFS_RELEASE_DIR), then copies uffs/uffsd (+ uffs-broker/uffsmcp if built) into ~/bin, printing each binary's build mtime. Required bin missing → bail "build first".
- BUILD-ID MATCH GUARD: build-confirmation now extracts git="<sha>" from the IDXDELTA marker and asserts it equals `git rev-parse --short HEAD`; a resident daemon from an older build → hard fail with the fix.
So the WIN loop is just: build → run. No manual `copy C:\rust-target\...\release\* ~/bin`. The target_directory JSON parse is a focused hand-scan (no serde) that unescapes Windows `C:\\..` paths; unit-checked locally.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
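A serde-free hand-scan for the target_directory value might be sketched as below. The function name and the escape handling are assumptions for illustration and the rig's actual parser may differ; `target_directory` is the real field name in `cargo metadata` output. The scan assumes the first occurrence of the key is the top-level one, which a full JSON parser would not need to assume.

```rust
/// Pull the string value of "target_directory" out of `cargo metadata` JSON
/// without a JSON library, unescaping `\\`, `\"` and `\/`.
fn target_directory(json: &str) -> Option<String> {
    let key = "\"target_directory\":";
    let after_key = &json[json.find(key)? + key.len()..];
    // Tolerate optional whitespace between the colon and the opening quote.
    let rest = after_key.trim_start().strip_prefix('"')?;

    let mut out = String::new();
    let mut chars = rest.chars();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Some(out), // closing quote: value complete
            '\\' => match chars.next()? {
                '\\' => out.push('\\'),
                '"' => out.push('"'),
                '/' => out.push('/'),
                // Other escapes are not expected in a path; keep them verbatim.
                other => {
                    out.push('\\');
                    out.push(other);
                }
            },
            _ => out.push(c),
        }
    }
    None // ran out of input before the closing quote
}
```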
…mbing)
Routes every trigram caller through one DriveCompactIndex::trigram_search that reads the base ∪ delta overlay. No behavior change yet: the delta is always None until apply populates it (Phase 2b), so trigram_search is a zero-overhead delegate to the base TrigramIndex::search.
- DriveCompactIndex gains `delta: Option<IndexDelta>` (None on fresh / compacted / cache-loaded; never serialized). All ~20 construction sites updated to `delta: None`.
- trigram_search: when a delta is present, merge per needle-trigram the base posting with the delta posting (delta::merge_postings), intersect (trigram::intersect_in_place, now pub(crate)), then resolve tombstones on the FINAL candidate set, keeping a tombstoned record only if it was re-added under a name covering every needle trigram. This is what lets a renamed file appear under its new name yet vanish from its old one; filtering per posting list would wrongly hide the re-added record.
- trigram.rs: extract the shared needle->trigram packing into needle_trigrams(); expose get_posting + intersect_in_place as pub(crate).
- delta.rs: merge_postings sorted-union (no tombstone, see above).
- Migrate the 3 trigram callers (tree, prefix_search, query) to trigram_search; each previously passed drive.fold, so behavior-identical.
Tests (compact_trigram_delta_tests.rs) pin the overlay semantics with a manually-populated delta: create-visible, rename-visible-under-new-name + gone-from-old, delete-invisible, short-needle None. 867/867 green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
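The merge, intersect, then resolve-tombstones-last order can be illustrated with a small sketch. The function bodies and the `search_overlay` signature are assumptions; in particular, "re-added under a name covering every needle trigram" is modelled here as "present in the delta posting of every needle trigram", which is one plausible reading rather than the confirmed implementation.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Sorted union of two sorted, deduplicated posting lists (linear pass).
fn merge_postings(base: &[u32], delta: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(base.len() + delta.len());
    while i < base.len() && j < delta.len() {
        match base[i].cmp(&delta[j]) {
            Ordering::Less => { out.push(base[i]); i += 1; }
            Ordering::Greater => { out.push(delta[j]); j += 1; }
            Ordering::Equal => { out.push(base[i]); i += 1; j += 1; }
        }
    }
    out.extend_from_slice(&base[i..]);
    out.extend_from_slice(&delta[j..]);
    out
}

/// Keep only the elements of `acc` that also occur in the sorted `other`.
fn intersect_in_place(acc: &mut Vec<u32>, other: &[u32]) {
    acc.retain(|x| other.binary_search(x).is_ok());
}

/// `per_trigram` holds one (base posting, delta posting) pair per needle trigram.
fn search_overlay(per_trigram: &[(&[u32], &[u32])], tombstones: &HashSet<u32>) -> Vec<u32> {
    let mut merged = per_trigram.iter().map(|(b, d)| merge_postings(b, d));
    let Some(mut candidates) = merged.next() else {
        return Vec::new();
    };
    for m in merged {
        intersect_in_place(&mut candidates, &m);
    }
    // Tombstones are resolved on the final set: a tombstoned record survives
    // only if the delta re-added it under every needle trigram.
    candidates.retain(|idx| {
        !tombstones.contains(idx)
            || per_trigram.iter().all(|(_, d)| d.binary_search(idx).is_ok())
    });
    candidates
}
```

With record 1 renamed (tombstoned and re-added in the delta), a search for the new name returns it via the delta postings, while a search for the old name finds it only in the base postings and drops it.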
…dules
Pure, behavior-preserving split of the oversized compact.rs into cohesive compact/ submodules. Every public item is re-exported so the canonical `crate::compact::X` paths used across the workspace are unchanged: no call site outside the module moved.
  compact/record.rs      CompactRecord + NTFS metafile-name allowlist          (189)
  compact/children.rs    ChildrenIndex (CSR parent→children)                   (111)
  compact/extension.rs   ExtensionIndex (CSR ext_id→records)                   (102)
  compact/path_len.rs    compute_path_lengths + Phase-1 incremental fns        (214)
  compact/builder.rs     build_compact_index + ADS/links/shrink/upcase         (422)
  compact.rs             DriveCompactIndex + HeapReport + impl + re-exports    (385)
compact.rs drops off the file-size exception list (was "13 over"; now 385, well under 800). 867/867 uffs-core tests pass unchanged (identical count pre/post, which proves a pure move); clippy -D warnings, rustdoc -D warnings, lint-ci-windows all clean.
This also tidies the tree for Phase 2b: the compact() method + apply delta-population drop cleanly into builder.rs / a slim compact.rs rather than a 1363-line file.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… rebuild
The 338 ms per-apply trigram rebuild is gone. apply_usn_patch now overlays
each batch onto the IndexDelta instead of rebuilding the base:
- DriveCompactIndex::apply_trigram_delta(adds, tombstones): adds each
created/renamed record's new-name trigrams to the delta and masks the
deleted/renamed-away/reused-slot records via tombstones. Folds back to a
fresh base (compact_base) only when the delta crosses
TRIGRAM_COMPACT_THRESHOLD (50k touched records) — so trigram_us is ~0 on
normal applies, a one-off full build on the occasional compaction tick.
- compact_loader/apply.rs: the per-change mutation cluster (StagedCreate,
stage_create, overwrite_slot, apply_{delete,create,rename}) extracted to
a submodule; each apply fn now also collects the trigram tombstone set
(deletes, renames, FRS-reuse overwrites). path_changes doubles as the
trigram-ADD set. compact_loader.rs 826 -> 592 LOC.
- rebuild.rs: replace the TrigramIndex::build call with apply_trigram_delta;
IDXDELTA-TIMING gains a `compacted` flag.
End-to-end oracle (compact_loader_trigram_oracle_tests.rs): a real
apply_usn_patch batch (create + rename + delete), then assert trigram_search
through base + delta returns IDENTICAL candidates to a compacted rebuild —
across created, renamed (new + old name), deleted, and untouched files.
868/868 green; clippy/rustdoc/file-size all clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per the request to stress beyond 1k files:
- BURSTS 10/100/1000 → 1000/10000/100000. The 100k burst crosses TRIGRAM_COMPACT_THRESHOLD (50k) so it also exercises a delta compaction (full trigram refold) under load; the smaller bursts measure steady-state delta-overlay apply (trigram_us ≈ 0).
- Replace the fixed-sleep freshness probe with poll_until_visible: polls a per-round filename prefix until that burst's `count` is search-visible (or a size-scaled budget elapses), so the report shows true creation throughput (files/s) AND apply-to-searchable latency, and flags an apply backlog instead of silently measuring the settle constant.
Also marks Phases 1/2a/2b + the compact.rs decomposition done in the design-doc tracking table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
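A poll-until-visible probe, as opposed to a fixed sleep, can be sketched like this. The signature is an assumption: the real rig presumably shells out to a search for the per-round prefix, which is abstracted here as a `probe` closure returning the number of currently visible files.

```rust
use std::time::{Duration, Instant};

/// Poll `probe` until it reports at least `expected` visible files.
/// Returns Ok(latency) on success, or Err(last_seen_count) when the budget
/// elapses first, which signals an apply backlog.
fn poll_until_visible(
    expected: usize,
    budget: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> usize,
) -> Result<Duration, usize> {
    let start = Instant::now();
    loop {
        let seen = probe();
        if seen >= expected {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= budget {
            return Err(seen);
        }
        std::thread::sleep(interval);
    }
}
```

Because the function returns as soon as the count is reached, the measured latency reflects the apply path rather than a settle constant.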
…n threshold
A burst larger than TRIGRAM_COMPACT_THRESHOLD (e.g. the verify rig's 100k create) would populate the delta with 100k postings only to discard them at the post-population compaction check: pure wasted work.
apply_trigram_delta now checks `pending_delta + batch_size > threshold` up front and, if so, refolds the base directly via compact_base (the records already reflect every change in the batch). This also catches the accumulation case where a small batch tips an already-large delta over the line.
Reduces to compact_base (oracle-proven equivalent to a full rebuild), so the end-of-fn compaction branch is now unreachable and removed. Trigram + path oracles green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
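The up-front check reduces to a single comparison. In this sketch the `ApplyPlan` enum and `plan_apply` function are hypothetical names; the threshold value and the `pending_delta + batch_size > threshold` condition are taken from the commit messages.

```rust
/// Threshold from the commit messages: 50k touched records.
const TRIGRAM_COMPACT_THRESHOLD: usize = 50_000;

/// Hypothetical decision type for illustration.
#[derive(Debug, PartialEq, Eq)]
enum ApplyPlan {
    /// Add the batch to the delta overlay (steady state).
    Overlay,
    /// Skip delta population and refold the base directly.
    CompactBase,
}

/// Decide before touching the delta, so an oversized batch never populates
/// postings that would be discarded by compaction immediately afterwards.
fn plan_apply(pending_delta: usize, batch_size: usize) -> ApplyPlan {
    if pending_delta.saturating_add(batch_size) > TRIGRAM_COMPACT_THRESHOLD {
        ApplyPlan::CompactBase
    } else {
        ApplyPlan::Overlay
    }
}
```

For example, `plan_apply(49_990, 20)` selects `CompactBase`: a small batch tipping an already-large delta over the line is handled by the same branch as a single 100k burst.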
…ounter
- Bin sync: a running uffs-broker (LocalSystem service) holds its exe open,
so the best-effort optional copy hit os error 32 and aborted the whole run
before any measurement. Optional-bin copy failures now warn ("skip …") and
continue; only uffs + uffsd (the rig's actual dependencies) hard-fail.
- Remove the now-unused `total_created` accumulator (each burst polls its own
per-round count) that tripped unused_variables/unused_assignments.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The stale-daemon guard compared the daemon's git SHA to HEAD verbatim, so a HEAD that advanced purely through scripts/ or docs/ (e.g. a verify-rig tweak) falsely flagged a current binary as stale and aborted the run.
The guard now diffs the daemon SHA against HEAD and only bails when a build-affecting path changed (crates/**, Cargo.toml, Cargo.lock, rust-toolchain*); a non-source advance prints "binary is current" and proceeds. Fail-safe: assumes stale if git can't answer.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
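The path filter behind that guard might look like the sketch below. Both function names are hypothetical, and the input is assumed to be the list of changed paths between the daemon's SHA and HEAD (for instance from a name-only git diff), with `None` standing for "git could not answer".

```rust
/// True for paths that can change the compiled binaries, following the
/// patterns listed in the commit message.
fn is_build_affecting(path: &str) -> bool {
    path.starts_with("crates/")
        || path == "Cargo.toml"
        || path == "Cargo.lock"
        || path.starts_with("rust-toolchain")
}

/// `None` means the diff could not be computed: fail safe and assume stale.
fn binary_is_stale(changed_paths: Option<&[&str]>) -> bool {
    match changed_paths {
        None => true,
        Some(paths) => paths.iter().any(|p| is_build_affecting(p)),
    }
}
```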
The old mutate round searched `idx_0_1` to check a deleted file, but that substring matches 111 bulk `idx_0_1*` files, so "expect 0" was a false signal (the live run showed 111, which was correct-by-accident). Replace it with `idxmutate_*` sentinels that share no trigram with the bulk files, and poll-until-applied (visible / absent) instead of a fixed sleep:
- rename idxmutate_src → idxmutate_renamed: expect 'idxmutate_renamed' >= 1
- delete idxmutate_del: expect 'idxmutate_del' → 0
- old name idxmutate_src → 0 (renamed away)
Each now gives a clean pass/fail with the real apply latency. Drops the now-unused SETTLE constant (every probe polls).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daemon clones the whole DriveCompactIndex before each apply (lock-free COW snapshot for readers). That deep-copied the immutable inverted indexes; the trigram CSR alone is ~hundreds of MB on a multi-million-record drive.
Make trigram / children / ext_index `Arc<…>`. The apply path never mutates them in place (it overlays on the delta and only ever *replaces* the whole index at compaction/rebuild), so Arc + replace-the-pointer is a perfect fit: the per-apply clone now pointer-clones these bases (a refcount bump) and only deep-copies records + names + the small delta. Read sites are unchanged: Arc derefs transparently through `.search()` / `.get_posting()` / `&drive.children`.
- compact.rs: field types → Arc; compact_base wraps the refold in Arc::new.
- rebuild.rs: the per-apply children/ext rebuilds wrap in Arc::new.
- builder.rs / compact_cache.rs / fixtures: construction sites wrap in Arc::new.
- New code uses alloc::sync::Arc (workspace lint convention); each touched file keeps its existing Arc import.
Expect `clone_us` to drop materially on the WIN baseline (the CSR portion of the ~135ms clone). 868/868 uffs-core + 333/333 daemon green; clippy -D warnings, rustdoc, lint-prod, lint-ci-windows, file-size all clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
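The effect of Arc-sharing the immutable bases can be shown with a reduced model. `IndexSketch` and its two fields are stand-ins invented for this sketch, and it imports `std::sync::Arc` for self-containment where the commit notes the workspace convention is `alloc::sync::Arc`.

```rust
use std::sync::Arc;

/// Hypothetical, reduced index: one mutable part and one immutable base.
#[derive(Clone)]
struct IndexSketch {
    /// Mutated by every apply, so the snapshot deep-copies it.
    records: Vec<u32>,
    /// Never mutated in place; cloning only bumps a refcount.
    trigram: Arc<Vec<u32>>,
}

/// The per-apply copy-on-write snapshot: readers keep `current` untouched.
fn snapshot_for_apply(current: &IndexSketch) -> IndexSketch {
    current.clone()
}

/// Compaction replaces the pointer instead of mutating the shared base, so
/// older snapshots keep reading the previous base until they are dropped.
fn compact_base(index: &mut IndexSketch, refolded: Vec<u32>) {
    index.trigram = Arc::new(refolded);
}
```

After `snapshot_for_apply`, both indexes point at the same base allocation; after `compact_base` on the new snapshot, the old one still sees the old base.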
…rebuild)
Drop the ~58 ms per-apply ExtensionIndex rebuild. `--ext` queries now read through DriveCompactIndex::records_with_ext (base ∪ delta):
- records_with_ext(ext_id) -> Cow<[u32]>: zero-alloc borrow of the base CSR slice when delta is None; otherwise merges base + delta postings and validates each candidate against the live records (keep iff records[idx].extension_id == ext_id && name_len != 0). That records check is what makes a renamed extension (foo.log -> foo.pdf) and a delete correct WITHOUT an ext tombstone: a stale base posting just fails the check.
- apply_trigram_delta renamed apply_index_delta; it now always adds the ext + children postings (only the trigram postings stay gated on name >= 3 chars), so a short-named create/rename is never missed by --ext.
- compact_base refolds the ext base too; rebuild.rs drops the ext rebuild (ext_us now ~0 in IDXDELTA-TIMING).
- Migrate the 3 ext readers (path_sorted / numeric / path_only top-N) to records_with_ext; the 3 post-apply ext unit tests assert through it.
Oracle extended: records_with_ext through the overlay equals the compacted rebuild for every ext id, across create / rename / delete. 868/868 core + 333/333 daemon; clippy -D warnings, rustdoc, lint-prod, file-size all clean.
Children stays full-rebuilt; Phase 4b moves it onto the overlay (higher care: it feeds FastPathResolver + the Phase-1 subtree walk).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
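The borrow-or-merge shape of records_with_ext can be sketched as a free function. The `Rec` struct, the parameter list, and the sort-plus-dedup merge are simplifications assumed for this example (the real method lives on DriveCompactIndex and may merge linearly); the validation rule is the one quoted in the commit message.

```rust
use std::borrow::Cow;

/// Hypothetical, reduced record: only the fields the validation reads.
struct Rec {
    extension_id: u16,
    name_len: u16,
}

/// Without a delta, borrow the base posting as-is (zero allocation).
/// With a delta, union base + delta and keep only candidates whose live
/// record still has this extension and has not been deleted.
fn records_with_ext<'a>(
    base: &'a [u32],
    delta: Option<&[u32]>,
    records: &[Rec],
    ext_id: u16,
) -> Cow<'a, [u32]> {
    let Some(delta) = delta else {
        return Cow::Borrowed(base);
    };
    let mut merged: Vec<u32> = base.iter().chain(delta).copied().collect();
    merged.sort_unstable();
    merged.dedup();
    // A stale base posting (renamed away or deleted) simply fails this
    // check, so no extension tombstone is required.
    merged.retain(|&idx| {
        let r = &records[idx as usize];
        r.extension_id == ext_id && r.name_len != 0
    });
    Cow::Owned(merged)
}
```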
The site migrated from skyllc-ai.github.io to the canonical uffs.io domain (old URL now 301-redirects). Point the benchmark story link at uffs.io directly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Canonicalizes the public benchmark-blog link in the main README to the uffs.io front door.
Change
https://skyllc-ai.github.io/blog/benchmarking-against-everything/ becomes https://uffs.io/blog/benchmarking-against-everything/ (target verified live, HTTP 200).
Scope notes
- skyllc-ai.github.io references live in gitignored docs/dev/marketing/ artifacts and are legitimate (the GitHub Pages repo name, the www CNAME target, and historical migration notes), so they are intentionally left as-is.
- CHANGELOG.md line 1154 keeps githubrobbi/UltraFastFileSearch on purpose: it is a historical fact about the pre-org-move fork.
Part of the cross-repo link-canonicalization pass (the three pinned repos were already updated). This was the last remaining item; it was blocked only by an unrelated red lint gate that is now green.
🤖 Generated with Claude Code