Skip to content

Vectorize comet propagation to fix locked-on-target CPU hog#470

Merged
brickbots merged 4 commits into
mainfrom
fix/vectorize-comet-propagation
Jun 16, 2026
Merged

Vectorize comet propagation to fix locked-on-target CPU hog#470
brickbots merged 4 commits into
mainfrom
fix/vectorize-comet-propagation

Conversation

@brickbots

@brickbots brickbots commented Jun 15, 2026

Copy link
Copy Markdown
Owner

⚠️ Attribution update. This PR's original write-up framed the comet CPU hog as the cause of the 2.6 whole-OS freeze. A follow-up real-hardware endurance test — run after this fix, with the hog already gone — found a second, distinct failure mode: an IMU-process memory leak (~16 MB/min), fixed in #472. On a 2 GB Pi that leak exhausts RAM → swap-death / OOM, which is the more likely cause of the hard freeze. Revised picture: this PR fixes the comet CPU hog (sustained high CPU, UI starvation, heat); #472 fixes the memory leak (the OOM/freeze). Both are needed for a healthy long observing session. The original analysis below — including the “not memory” line — predates the endurance finding.

Problem

During observing — whenever PiFinder has a solve / GPS lock — the comet catalog continuously pegs a CPU core. Root cause (py-spy profiled + ablation-proven in a prior session): calc_comets advanced all 957 comets through a per-comet skyfield Kepler-propagation loop (2 .at(t) calls each, ~63–70 s total). Worse, while the catalog was still initialized=False, the 5 s TimerMixin tick kept queueing do_timed_task calls on _task_lock during each multi-second init, so they ran back-to-back continuously. Being pure-Python skyfield, it held the GIL and starved the UI loop. Regressed by 4aafd89a ("replace incremental comet update with full reinit", #391).

Sustained one-core load (heat + GIL-held UI starvation) makes observing sluggish, and was initially diagnosed as the cause of the whole-OS freeze. The endurance follow-up (#472) revised that: the hard freeze was primarily a memory leak in the IMU process. The comet hog is a real, distinct pathology and a compounding factor — but this PR alone does not resolve the OOM/freeze.

Fix

Replace the per-comet loop with a single batched numpy/skyfield propagation. Skyfield 1.45's propagate() is array-aware and mpc._comet_orbits() builds one KeplerOrbit over all rows, so every comet is advanced in one call.

  • Positions are read straight from the equatorial state vectors and rotated by ts.J2000.M to reproduce the old radec(ts.J2000) framing exactly.
  • The magnitude cut is phrased as ~(mag > 15) to preserve the old quirk of keeping unknown/NaN-magnitude comets (nan > 15 is False).
  • The old per-comet process_comet is retained as a tolerant fallback (one bad MPC row can't break the whole batch) and as the test oracle.
  • Cadence/threshold logic is unchanged — the per-calc cost was the whole problem; with a fast calc, initialized flips immediately and the cadence settles to the intended 293 s.

Validation

  • Unit benchmark (957 real comets): calc_comets ~63–70 s → 0.13 s (~500×), output byte-identical: max RA/Dec separation 0.0000″, |Δmag| = 0, distances match to ~1e-14 AU, identical comet set.
  • Headless A/B (debug camera + fake GPS, same harness, only the code differs): main-process CPU ~100 % sustained → 1–2 %; py-spy went from stumpff/propagate/_at dominating to no comet/skyfield frames.
  • Real hardware (service + Tools → Test Mode): AFTER 2 %, no skyfield frames (the prior session's real-hardware BEFORE was 90 %).

Tests

tests/test_comets.py pins the vectorized path to the per-comet oracle (RA/Dec/mag/distance parity over a spread of eccentricities — also guards against a future skyfield release breaking the batched call) and asserts the NaN-magnitude inclusion quirk and the no-location empty-result path.

ruff check/ruff format clean; 430 unit tests pass (incl. 4 new); mypy clean on the changed files.

Related

🤖 Generated with Claude Code

brickbots and others added 4 commits June 14, 2026 21:24
calc_comets advanced all 957 comets through a per-comet skyfield
Kepler-propagation loop (2 .at(t) calls each). While the comet catalog
was still uninitialized, the 5s TimerMixin tick kept queueing
do_timed_task calls on _task_lock during each multi-second init, so they
ran back-to-back and -- being pure-Python skyfield holding the GIL --
pegged a CPU core continuously whenever PiFinder had a solve / GPS lock
(i.e. exactly during observing). This was the 2.6 whole-OS observing
hang (regressed by 4aafd89, #391).

Replace the loop with a single batched numpy/skyfield propagation:
mpc._comet_orbits() builds one KeplerOrbit over all rows and skyfield's
propagate() is array-aware, so every comet is advanced in one call.
Positions are rotated by ts.J2000.M to reproduce the old radec(ts.J2000)
framing exactly; the magnitude cut keeps the old ~(mag > 15) semantics
(unknown/NaN-magnitude comets are still included). The old per-comet
process_comet path is retained as a tolerant fallback and test oracle.

957-comet calc: ~63-70s -> 0.13s (~500x), output byte-identical
(max RA/Dec separation 0.0000", dmag 0, distances ~1e-14 AU). Headless
A/B main-process CPU: ~100% sustained -> 1-2%; real hardware (Tools ->
Test Mode): handoff-measured 90% -> 2%. skyfield/keplerlib no longer
appears in py-spy.

Add tests/test_comets.py pinning the vectorized path to the per-comet
oracle (guards skyfield-version drift) and the NaN-magnitude inclusion
quirk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tests

- calc_comets fallback now logs at error + exc_info with a distinctive
  marker. The Comets logger inherits root=ERROR, so the prior warning was
  suppressed in production -- a latent bug in the vectorized path could have
  silently run the slow per-comet path (~minute of CPU) every cycle.
- Drop the unused "row"/"orbital_elements" keys from both result dicts
  (nothing reads them; the vectorized path no longer builds a discarded
  per-comet Series, and the two paths now share an identical schema).
- _calc_comets_vectorized: early-return {} on an empty dataframe (self-safe;
  calc_comets already guards before calling it).
- Correct the docstring's account of why the old loop pegged a core: the
  5s-while-uninitialised timer + _task_lock requeue, not the 293s interval.
- Add tests for the N==1 squeeze branch, the vectorized->per-comet fallback,
  and comet_names filtering (previously uncovered).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The comet tests loaded the production comet file
(PiFinder.utils.comet_file -> astro_data/comets.txt), which is gitignored
and only present on a live unit after a download. CI therefore had no comet
data and 6 of 7 tests failed with FileNotFoundError.

Check in a real MPC CometEls.txt snapshot as a test fixture under
python/tests/data/comets.txt — deliberately *outside* astro_data/ so no
production code path reads it. An autouse fixture monkeypatches
comets.comet_file onto it for the test module only, leaving the real
download path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the file's test-only nature obvious at a glance and avoid any
confusion with the production astro_data/comets.txt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@brickbots

Copy link
Copy Markdown
Owner Author

Heads-up for reviewers: the follow-up real-hardware endurance test found a second, distinct failure mode — an IMU-process memory leak (~16 MB/min) — now believed to be the primary driver of the OOM/freeze this PR was originally credited with fixing. Fixed in #472 (which also rides a small contrast-reserve log-spam guard). I've updated the description above to match. TL;DR: this PR fixes the comet CPU hog; #472 fixes the memory leak; both are needed for a healthy long observing session.

@mrosseel

Copy link
Copy Markdown
Collaborator

looks good, great catch!

@brickbots brickbots merged commit 194c9b5 into main Jun 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants