Vectorize comet propagation to fix locked-on-target CPU hog#470
Merged
Conversation
calc_comets advanced all 957 comets through a per-comet skyfield Kepler-propagation loop (2 .at(t) calls each). While the comet catalog was still uninitialized, the 5s TimerMixin tick kept queueing do_timed_task calls on _task_lock during each multi-second init, so they ran back-to-back and -- being pure-Python skyfield holding the GIL -- pegged a CPU core continuously whenever PiFinder had a solve / GPS lock (i.e. exactly during observing). This was the 2.6 whole-OS observing hang (regressed by 4aafd89, #391). Replace the loop with a single batched numpy/skyfield propagation: mpc._comet_orbits() builds one KeplerOrbit over all rows and skyfield's propagate() is array-aware, so every comet is advanced in one call. Positions are rotated by ts.J2000.M to reproduce the old radec(ts.J2000) framing exactly; the magnitude cut keeps the old ~(mag > 15) semantics (unknown/NaN-magnitude comets are still included). The old per-comet process_comet path is retained as a tolerant fallback and test oracle. 957-comet calc: ~63-70s -> 0.13s (~500x), output byte-identical (max RA/Dec separation 0.0000", dmag 0, distances ~1e-14 AU). Headless A/B main-process CPU: ~100% sustained -> 1-2%; real hardware (Tools -> Test Mode): handoff-measured 90% -> 2%. skyfield/keplerlib no longer appears in py-spy. Add tests/test_comets.py pinning the vectorized path to the per-comet oracle (guards skyfield-version drift) and the NaN-magnitude inclusion quirk. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tests
- calc_comets fallback now logs at error + exc_info with a distinctive
marker. The Comets logger inherits root=ERROR, so the prior warning was
suppressed in production -- a latent bug in the vectorized path could have
silently run the slow per-comet path (~minute of CPU) every cycle.
- Drop the unused "row"/"orbital_elements" keys from both result dicts
(nothing reads them; the vectorized path no longer builds a discarded
per-comet Series, and the two paths now share an identical schema).
- _calc_comets_vectorized: early-return {} on an empty dataframe (self-safe;
calc_comets already guards before calling it).
- Correct the docstring's account of why the old loop pegged a core: the
5s-while-uninitialised timer + _task_lock requeue, not the 293s interval.
- Add tests for the N==1 squeeze branch, the vectorized->per-comet fallback,
and comet_names filtering (previously uncovered).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The comet tests loaded the production comet file (PiFinder.utils.comet_file -> astro_data/comets.txt), which is gitignored and only present on a live unit after a download. CI therefore had no comet data and 6 of 7 tests failed with FileNotFoundError. Check in a real MPC CometEls.txt snapshot as a test fixture under python/tests/data/comets.txt — deliberately *outside* astro_data/ so no production code path reads it. An autouse fixture monkeypatches comets.comet_file onto it for the test module only, leaving the real download path untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the file's test-only nature obvious at a glance and avoid any confusion with the production astro_data/comets.txt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
Author
|
Heads-up for reviewers: the follow-up real-hardware endurance test found a second, distinct failure mode — an IMU-process memory leak (~16 MB/min) — now believed to be the primary driver of the OOM/freeze this PR was originally credited with fixing. Fixed in #472 (which also rides a small contrast-reserve log-spam guard). I've updated the description above to match. TL;DR: this PR fixes the comet CPU hog; #472 fixes the memory leak; both are needed for a healthy long observing session. |
Collaborator
|
looks good, great catch! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
During observing — whenever PiFinder has a solve / GPS lock — the comet catalog continuously pegs a CPU core. Root cause (py-spy profiled + ablation-proven in a prior session):
calc_cometsadvanced all 957 comets through a per-comet skyfield Kepler-propagation loop (2.at(t)calls each, ~63–70 s total). Worse, while the catalog was stillinitialized=False, the 5 sTimerMixintick kept queueingdo_timed_taskcalls on_task_lockduring each multi-second init, so they ran back-to-back continuously. Being pure-Python skyfield, it held the GIL and starved the UI loop. Regressed by4aafd89a("replace incremental comet update with full reinit", #391).Sustained one-core load (heat + GIL-held UI starvation) makes observing sluggish, and was initially diagnosed as the cause of the whole-OS freeze. The endurance follow-up (#472) revised that: the hard freeze was primarily a memory leak in the IMU process. The comet hog is a real, distinct pathology and a compounding factor — but this PR alone does not resolve the OOM/freeze.
Fix
Replace the per-comet loop with a single batched numpy/skyfield propagation. Skyfield 1.45's
propagate()is array-aware andmpc._comet_orbits()builds oneKeplerOrbitover all rows, so every comet is advanced in one call.ts.J2000.Mto reproduce the oldradec(ts.J2000)framing exactly.~(mag > 15)to preserve the old quirk of keeping unknown/NaN-magnitude comets (nan > 15is False).process_cometis retained as a tolerant fallback (one bad MPC row can't break the whole batch) and as the test oracle.initializedflips immediately and the cadence settles to the intended 293 s.Validation
calc_comets~63–70 s → 0.13 s (~500×), output byte-identical: max RA/Dec separation 0.0000″, |Δmag| = 0, distances match to ~1e-14 AU, identical comet set.stumpff/propagate/_atdominating to no comet/skyfield frames.Tests
tests/test_comets.pypins the vectorized path to the per-comet oracle (RA/Dec/mag/distance parity over a spread of eccentricities — also guards against a future skyfield release breaking the batched call) and asserts the NaN-magnitude inclusion quirk and the no-location empty-result path.ruff check/ruff formatclean; 430 unit tests pass (incl. 4 new); mypy clean on the changed files.Related
🤖 Generated with Claude Code