Experimental performance family (default-off): load-balance infrastructure, active-box windowing, block-structured AMR, hybrid WENO/Riemann sensors #1628
Draft
sbryngelson wants to merge 89 commits into
…, correct growth comment)
…inal-review fixes)
…n at the c/f boundary
…on (regrid-ready, behavior-identical)
…, in-place rebuild with overlap copy
…tate-based reflux apply (inert)
…, time-interpolated ghosts, accumulated reflux
…validated with fine-level advance)
…ex sweep, owner-guarded fine machinery
…lamp, owner-only rebuild
…havior-identical)
…bracketed coupling (M1); lift GPU prohibit
…p, golden coverage
…sters, on-device ghost fill/RK/restriction
…fine halo exchange, distributed flux registers
…anning patches move freely; drop owner-window model
…alance coupling, deterministic feasibility clamp, fine-advance rank timing
# Conflicts: # src/common/m_mpi_common.fpp
…tes + 3D golden case
…m-preserving alpha prolongation, mpp_lim-gated consistency
…ersistence (SP10)
…ay); num_patches=1 behavior-identical
…ion merge, per-slot advance (SP12a)
…sambiguate from IC patches; golden values unchanged
Summary
This PR adds an opt-in, default-off family of performance features and the measurement
infrastructure they rest on. With all flags at their defaults the only touched production
path is s_mpi_decompose_computational_domain, refactored to compute its equal split through
the new m_box module (byte-identical; covered by the existing suite).
Load-balance infrastructure (common + sim):
- m_box: t_box + partition arithmetic; shared by the decomposer, AMR, and the weighted splitter.
- m_load_weight + load_weight_wrt: per-cell load-weight field (active-box, EL-bubble, IB, phase-change Newton-iteration contributors) with field output and a per-rank imbalance metric.
- m_sfc_partition + sfc_partition_wrt: Morton-SFC tile ordering and chains-on-chains balanced partition, reported as a predicted-imbalance diagnostic.
- m_load_balance + load_balance: experimental weighted static Cartesian decomposition at init (requires parallel_io), with a min-cells feasibility floor and, when AMR is on, fine-work-aware weighting with a deterministic feasibility clamp.
- m_rank_timing + rank_time_wrt: per-rank compute-time diagnostic (halo exchange excluded; device-synced on GPU).
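The Morton ordering and chains-on-chains partition named above can be illustrated with a small sketch. This is not the m_sfc_partition code: the function names, the 2D tile indexing, the bisection on the bottleneck load, and the max/mean imbalance definition are all assumptions made for illustration.

```python
def morton_key(i, j, bits=16):
    """Interleave the bits of tile indices (i, j) into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)
        key |= ((j >> b) & 1) << (2 * b + 1)
    return key


def _fits(weights, nparts, cap):
    """True if the chain can be cut into <= nparts contiguous pieces of load <= cap."""
    pieces, running = 1, 0.0
    for w in weights:
        if w > cap:
            return False
        if running + w > cap:
            pieces += 1
            running = w
        else:
            running += w
    return pieces <= nparts


def chains_on_chains(weights, nparts, iters=60):
    """Contiguous partition of a weight chain; bisects on the bottleneck load."""
    lo, hi = float(max(weights)), float(sum(weights))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if _fits(weights, nparts, mid):
            hi = mid  # hi always stays feasible
        else:
            lo = mid
    # Greedy fill at the feasible cap found above (same rule as _fits).
    loads, running = [], 0.0
    for w in weights:
        if running + w > hi:
            loads.append(running)
            running = w
        else:
            running += w
    loads.append(running)
    return loads + [0.0] * (nparts - len(loads))


def imbalance(loads):
    """Assumed metric: maximum load divided by mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


def sfc_partition(tile_weights, nparts):
    """Order tiles along the Morton curve, then split the resulting chain."""
    order = sorted(tile_weights, key=lambda t: morton_key(*t))
    return order, chains_on_chains([tile_weights[t] for t in order], nparts)
```

For example, `imbalance(chains_on_chains(w, nparts))` gives a predicted imbalance for a list `w` of per-tile weights, which is the kind of number a diagnostic like this would report.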
Active-box windowing (sim):
- m_active_box + active_box: restricts reconstruction/Riemann/RK windows to a light-cone-grown box around non-ambient flow; a debug tripwire guards under-growth. Golden-tested (ECABA006) to stay a strict subset while matching the full-domain solution.
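A minimal sketch of the light-cone idea, under stated assumptions: the box is the bounding box of cells that differ from an ambient value, it is grown by the number of cells the fastest signal can cross before the next update plus a stencil margin, and a tripwire checks that no non-ambient cell lies outside it. The names (`Box`, `grow_light_cone`, `tripwire_ok`) and the 2D nested-list field are hypothetical and do not come from m_active_box.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    lo: tuple  # inclusive lower cell index per dimension
    hi: tuple  # inclusive upper cell index per dimension


def non_ambient_box(field, ambient, tol):
    """Bounding box of cells whose value differs from ambient by more than tol."""
    hits = [(i, j) for i, row in enumerate(field)
            for j, v in enumerate(row) if abs(v - ambient) > tol]
    if not hits:
        return None
    rows = [i for i, _ in hits]
    cols = [j for _, j in hits]
    return Box((min(rows), min(cols)), (max(rows), max(cols)))


def grow_light_cone(box, shape, max_speed, dt, dx, nsteps, stencil):
    """Grow by the cells a signal can cross in nsteps steps, plus a stencil margin."""
    lo, hi = [], []
    for d, n in enumerate(shape):
        reach = math.ceil(max_speed * dt * nsteps / dx[d]) + stencil
        lo.append(max(0, box.lo[d] - reach))       # clamp to the domain
        hi.append(min(n - 1, box.hi[d] + reach))
    return Box(tuple(lo), tuple(hi))


def tripwire_ok(field, ambient, tol, box):
    """Debug check: every non-ambient cell must lie inside the box."""
    for i, row in enumerate(field):
        for j, v in enumerate(row):
            inside = box.lo[0] <= i <= box.hi[0] and box.lo[1] <= j <= box.hi[1]
            if abs(v - ambient) > tol and not inside:
                return False
    return True
```

If the tripwire fails, the growth rule was too small for the actual wave speed, which is the under-growth condition the description mentions.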
Block-structured AMR (sim):
- m_amr + m_amr_registers + amr: a two-level 2:1 refined hierarchy with conservative restriction / conservative-linear prolongation, per-stage flux registers with Berger–Colella refluxing, gradient-based dynamic regrid (amr_regrid_int, amr_tag_eps, amr_buf), optional dt/2 subcycling (amr_subcycle), multi-rank operation with a mirror-decomposed fine level (patches may span rank boundaries; fine halo exchange; distributed flux registers; rank-local regrid), and GPU builds (device-resident fine fields and registers, on-device ghost fill/RK/restriction). Requires WENO, SSP-RK3, model_eqns=2, single fluid (checker-enforced).
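The conservation properties behind the 2:1 transfer operators and the reflux correction can be shown in 1D. This is a sketch, not the m_amr implementation: the minmod slope limiter is an assumption (the description only says "conservative-linear"), and the function names are hypothetical.

```python
def restrict(fine):
    """Conservative 2:1 restriction: a coarse value is the mean of its two children."""
    return [0.5 * (fine[2 * i] + fine[2 * i + 1]) for i in range(len(fine) // 2)]


def minmod(a, b):
    """Smaller-magnitude slope when signs agree, else zero (assumed limiter)."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def prolong(coarse):
    """Conservative-linear prolongation: children sit at -/+ a quarter cell,
    so their mean reproduces the parent exactly."""
    n = len(coarse)
    fine = []
    for i, u in enumerate(coarse):
        left = coarse[i - 1] if i > 0 else u
        right = coarse[i + 1] if i < n - 1 else u
        slope = minmod(u - left, right - u)
        fine += [u - 0.25 * slope, u + 0.25 * slope]
    return fine


def reflux(u_coarse, dt, dx, flux_coarse, flux_fine, face_is_high):
    """Replace the coarse flux used at a coarse/fine face by the fine flux.

    flux_fine is the fine-level flux at the same face (time-averaged over
    substeps when subcycling). face_is_high is True when the face is the
    high-side face of the coarse cell being corrected.
    """
    sign = -1.0 if face_is_high else 1.0
    return u_coarse + sign * dt / dx * (flux_fine - flux_coarse)
```

Applying `restrict` to the output of `prolong` returns the original coarse data, and the reflux correction makes the coarse update identical to one computed with the fine flux at that face.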
Hybrid reconstruction/flux sensors (sim):
- hybrid_weno (+ hybrid_weno_eps): linear-optimal reconstruction in smooth cells, full WENO only at flagged discontinuities (Jameson-type density+pressure sensor, stencil-dilated, halo-aware).
- hybrid_riemann (+ hybrid_smooth_flux): cheap central/Rusanov flux in smooth cells, full HLLC at discontinuities (5- and 6-equation blocks).
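A Jameson-type sensor with stencil dilation can be sketched in 1D as follows. The normalization, the threshold semantics, and the dilation radius are assumptions for illustration; the PR's exact formula is not given here, and halo handling is omitted (the end cells are simply left unflagged by the raw sensor).

```python
def jameson(q):
    """Normalized second difference of q; large near discontinuities."""
    n = len(q)
    phi = [0.0] * n
    for i in range(1, n - 1):
        num = abs(q[i + 1] - 2.0 * q[i] + q[i - 1])
        den = abs(q[i + 1]) + 2.0 * abs(q[i]) + abs(q[i - 1]) + 1e-30
        phi[i] = num / den
    return phi


def flag_cells(rho, p, eps, radius):
    """Flag cells where the density or pressure sensor exceeds eps,
    then dilate the flags by `radius` cells so whole stencils are covered."""
    raw = [max(a, b) > eps for a, b in zip(jameson(rho), jameson(p))]
    n = len(raw)
    return [any(raw[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]
```

Flagged cells would take the full WENO/HLLC path and unflagged cells the cheap path. The dilation matters because a reconstruction stencil centered on a smooth cell can still reach across a nearby discontinuity.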
Motivation
Measured rank imbalance on heterogeneous-cost workloads (bubbles, IB, phase change)
motivates first-class measurement tools; the active box and hybrid sensors give direct
speedups on localized-flow / mostly-smooth cases; AMR concentrates resolution where the
flow needs it, and the load-balance coupling keeps the refined work spread across ranks.
Testing
- 5ECBB926 (AMR static patch), 1CBACEB5 (AMR dynamic regrid), 852CCB81 (AMR subcycling), ECABA006 (active_box 3D strict-subset).
- … pass; load_balance + amr np=2 end-to-end smoke produces the analytically predicted weighted offsets and completes; amr np=2 spanning-patch run completes.
- … executable) verified locally (see PR checks for the full matrix).
- case_validator entries, case.md docs, and module_categories are included.
Known-untested configurations
Delegated to CI: Cray ftn, Intel ifx, AMD flang, OpenMP target offload, single/mixed
precision. Hybrid WENO/Riemann ship without a dedicated golden case (flagging for
reviewer judgment; the sensors are default-off and checker-guarded).
Review guide
The 75 commits are arc-ordered and cleanly arc-separable — reviewing by arc is much
easier than by file:
- 2760da7d…2bb5fdc4 active-box (11)
- bbf6b2a9…14b837c6 load-weight field + contributors (8)
- 0161fac0…2795e266 SFC partition diagnostic (6)
- 6df9c1f0…c43c02a5 weighted decomposition (load_balance) (8)
- 95398eb3…cc7882d1 rank timing (4)
- 21c60ffa…5082b535 hybrid WENO/Riemann (10)
- 74b58771…de244407 m_box refactor + validation hygiene (4)
- 352f564e…03b59516 AMR: static hierarchy → restriction/prolongation → fine advance → refluxing → regrid → subcycling → multi-rank → GPU → mirror decomposition → load-balance coupling (20)
- a1a7e3ad merge of upstream/master (num_procs_x/y/z promotion adopted from Fix periodic ib issues #1618)
Addendum: features added after the initial draft
- … mpp_lim required for num_fluids > 1; shock–material-interface demo validated. Known bounded limitation: alpha-sum deviation up to ~5.7e-3 at coarse cells historically hosting a patch face during shock crossing (non-growing, mpp_lim-damped; the volume-fraction K-term is deliberately not refluxed because it is non-conservative).
Further additions
- … num_procs required (np-flexible restart is future work).
- viscous prohibition lifted; viscous stress/work is refluxed through the existing registers (it enters rhs as a flux_src_n face-flux difference, the same form as the advective flux) so coarse/fine boundaries match total flux; energy conservation 0.0, accuracy triplet coarse 2.49e-4 ≫ two-level 6.89e-5 ≈ fine 5.04e-5. A fine-ghost-coordinate bug (viscous gradient using stale coarse dx at the fine subdomain/patch edge; invisible to WENO, which uses only interior dx) was found by an np=2 exactness probe and fixed; the fine viscous seam is now byte-exact across ranks. Residual: a bounded (~1e-6) np-dependence remains only at the coarse/fine patch boundary from prolongation-derived ghost gradients (AMR's inherently-approximate coupling zone); the density-gradient tagger senses shear poorly (buffered/static patch recommended; error-estimator taggers are future work).
Multi-block AMR + terminology
- amr_max_blocks (default 4; N fixed-size slots, ~N× device memory; compute efficiency is the goal, memory efficiency a follow-up), amr_cluster_eff (default 0.7). Fine blocks stay ≥ buff_size apart ⇒ no fine–fine coupling; all existing per-block machinery (multi-rank, GPU, subcycle, viscous, multi-fluid) loops over the block list unchanged.
- … amr_block_beg/end, amr_max_blocks), disambiguated from MFC's initial-condition patch_icpp. (Draft-stage rename; golden values unchanged.)
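The parameter constraints stated in this description can be collected into a small pre-flight check. This is a hypothetical helper, not the PR's case_validator code. It assumes MFC's usual case-dictionary conventions ("T"/"F" strings for logicals), assumes that time_stepper = 3 selects SSP-RK3, and assumes the mpp_lim rule applies when amr is on. The WENO requirement is not checked because the description does not name the key that selects it.

```python
def check_case(case):
    """Return a list of constraint violations for an MFC-style case dictionary."""
    def on(key):
        # Logical flags are assumed to be "T"/"F" strings, default off.
        return case.get(key, "F") == "T"

    errors = []
    if on("load_balance") and not on("parallel_io"):
        errors.append("load_balance requires parallel_io")
    if on("amr"):
        if case.get("model_eqns") != 2:
            errors.append("amr requires model_eqns = 2")
        if case.get("time_stepper") != 3:  # assumed to mean SSP-RK3
            errors.append("amr requires SSP-RK3 (time_stepper = 3)")
        if case.get("num_fluids", 1) > 1 and not on("mpp_lim"):
            errors.append("amr with num_fluids > 1 requires mpp_lim")
    return errors
```

Since every feature in this PR is default-off, a case that sets none of these flags yields an empty list.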