fix(submissions): prevent silent submission losses by AybH26 · Pull Request #2408 · codalab/codabench

AybH26 · 2026-06-15T09:53:02Z

Title

fix(submissions): prevent silent submission losses

Summary

Fixes the silent submission-loss incident first reported in 2025 (~30 %
of submissions stuck in `Submitting`/`Submitted`/`Scoring` with no
traceable error).

Root causes

The incident wasn't one bug — it was a chain of five compounding issues
across the Django site worker, Celery beat, and the compute worker.
Each could silently drop a submission on its own; combined they turned
idle traffic into a ~30 % loss rate.

| ID | Layer | Bug |
|---|---|---|
| F1 | site_worker | `watchmedo` wrapped Celery prefork → SIGTERM not forwarded → channel torn down without acking the in-flight `run_submission` |
| F2 | beat | `CELERY_BEAT_SCHEDULE` published `refresh_compute_worker_health` (does not exist) → `KeyError` killed the consumer channel |
| F3 | site_worker | Worker was OOM-killed under the 256 MB memory limit with prefork(2) + beat → orphan `Submitting` rows, `celery_task_id=None` |
| M1.A | Django watchdog | No recovery path for submissions stuck in `Scoring`/`Running` after a worker crash; they stayed stuck forever |
| M1.B | compute_worker | (a) `urllib3.Retry` excludes PATCH by default, so the final `PATCH status=Finished` was never retried. (b) `_update_status` swallowed every exception, so `task_acks_late=True` acked tasks whose terminal update had failed. |
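
Root cause M1.B (a) can be checked against `urllib3` directly. The snippet below is a minimal sketch, assuming urllib3 1.26 or later (where `allowed_methods` and `Retry.DEFAULT_ALLOWED_METHODS` exist). Extending the default set with PATCH is an assumption made for illustration; the exact set used in this PR is not reproduced here, and `retries_on_503` is a hypothetical helper.

```python
from urllib3.util.retry import Retry

STATUS_FORCELIST = (500, 502, 503, 504)

# Default policy: PATCH is not in Retry.DEFAULT_ALLOWED_METHODS,
# so a 503 on a PATCH request is not retried.
default_retry = Retry(total=5, backoff_factor=2, status_forcelist=STATUS_FORCELIST)

# Assumed fix: the default methods plus PATCH (illustrative choice only).
patch_retry = Retry(
    total=5,
    backoff_factor=2,
    status_forcelist=STATUS_FORCELIST,
    allowed_methods=Retry.DEFAULT_ALLOWED_METHODS | {"PATCH"},
)


def retries_on_503(retry: Retry, method: str) -> bool:
    """Hypothetical helper: would this policy retry `method` after a 503?"""
    return retry.is_retry(method, 503)
```

With these two policies, `retries_on_503(default_retry, "PATCH")` is false while `retries_on_503(patch_retry, "PATCH")` is true, which is the behaviour difference the table row describes.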

Changes

F1 — `docker-compose.yml`

- Drop the `watchmedo` wrapper around the Celery prefork pool on `site_worker`.
- Subscribe to both the `site-worker` queue and the default `celery` queue so beat-published tasks land where the worker consumes.

F2 — `src/settings/base.py`

- Comment out the `refresh_compute_worker_health` beat entry (the task does not exist anywhere in the codebase).
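
A guard against this class of bug could compare the beat schedule with the set of registered task names. The following is a minimal sketch, not code from this PR: `find_unregistered_beat_tasks`, the example schedule, and the dotted task path are all hypothetical. With a real Celery app, the registered names would come from the keys of `app.tasks`.

```python
from typing import Dict, Iterable


def find_unregistered_beat_tasks(
    beat_schedule: Dict[str, dict], registered_tasks: Iterable[str]
) -> Dict[str, str]:
    """Return {entry_name: task_name} for beat entries whose task is not registered."""
    registered = set(registered_tasks)
    return {
        entry_name: entry["task"]
        for entry_name, entry in beat_schedule.items()
        if entry["task"] not in registered
    }


# Hypothetical schedule in the shape of CELERY_BEAT_SCHEDULE.
schedule = {
    "refresh_compute_worker_health": {
        "task": "refresh_compute_worker_health",  # no such task is registered
        "schedule": 60.0,
    },
    "reaper_stuck_scoring": {
        "task": "competitions.tasks.reaper_stuck_scoring",  # assumed task path
        "schedule": 300.0,
    },
}
registered = ["competitions.tasks.reaper_stuck_scoring"]

missing = find_unregistered_beat_tasks(schedule, registered)
```

Running a check like this in a test or at startup would surface a dangling schedule entry before beat publishes it.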

F3 — `docker-compose.yml`

- Raise the `site_worker` memory limit from 256M to 1G.

M1.A — `src/apps/competitions/tasks.py` + `src/settings/base.py`

- New `reaper_stuck_scoring` Celery task on the `site-worker` queue, soft limit 5 min.
- Re-dispatches any submission stuck in `Scoring`/`Running` past `started_when + execution_time_limit + threshold_minutes`.
- Scheduled every 5 min in `CELERY_BEAT_SCHEDULE` with `threshold_minutes=30`.
- Annotates `Submission.status_details` with a reaper marker for /admin and API visibility.
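
The staleness rule above can be sketched without Django. This is an illustration under stated assumptions, not the task added in this PR: `SubmissionSnapshot` and `is_stuck` are hypothetical names, and `execution_time_limit` is assumed to be expressed in seconds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

STUCK_STATUSES = {"Scoring", "Running"}


@dataclass
class SubmissionSnapshot:
    """Hypothetical stand-in for the fields the reaper reads from a Submission."""
    status: str
    started_when: Optional[datetime]
    execution_time_limit: int  # assumed to be in seconds


def is_stuck(sub: SubmissionSnapshot, now: datetime, threshold_minutes: int = 30) -> bool:
    """True once `now` is past started_when + execution_time_limit + threshold."""
    if sub.status not in STUCK_STATUSES or sub.started_when is None:
        return False
    deadline = (
        sub.started_when
        + timedelta(seconds=sub.execution_time_limit)
        + timedelta(minutes=threshold_minutes)
    )
    return now > deadline


# Example: started at midnight UTC with a 10-minute limit and a 30-minute threshold.
started = datetime(2026, 1, 1, tzinfo=timezone.utc)
example = SubmissionSnapshot(status="Scoring", started_when=started, execution_time_limit=600)
```

For `example`, the deadline falls 40 minutes after `started`, so the submission only becomes eligible for re-dispatch after that point.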

M1.B — `compute_worker/compute_worker.py`

- `requests` session adapter: `Retry(total=5, backoff_factor=2, status_forcelist=(500,502,503,504), allowed_methods=frozenset((..., PATCH)))`.
- `_update_status` now re-raises when `status == FINISHED` so Celery (with `task_acks_late=True`) requeues the job.
- Intermediate states (`Running`, `Scoring`) stay best-effort to avoid killing a scoring run for a transient glitch.
- `Failed` already has a `raise` in every caller, so the requeue path is covered.
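
The split between best-effort and terminal updates can be sketched as follows. This is a simplified illustration rather than the actual `_update_status`: the `update_status` function, the `send` callable, and the status strings are assumptions introduced for the example.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

# Only the terminal success state propagates failures in this sketch;
# per the description, callers already raise on the Failed path.
TERMINAL_STATUSES = {"Finished"}


def update_status(status: str, send: Callable[[str], None]) -> None:
    """Send a status update; swallow errors unless the status is terminal."""
    try:
        send(status)
    except Exception:
        if status in TERMINAL_STATUSES:
            # Let the exception reach the task runner instead of hiding it.
            raise
        # Intermediate states are best-effort: log and keep the run alive.
        logger.warning("Best-effort status update to %s failed", status, exc_info=True)


def failing_send(status: str) -> None:
    """Hypothetical transport that always fails, used to exercise both paths."""
    raise ConnectionError(f"could not deliver status {status}")
```

Calling `update_status("Running", failing_send)` returns normally, while `update_status("Finished", failing_send)` raises `ConnectionError`. The sketch only covers the re-raise decision; whether the broker redelivers the task afterwards depends on the Celery acknowledgement settings in use (for example `task_acks_on_failure_or_timeout`), which this example does not model.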

Didayolo · 2026-06-15T09:56:39Z

Hi @AybH26,

Thank you for your contribution.

What was the protocol used to reproduce the problem, and to confirm that this change solves it?

To my knowledge, this problem was solved by #2223.

IdirLISN · 2026-06-15T10:01:35Z

"CELERY_BEAT_SCHEDULE published refresh_compute_worker_health (does not exist) → KeyError killed the consumer channel"

This is correct. It is an artefact of the last compute worker monitoring deployment, and it is fixed in the upcoming worker monitoring PR (#2395).

fix(submissions): prevent silent submission losses

b57a0ea

AybH26 wants to merge 1 commit into codalab:develop from AybH26:develop
