fix(submissions): prevent silent submission losses #2408

Open
AybH26 wants to merge 1 commit into codalab:develop from AybH26:develop

AybH26 commented Jun 15, 2026

Title

fix(submissions): prevent silent submission losses

Summary

Fixes the silent submission-loss incident first reported in 2025 (~30 %
of submissions stuck in Submitting/Submitted/Scoring with no
traceable error).

Root causes

The incident was not a single bug but a chain of five compounding issues
across the Django site worker, Celery beat, and the compute worker.
Each could silently drop a submission on its own; combined, they turned
idle traffic into a ~30 % loss rate.

| ID | Layer | Bug |
| --- | --- | --- |
| F1 | site_worker | watchmedo wrapped Celery prefork → SIGTERM not forwarded → channel torn down without acking the in-flight run_submission |
| F2 | beat | CELERY_BEAT_SCHEDULE published refresh_compute_worker_health (does not exist) → KeyError killed the consumer channel |
| F3 | site_worker | 256 MB memory limit → worker OOM-killed under prefork(2) + beat → orphan Submitting rows with celery_task_id=None |
| M1.A | Django watchdog | No recovery path for submissions stuck in Scoring/Running after a worker crash; they stayed stuck forever |
| M1.B | compute_worker | (a) urllib3.Retry excludes PATCH by default, so the final PATCH status=Finished was never retried. (b) _update_status swallowed every exception, so task_acks_late=True acked tasks whose terminal update had failed. |
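
The PATCH exclusion described for M1.B (a) can be checked directly against urllib3, and opting PATCH in is a one-argument change. The following is a minimal sketch rather than the PR's actual code: it assumes urllib3 1.26 or later (where the parameter is named `allowed_methods`), `build_session` is a hypothetical helper name, and the method set used here (urllib3's defaults plus PATCH) is an assumption because the PR text elides the exact list.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# urllib3 only retries methods it considers idempotent unless told otherwise.
print("PATCH" in Retry.DEFAULT_ALLOWED_METHODS)  # False


def build_session() -> requests.Session:
    """Return a session whose adapter also retries PATCH (hypothetical helper)."""
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=(500, 502, 503, 504),
        # Assumption: keep the library defaults and add PATCH on top.
        allowed_methods=frozenset(Retry.DEFAULT_ALLOWED_METHODS | {"PATCH"}),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()
```

Retrying PATCH is only safe when the request is effectively idempotent; setting a status to the same terminal value twice would typically qualify, but that is a property of the server endpoint and not of this sketch.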

Changes

F1 — docker-compose.yml

  • Drop the watchmedo wrapper around the Celery prefork pool on site_worker.
  • Subscribe to both site-worker and the default celery queue so beat-published tasks land where the worker consumes.

F2 — src/settings/base.py

  • Comment out the refresh_compute_worker_health beat entry (task does not exist anywhere in the codebase).
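
A stale beat entry like this one can be caught before deploy by comparing the task names in the schedule with the names the worker registers. The sketch below is an illustration and not part of this PR: `find_unregistered_beat_tasks` is a hypothetical helper, the schedule uses Celery's usual `{"entry": {"task": ..., "schedule": ...}}` shape, and the dotted path given for the reaper task is made up.

```python
from typing import Iterable, List, Mapping


def find_unregistered_beat_tasks(
    beat_schedule: Mapping[str, Mapping[str, object]],
    registered_tasks: Iterable[str],
) -> List[str]:
    """Return the beat entry names whose "task" is not a registered task name."""
    registered = set(registered_tasks)
    return sorted(
        entry_name
        for entry_name, entry in beat_schedule.items()
        if entry.get("task") not in registered
    )


# Hypothetical schedule and task registry, for illustration only.
schedule = {
    "reaper": {"task": "competitions.tasks.reaper_stuck_scoring", "schedule": 300.0},
    "worker_health": {"task": "refresh_compute_worker_health", "schedule": 60.0},
}
registered = ["competitions.tasks.reaper_stuck_scoring"]

missing = find_unregistered_beat_tasks(schedule, registered)
print(missing)  # ['worker_health']
```

In a real deployment the registered names could plausibly be taken from the Celery app's task registry (`app.tasks`), and the check run as a unit test or a startup assertion.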

F3 — docker-compose.yml

  • Raise site_worker memory limit from 256M to 1G.

M1.A — src/apps/competitions/tasks.py + src/settings/base.py

  • New reaper_stuck_scoring Celery task on site-worker queue, soft limit 5 min.
  • Re-dispatches any submission stuck in Scoring/Running past started_when + execution_time_limit + threshold_minutes.
  • Scheduled every 5 min in CELERY_BEAT_SCHEDULE with threshold_minutes=30.
  • Annotates Submission.status_details with a reaper marker for /admin and API visibility.
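
The re-dispatch condition above reduces to a single deadline comparison. This is a minimal sketch of that arithmetic only, not the PR's task body: `is_stuck` is a hypothetical function name, and treating `execution_time_limit` as a number of seconds is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(
    started_when: datetime,
    execution_time_limit_s: int,
    threshold_minutes: int,
    now: datetime,
) -> bool:
    """True once `now` is past started_when + execution_time_limit + threshold."""
    deadline = (
        started_when
        + timedelta(seconds=execution_time_limit_s)  # assumption: limit is in seconds
        + timedelta(minutes=threshold_minutes)
    )
    return now > deadline


started = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)

# 600 s limit + 30 min threshold puts the deadline at 12:40 UTC.
print(is_stuck(started, 600, 30, now=started + timedelta(minutes=39)))  # False
print(is_stuck(started, 600, 30, now=started + timedelta(minutes=41)))  # True
```

With a 5-minute beat interval, a submission that crosses the deadline would be picked up on the next run, so the worst-case delay before re-dispatch is roughly the deadline plus 5 minutes.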

M1.B — compute_worker/compute_worker.py

  • requests session adapter: Retry(total=5, backoff_factor=2, status_forcelist=(500,502,503,504), allowed_methods=frozenset((..., PATCH))).
  • _update_status now re-raises when status == FINISHED so Celery (with task_acks_late=True) requeues the job.
    Intermediate states (Running, Scoring) stay best-effort to avoid killing a scoring run for a transient glitch.
    Failed already has a raise in every caller, so the requeue path is covered.
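
The split between best-effort intermediate updates and a strict terminal update can be sketched as follows. This is a simplified illustration and not the PR's `_update_status`: `update_status`, the `send` callable (standing in for whatever performs the PATCH), and the status strings are all assumptions.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

TERMINAL_STATUS = "Finished"  # assumption: string value of the terminal state


def update_status(send: Callable[[str], None], status: str) -> None:
    """Send a status update, re-raising only when the terminal update fails."""
    try:
        send(status)
    except Exception:
        if status == TERMINAL_STATUS:
            # Let the failure propagate so the task does not end as a success.
            raise
        # Intermediate states are best-effort: log and keep the run alive.
        logger.warning("Status update %r failed; continuing", status, exc_info=True)


def flaky_send(status: str) -> None:
    """Hypothetical transport that always fails, for demonstration."""
    raise ConnectionError(f"could not send {status}")


update_status(flaky_send, "Running")  # logged, not raised

try:
    update_status(flaky_send, "Finished")
except ConnectionError:
    print("terminal update failed and was re-raised")
```

Whether a raised exception actually leads to redelivery depends on the Celery configuration (for example `task_acks_on_failure_or_timeout`), which this sketch does not model.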

@Didayolo

Member

Hi @AybH26,

Thank you for your contribution.

What was the protocol used to reproduce the problem, and to confirm that this change solves it?

To my knowledge, this problem was solved by #2223.

IdirLISN commented Jun 15, 2026

Collaborator

"CELERY_BEAT_SCHEDULE published refresh_compute_worker_health (does not exist) → KeyError killed the consumer channel"

This is correct: it is an artefact from the last compute worker monitoring deployment, and it is fixed in the next worker monitoring PR (#2395).

3 participants