feat: add subagent supervision, Docker isolation, final SKILL.md wiring #27

Merged
magnus merged 1 commit from feat/supervision-docker-skill-wiring into main 2026-05-23 17:17:02 -04:00
Owner

Phase 3 — Closes #22

references/subagent-experiment-supervision.md

  • Architecture: orchestrator spawns supervisor per experiment worker
  • 10-entry failure catalog (CUDA OOM, NaN loss, ImportError, Disk full, Timeout, etc.)
  • Python fix implementations (reduce batch size, pip install, fallback to CPU, gradient clipping)
  • Escalation path with Telegram notification
  • Harness-specific notes (Hermes delegate_task, OpenCode fallback)

references/docker-experiment-isolation.md

  • Resource limit reference table
  • Log collection pattern (stdout-based)
  • Multi-container sweep with GPU sequencing
  • Cleanup and Compose patterns
  • Fallback when Docker unavailable

scripts/Dockerfile — test image for the full suite

SKILL.md updates:

  • New CAMPAIGN type in question classifier
  • Principle #9: Know your compute
  • Infrastructure Awareness section with detect-compute integration
  • All new references in Available Resources
  • Updated compatibility field (PyTorch/sklearn as primary, hardware-aware)

Test results: 22/22 passing (supervision + Docker build)

**Phase 3 — Closes #22** **`references/subagent-experiment-supervision.md`** - Architecture: orchestrator spawns supervisor per experiment worker - 10-entry failure catalog (CUDA OOM, NaN loss, ImportError, Disk full, Timeout, etc.) - Python fix implementations (reduce batch size, pip install, fallback to CPU, gradient clipping) - Escalation path with Telegram notification - Harness-specific notes (Hermes delegate_task, OpenCode fallback) **`references/docker-experiment-isolation.md`** - Resource limit reference table - Log collection pattern (stdout-based) - Multi-container sweep with GPU sequencing - Cleanup and Compose patterns - Fallback when Docker unavailable **`scripts/Dockerfile`** — test image for the full suite **`SKILL.md` updates:** - New CAMPAIGN type in question classifier - Principle #9: Know your compute - Infrastructure Awareness section with detect-compute integration - All new references in Available Resources - Updated compatibility field (PyTorch/sklearn as primary, hardware-aware) **Test results:** 22/22 passing (supervision + Docker build)
Three researched references validated against current API docs:

- references/pytorch-integration.md: device management, training loops,
  AMP, torch.compile, transfer learning, LoRA, distillation, pruning,
  DDP, debugging (validated against PyTorch 2.12 docs)

- references/sklearn-integration.md: pipelines, ColumnTransformer,
  model selection, ensembles, calibration, imbalanced data, custom
  estimators, feature selection (validated against sklearn 1.8.0 docs)

- references/data-science-coding-workflow.md: project structure,
  config management, experiment logging (MLflow/TensorBoard/WandB),
  result serialization, reproducibility, data versioning, unit testing

66/66 validation tests passing.

Closes #23
Closes #22

Features:
- references/subagent-experiment-supervision.md: self-healing experiment
  pattern with 10-failure catalog, auto-fix implementations, escalation
  to Telegram, and harness-specific notes
- references/docker-experiment-isolation.md: resource limits, log
  collection, multi-container sweeps, cleanup patterns, Docker Compose
- scripts/Dockerfile: test image for the skill's Docker-based tests
- SKILL.md: CAMPAIGN type in question classifier, Principle #9,
  Infrastructure Awareness section, all new references in Available
  Resources, updated compatibility field

Test results: 22/22 passing (supervision + Docker build)
magnus merged commit 9ebca3e42b into main 2026-05-23 17:17:02 -04:00
jasper left a comment

First-Pass Code Review — Jasper (automated)

Overall

Solid PR with well-structured references and thorough test coverage. The supervision architecture and Docker isolation docs are production-ready in terms of content. A few issues to address before merge.

Security

  • Medium: fix_pip_install uses log-derived module name directly in pip install — If an attacker controls experiment log content, they could inject arbitrary package names. Mitigation: validate the extracted module name against a known-safe pattern (e.g., ^[a-zA-Z_][a-zA-Z0-9_.-]*$).
  • Silent pip install failuresubprocess.run(..., check=False) means install failures are silently swallowed. Add check=True or at minimum log the failure.

Logic Errors

  • Critical: Missing apply_fix() dispatcher — The supervision loop calls apply_fix(fix, experiment_cmd) but no apply_fix function is defined in the diff. The fix functions are named fix_reduce_batch_size, fix_gradient_clipping, etc. A dispatch dict mapping fix names to functions is needed.
  • fix_pip_install has incompatible signature — Returns str while all other fix functions return list (modified cmd). This would break a generic apply_fix dispatcher. It also doesn't modify the command at all — only installs the package and returns a message string, but the retry loop expects a modified command list.
  • supervise_experiment — The monitoring loop opens log_path in write mode ("w") which truncates the file. If the subprocess is slow to start writing (e.g., long Python import time), a fast supervisor cycle could read an empty truncated log. Not a critical bug but worth noting.

Style & Quality

  • test_supervision_protocol.sh — The failure-count test has a fragile || chain: if grep -c '| \*\*' matches markdown formatting incidentally, the test passes regardless of count. Consider a more precise regex for the failure table row detection.
  • Excellent — Consistent error messages, set -euo pipefail on all scripts, structured pitfall tables, cross-referencing between files.
  • SKILL.md — The Infra Awareness section cleanly integrates compute detection before method selection. Good approach.

Suggestions

  1. Add apply_fix dispatch dict to the supervision code
  2. Reconcile fix_pip_install signature with the other fix functions
  3. Pin the Docker image to a specific Python version digest for reproducibility
  4. Consider adding a scripts/apply_fix.py as a standalone utility

Review by Jasper (automated first-pass)

## First-Pass Code Review — Jasper (automated) ### Overall Solid PR with well-structured references and thorough test coverage. The supervision architecture and Docker isolation docs are production-ready in terms of content. A few issues to address before merge. ### Security - **Medium: `fix_pip_install` uses log-derived module name directly in `pip install`** — If an attacker controls experiment log content, they could inject arbitrary package names. Mitigation: validate the extracted module name against a known-safe pattern (e.g., `^[a-zA-Z_][a-zA-Z0-9_.-]*$`). - **Silent pip install failure** — `subprocess.run(..., check=False)` means install failures are silently swallowed. Add `check=True` or at minimum log the failure. ### Logic Errors - **Critical: Missing `apply_fix()` dispatcher** — The supervision loop calls `apply_fix(fix, experiment_cmd)` but no `apply_fix` function is defined in the diff. The fix functions are named `fix_reduce_batch_size`, `fix_gradient_clipping`, etc. A dispatch dict mapping fix names to functions is needed. - **`fix_pip_install` has incompatible signature** — Returns `str` while all other fix functions return `list` (modified cmd). This would break a generic `apply_fix` dispatcher. It also doesn't modify the command at all — only installs the package and returns a message string, but the retry loop expects a modified command list. - **`supervise_experiment`** — The monitoring loop opens `log_path` in write mode (`"w"`) which truncates the file. If the subprocess is slow to start writing (e.g., long Python import time), a fast supervisor cycle could read an empty truncated log. Not a critical bug but worth noting. ### Style & Quality - **`test_supervision_protocol.sh`** — The failure-count test has a fragile `||` chain: if `grep -c '| \*\*'` matches markdown formatting incidentally, the test passes regardless of count. Consider a more precise regex for the failure table row detection. - **Excellent** — Consistent error messages, `set -euo pipefail` on all scripts, structured pitfall tables, cross-referencing between files. - **SKILL.md** — The Infra Awareness section cleanly integrates compute detection before method selection. Good approach. ### Suggestions 1. Add `apply_fix` dispatch dict to the supervision code 2. Reconcile `fix_pip_install` signature with the other fix functions 3. Pin the Docker image to a specific Python version digest for reproducibility 4. Consider adding a `scripts/apply_fix.py` as a standalone utility --- *Review by Jasper (automated first-pass)*
jasper left a comment

Jasper (automated review)

First-pass code review for PR #27.

Security

  1. Telegram token in code example — The escalation section in subagent-experiment-supervision.md shows a Telegram notification pattern with inline placeholders. Recommend env-var-based pattern instead.
  2. fix_fallback_cpu() mutates global os.environ — Modifying os.environ["CUDA_VISIBLE_DEVICES"] is a global side effect. Use subprocess.Popen(env=...) with a modified copy instead.

Logic / Correctness

  1. --storage-opt size=50GB is storage-driver-dependent — Only supported on devicemapper/btrfs/zfs. On overlay2 (default on modern Linux), not supported at docker run time. Consider noting this limitation.
  2. GPU sequencing example has unbound variableNUM_GPUS is never defined in for trial in {1..8}; do GPU_ID=$((trial % NUM_GPUS)).
  3. Cumulative wall-time timeoutstart_time is captured once before the retry loop, so timeout counts across all attempts including time spent applying fixes. May be intentional but should be documented.

Style / Docs

  1. detect-compute.py referenced but not in this PR — SKILL.md lists it under Available Resources but it is not included here, creating a broken reference.
  2. fix_reduce_batch_size() limited CLI variant coverage — Only handles --batch-size and --batch_size. Many frameworks use -b, --batch, or env vars. The fallback (appending --batch_size 16) may not work with all scripts.
  3. Test coverage is content-based, not executable — 22 tests verify file existence and grep patterns, not that the Python supervision code actually runs. Consider adding a smoke test.

Review by Jasper (automated review bot). Findings for reference — no action required on already-merged PR.

## Jasper (automated review) First-pass code review for PR #27. ### Security 1. **Telegram token in code example** — The escalation section in `subagent-experiment-supervision.md` shows a Telegram notification pattern with inline placeholders. Recommend env-var-based pattern instead. 2. **`fix_fallback_cpu()` mutates global os.environ** — Modifying `os.environ["CUDA_VISIBLE_DEVICES"]` is a global side effect. Use `subprocess.Popen(env=...)` with a modified copy instead. ### Logic / Correctness 3. **`--storage-opt size=50GB` is storage-driver-dependent** — Only supported on devicemapper/btrfs/zfs. On overlay2 (default on modern Linux), not supported at `docker run` time. Consider noting this limitation. 4. **GPU sequencing example has unbound variable** — `NUM_GPUS` is never defined in `for trial in {1..8}; do GPU_ID=$((trial % NUM_GPUS))`. 5. **Cumulative wall-time timeout** — `start_time` is captured once before the retry loop, so timeout counts across all attempts including time spent applying fixes. May be intentional but should be documented. ### Style / Docs 6. **`detect-compute.py` referenced but not in this PR** — SKILL.md lists it under Available Resources but it is not included here, creating a broken reference. 7. **`fix_reduce_batch_size()` limited CLI variant coverage** — Only handles `--batch-size` and `--batch_size`. Many frameworks use `-b`, `--batch`, or env vars. The fallback (appending `--batch_size 16`) may not work with all scripts. 8. **Test coverage is content-based, not executable** — 22 tests verify file existence and grep patterns, not that the Python supervision code actually runs. Consider adding a smoke test. --- _Review by Jasper (automated review bot). Findings for reference — no action required on already-merged PR._
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
magnus/agent-skills!27
No description provided.