magnus/agent-skills

Fork 0

feat: add subagent supervision, Docker isolation, final SKILL.md wiring #27

Merged

magnus merged 1 commit from feat/supervision-docker-skill-wiring into main

2026-05-23 17:17:02 -04:00

magnus commented

2026-05-23 17:12:53 -04:00

Owner

Phase 3 — Closes #22

references/subagent-experiment-supervision.md

Architecture: orchestrator spawns supervisor per experiment worker
10-entry failure catalog (CUDA OOM, NaN loss, ImportError, Disk full, Timeout, etc.)
Python fix implementations (reduce batch size, pip install, fallback to CPU, gradient clipping)
Escalation path with Telegram notification
Harness-specific notes (Hermes delegate_task, OpenCode fallback)

references/docker-experiment-isolation.md

Resource limit reference table
Log collection pattern (stdout-based)
Multi-container sweep with GPU sequencing
Cleanup and Compose patterns
Fallback when Docker unavailable

scripts/Dockerfile — test image for the full suite

SKILL.md updates:

New CAMPAIGN type in question classifier
Principle #9: Know your compute
Infrastructure Awareness section with detect-compute integration
All new references in Available Resources
Updated compatibility field (PyTorch/sklearn as primary, hardware-aware)

Test results: 22/22 passing (supervision + Docker build)

**Phase 3 — Closes #22** **`references/subagent-experiment-supervision.md`** - Architecture: orchestrator spawns supervisor per experiment worker - 10-entry failure catalog (CUDA OOM, NaN loss, ImportError, Disk full, Timeout, etc.) - Python fix implementations (reduce batch size, pip install, fallback to CPU, gradient clipping) - Escalation path with Telegram notification - Harness-specific notes (Hermes delegate_task, OpenCode fallback) **`references/docker-experiment-isolation.md`** - Resource limit reference table - Log collection pattern (stdout-based) - Multi-container sweep with GPU sequencing - Cleanup and Compose patterns - Fallback when Docker unavailable **`scripts/Dockerfile`** — test image for the full suite **`SKILL.md` updates:** - New CAMPAIGN type in question classifier - Principle #9: Know your compute - Infrastructure Awareness section with detect-compute integration - All new references in Available Resources - Updated compatibility field (PyTorch/sklearn as primary, hardware-aware) **Test results:** 22/22 passing (supervision + Docker build)

magnus added 3 commits

2026-05-23 17:12:53 -04:00

chore: resolve merge conflict after pulling origin main (data-scientist merged, epub added locally) 5ed56721c6

feat: add researched code integration references — PyTorch, sklearn, DS workflow dc08b67203

Three researched references validated against current API docs:

- references/pytorch-integration.md: device management, training loops,
  AMP, torch.compile, transfer learning, LoRA, distillation, pruning,
  DDP, debugging (validated against PyTorch 2.12 docs)

- references/sklearn-integration.md: pipelines, ColumnTransformer,
  model selection, ensembles, calibration, imbalanced data, custom
  estimators, feature selection (validated against sklearn 1.8.0 docs)

- references/data-science-coding-workflow.md: project structure,
  config management, experiment logging (MLflow/TensorBoard/WandB),
  result serialization, reproducibility, data versioning, unit testing

66/66 validation tests passing.

Closes #23

feat: add subagent supervision, Docker isolation, final SKILL.md wiring b601ba0f13

Closes #22

Features:
- references/subagent-experiment-supervision.md: self-healing experiment
  pattern with 10-failure catalog, auto-fix implementations, escalation
  to Telegram, and harness-specific notes
- references/docker-experiment-isolation.md: resource limits, log
  collection, multi-container sweeps, cleanup patterns, Docker Compose
- scripts/Dockerfile: test image for the skill's Docker-based tests
- SKILL.md: CAMPAIGN type in question classifier, Principle #9,
  Infrastructure Awareness section, all new references in Available
  Resources, updated compatibility field

Test results: 22/22 passing (supervision + Docker build)

magnus referenced this pull request

2026-05-23 17:13:06 -04:00

Data-Scientist Skill v2: Experimental Campaign Pipeline + Infrastructure Awareness + Subagent Supervision #22

magnus merged commit 9ebca3e42b into main

2026-05-23 17:17:02 -04:00

magnus referenced this pull request from a commit

2026-05-23 17:17:03 -04:00

Merge pull request 'feat: add subagent supervision, Docker isolation, final SKILL.md wiring' (#27) from feat/supervision-docker-skill-wiring into main

jasper reviewed

2026-05-23 17:30:24 -04:00

jasper left a comment

First-Pass Code Review — Jasper (automated)

Overall

Solid PR with well-structured references and thorough test coverage. The supervision architecture and Docker isolation docs are production-ready in terms of content. A few issues to address before merge.

Security

Medium: fix_pip_install uses log-derived module name directly in pip install — If an attacker controls experiment log content, they could inject arbitrary package names. Mitigation: validate the extracted module name against a known-safe pattern (e.g., ^[a-zA-Z_][a-zA-Z0-9_.-]*$).
Silent pip install failure — subprocess.run(..., check=False) means install failures are silently swallowed. Add check=True or at minimum log the failure.

Logic Errors

Critical: Missing apply_fix() dispatcher — The supervision loop calls apply_fix(fix, experiment_cmd) but no apply_fix function is defined in the diff. The fix functions are named fix_reduce_batch_size, fix_gradient_clipping, etc. A dispatch dict mapping fix names to functions is needed.
fix_pip_install has incompatible signature — Returns str while all other fix functions return list (modified cmd). This would break a generic apply_fix dispatcher. It also doesn't modify the command at all — only installs the package and returns a message string, but the retry loop expects a modified command list.
supervise_experiment — The monitoring loop opens log_path in write mode ("w") which truncates the file. If the subprocess is slow to start writing (e.g., long Python import time), a fast supervisor cycle could read an empty truncated log. Not a critical bug but worth noting.

Style & Quality

test_supervision_protocol.sh — The failure-count test has a fragile || chain: if grep -c '| \*\*' matches markdown formatting incidentally, the test passes regardless of count. Consider a more precise regex for the failure table row detection.
Excellent — Consistent error messages, set -euo pipefail on all scripts, structured pitfall tables, cross-referencing between files.
SKILL.md — The Infra Awareness section cleanly integrates compute detection before method selection. Good approach.

Suggestions

Add apply_fix dispatch dict to the supervision code
Reconcile fix_pip_install signature with the other fix functions
Pin the Docker image to a specific Python version digest for reproducibility
Consider adding a scripts/apply_fix.py as a standalone utility

Review by Jasper (automated first-pass)

## First-Pass Code Review — Jasper (automated) ### Overall Solid PR with well-structured references and thorough test coverage. The supervision architecture and Docker isolation docs are production-ready in terms of content. A few issues to address before merge. ### Security - **Medium: `fix_pip_install` uses log-derived module name directly in `pip install`** — If an attacker controls experiment log content, they could inject arbitrary package names. Mitigation: validate the extracted module name against a known-safe pattern (e.g., `^[a-zA-Z_][a-zA-Z0-9_.-]*$`). - **Silent pip install failure** — `subprocess.run(..., check=False)` means install failures are silently swallowed. Add `check=True` or at minimum log the failure. ### Logic Errors - **Critical: Missing `apply_fix()` dispatcher** — The supervision loop calls `apply_fix(fix, experiment_cmd)` but no `apply_fix` function is defined in the diff. The fix functions are named `fix_reduce_batch_size`, `fix_gradient_clipping`, etc. A dispatch dict mapping fix names to functions is needed. - **`fix_pip_install` has incompatible signature** — Returns `str` while all other fix functions return `list` (modified cmd). This would break a generic `apply_fix` dispatcher. It also doesn't modify the command at all — only installs the package and returns a message string, but the retry loop expects a modified command list. - **`supervise_experiment`** — The monitoring loop opens `log_path` in write mode (`"w"`) which truncates the file. If the subprocess is slow to start writing (e.g., long Python import time), a fast supervisor cycle could read an empty truncated log. Not a critical bug but worth noting. ### Style & Quality - **`test_supervision_protocol.sh`** — The failure-count test has a fragile `||` chain: if `grep -c '| \*\*'` matches markdown formatting incidentally, the test passes regardless of count. Consider a more precise regex for the failure table row detection. - **Excellent** — Consistent error messages, `set -euo pipefail` on all scripts, structured pitfall tables, cross-referencing between files. - **SKILL.md** — The Infra Awareness section cleanly integrates compute detection before method selection. Good approach. ### Suggestions 1. Add `apply_fix` dispatch dict to the supervision code 2. Reconcile `fix_pip_install` signature with the other fix functions 3. Pin the Docker image to a specific Python version digest for reproducibility 4. Consider adding a `scripts/apply_fix.py` as a standalone utility --- *Review by Jasper (automated first-pass)*

jasper reviewed

2026-05-23 17:34:40 -04:00

jasper left a comment

Jasper (automated review)

First-pass code review for PR #27.

Security

Telegram token in code example — The escalation section in subagent-experiment-supervision.md shows a Telegram notification pattern with inline placeholders. Recommend env-var-based pattern instead.
fix_fallback_cpu() mutates global os.environ — Modifying os.environ["CUDA_VISIBLE_DEVICES"] is a global side effect. Use subprocess.Popen(env=...) with a modified copy instead.

Logic / Correctness

--storage-opt size=50GB is storage-driver-dependent — Only supported on devicemapper/btrfs/zfs. On overlay2 (default on modern Linux), not supported at docker run time. Consider noting this limitation.
GPU sequencing example has unbound variable — NUM_GPUS is never defined in for trial in {1..8}; do GPU_ID=$((trial % NUM_GPUS)).
Cumulative wall-time timeout — start_time is captured once before the retry loop, so timeout counts across all attempts including time spent applying fixes. May be intentional but should be documented.

Style / Docs

detect-compute.py referenced but not in this PR — SKILL.md lists it under Available Resources but it is not included here, creating a broken reference.
fix_reduce_batch_size() limited CLI variant coverage — Only handles --batch-size and --batch_size. Many frameworks use -b, --batch, or env vars. The fallback (appending --batch_size 16) may not work with all scripts.
Test coverage is content-based, not executable — 22 tests verify file existence and grep patterns, not that the Python supervision code actually runs. Consider adding a smoke test.

Review by Jasper (automated review bot). Findings for reference — no action required on already-merged PR.

## Jasper (automated review) First-pass code review for PR #27. ### Security 1. **Telegram token in code example** — The escalation section in `subagent-experiment-supervision.md` shows a Telegram notification pattern with inline placeholders. Recommend env-var-based pattern instead. 2. **`fix_fallback_cpu()` mutates global os.environ** — Modifying `os.environ["CUDA_VISIBLE_DEVICES"]` is a global side effect. Use `subprocess.Popen(env=...)` with a modified copy instead. ### Logic / Correctness 3. **`--storage-opt size=50GB` is storage-driver-dependent** — Only supported on devicemapper/btrfs/zfs. On overlay2 (default on modern Linux), not supported at `docker run` time. Consider noting this limitation. 4. **GPU sequencing example has unbound variable** — `NUM_GPUS` is never defined in `for trial in {1..8}; do GPU_ID=$((trial % NUM_GPUS))`. 5. **Cumulative wall-time timeout** — `start_time` is captured once before the retry loop, so timeout counts across all attempts including time spent applying fixes. May be intentional but should be documented. ### Style / Docs 6. **`detect-compute.py` referenced but not in this PR** — SKILL.md lists it under Available Resources but it is not included here, creating a broken reference. 7. **`fix_reduce_batch_size()` limited CLI variant coverage** — Only handles `--batch-size` and `--batch_size`. Many frameworks use `-b`, `--batch`, or env vars. The fallback (appending `--batch_size 16`) may not work with all scripts. 8. **Test coverage is content-based, not executable** — 22 tests verify file existence and grep patterns, not that the Python supervision code actually runs. Consider adding a smoke test. --- _Review by Jasper (automated review bot). Findings for reference — no action required on already-merged PR._

No reviewers

No labels

community-feedback

enhancement

skill-upgrade

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

magnus/agent-skills!27

No description provided.

Rows
Columns