Data-Scientist Skill v2: Experimental Campaign Pipeline + Infrastructure Awareness + Subagent Supervision #22

Closed
opened 2026-05-23 16:53:58 -04:00 by magnus · 3 comments
Owner

Feature Suite: Data-Scientist Skill Upgrades

Based on feedback from neopabo (Nous Research Discord) who reviewed the existing skill. Three interlocking features that transform the skill from a statistical reference into an experimental research conductor.


1. Experimental Campaign Protocol

The current skill is strong on methodology ("which test do I use?") but silent on process ("how do I run a research campaign?"). Add a new protocol/workflow layer:

  • Baseline heuristic first — logistic regression, bagging/boosting, vanilla fully-connected nets as the floor
  • Bleeding-edge survey — use agentic research (arXiv search, paper ingestion) to find the most promising recent approaches
  • Moonshot experiments — apply the cutting-edge ideas, compare to baseline
  • Transfer learning — strongly consider pre-trained models as a starting point
  • Hyperparameter optimization — systematic Optuna-style search across model, data, and training params
  • Distillation — if using neural networks, distill performance into smaller models (especially relevant for transfer-learned models that are larger than needed)
  • Iteration — observe best results, continue from those, rinse repeat
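The hyperparameter-search and iteration steps above could be sketched roughly as follows. This is plain random search from the standard library standing in for an Optuna-style sampler, not Optuna itself. The search space, `toy_objective`, and all parameter names are hypothetical placeholders for "train a model and return a validation score".

```python
import math
import random

# Hypothetical search space; names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),  # sampled log-uniformly
    "batch_size": [16, 32, 64],
    "dropout": (0.0, 0.5),
}


def sample_params(rng):
    """Draw one configuration from SEARCH_SPACE."""
    lr_lo, lr_hi = SEARCH_SPACE["learning_rate"]
    d_lo, d_hi = SEARCH_SPACE["dropout"]
    return {
        "learning_rate": math.exp(rng.uniform(math.log(lr_lo), math.log(lr_hi))),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "dropout": rng.uniform(d_lo, d_hi),
    }


def run_search(objective, baseline_score, n_trials=20, seed=0):
    """Return (best_score, best_params); best_params is None if no trial beats the baseline."""
    rng = random.Random(seed)
    best_score, best_params = baseline_score, None
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:  # the baseline is the floor every trial must clear
            best_score, best_params = score, params
    return best_score, best_params


def toy_objective(params):
    """Stand-in for a real training run; peaks near lr=1e-2 and dropout=0.2."""
    lr_penalty = abs(math.log10(params["learning_rate"]) + 2) / 4
    return 1.0 - lr_penalty - abs(params["dropout"] - 0.2)
```

Treating the baseline score as the initial "best" keeps the comparison honest: a search that returns `None` for its parameters has not justified moving past the baseline.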

Deliverables:

  • New references/experimental-campaign-protocol.md — structured workflow with phase gates
  • Updated SKILL.md — add "Running a research campaign?" branch to the question classifier
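One possible shape for a "structured workflow with phase gates" is sketched below. The phase names follow the list above, but the gate conditions and the `state` keys (`baseline_score`, `candidate_papers`, `best_score`) are assumptions made for illustration, not part of the proposed protocol.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Phase:
    name: str
    gate: Callable[[dict], bool]  # True when the campaign may proceed past this phase


def run_campaign(phases, state):
    """Walk phases in order and stop at the first failed gate; return passed phase names."""
    passed = []
    for phase in phases:
        if not phase.gate(state):
            break
        passed.append(phase.name)
    return passed


# Hypothetical gates: each one checks that the previous work produced what the next step needs.
PHASES = [
    Phase("baseline", lambda s: s.get("baseline_score") is not None),
    Phase("survey", lambda s: bool(s.get("candidate_papers"))),
    Phase("moonshot", lambda s: s.get("best_score", float("-inf")) > s["baseline_score"]),
]
```

With a state such as `{"baseline_score": 0.8, "candidate_papers": ["paper-a"], "best_score": 0.7}`, the campaign stops at the moonshot gate because nothing has beaten the baseline yet.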

2. Infrastructure Awareness ("Know Your Compute")

neopabo: "You need to make the agent aware of its available compute. Like, check vram and GPU, CUDA availability, pytorch, etc."

The current skill lists Python dependencies but has no mechanism for the agent to understand what hardware it's working with. This is critical because:

  • Experiment design depends on what fits (7B vs 13B model, batch size, quantization)
  • Different hardware justifies different approaches (CPU-only -> no deep learning, 24GB VRAM -> full fine-tuning, 8GB -> LoRA/QLoRA)
  • The agent should self-constrain its recommendations based on detected hardware
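The hardware tiers in the list above could be encoded as a small lookup like the one below. The 8 GB and 24 GB thresholds come from that list; the function name and the sub-8 GB tier are assumptions, and in practice whether a given approach fits also depends on model size and precision.

```python
def recommend_strategy(vram_gb):
    """Map detected VRAM in GB (None for CPU-only) to a coarse training strategy."""
    if vram_gb is None or vram_gb <= 0:
        return "classical ML only (no deep learning)"
    if vram_gb >= 24:
        return "full fine-tuning"
    if vram_gb >= 8:
        return "LoRA/QLoRA"
    # Assumption: the source does not define a tier below 8 GB.
    return "small models or inference only"
```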

Deliverables:

  • New scripts/detect-compute.py — probes GPU model, VRAM, CUDA version, PyTorch availability, RAM, disk space, outputs structured JSON
  • New capability layer in SKILL.md — "Before running experiments, detect compute and constrain recommendations"
  • The agent reads detect-compute output and adjusts: "7B fits, 13B doesn't -> recommend LoRA, batch size 32, gradient checkpointing"
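A minimal sketch of what a probe like `scripts/detect-compute.py` might do is shown below. It is not the script that was merged. It assumes `nvidia-smi` is the GPU source, treats PyTorch as optional, and uses `os.sysconf` for RAM, which is not available on every platform. The JSON key names are hypothetical.

```python
import importlib.util
import json
import os
import shutil
import subprocess


def probe_gpus():
    """Return [{'name': str, 'vram_mb': int}] from nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    gpus = []
    for line in result.stdout.strip().splitlines():
        name, _, mem = line.rpartition(",")
        try:
            gpus.append({"name": name.strip(), "vram_mb": int(mem.strip())})
        except ValueError:
            continue  # skip rows where memory is not reported as a number
    return gpus


def probe_torch():
    """Report PyTorch and CUDA status without requiring PyTorch to be installed."""
    info = {"installed": False, "version": None,
            "cuda_available": False, "cuda_version": None}
    try:
        if importlib.util.find_spec("torch") is None:
            return info
        import torch
        info.update(
            installed=True,
            version=str(torch.__version__),
            cuda_available=bool(torch.cuda.is_available()),
            cuda_version=torch.version.cuda,
        )
    except Exception:
        pass  # a broken install is reported the same as a missing one
    return info


def total_ram_bytes():
    """Total physical RAM via sysconf; None on platforms that lack it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def detect_compute(path="."):
    return {
        "gpus": probe_gpus(),
        "torch": probe_torch(),
        "ram_bytes": total_ram_bytes(),
        "disk_free_bytes": shutil.disk_usage(path).free,
    }


if __name__ == "__main__":
    print(json.dumps(detect_compute(), indent=2))
```

Every probe degrades to an empty or `None` value instead of raising, so the agent can always parse the output and treat missing fields as "not available".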

3. Subagent Supervision for Self-Healing Experiments

neopabo: "You can use subagents to supervise and auto-repair experiments. This saves a RIDICULOUS amount of time. And if things break in a way that requires human intervention, I have them text me on Telegram."

This is the most powerful and most novel feature. The pattern:

  • For each experiment in a campaign, spawn a supervisor subagent alongside the worker
  • The supervisor watches logs for known failure patterns and auto-applies fixes:
    • OOM -> reduce batch size
    • CUDA OOM -> offload layers to CPU, enable gradient checkpointing
    • NaN loss -> gradient clipping, reduce learning rate, check for bad inputs
    • ImportError -> pip install missing package
    • CUDA version mismatch -> fall back to CPU, note the constraint
  • If unfixable -> send Telegram notification to the user with context and await intervention
  • The supervisor documents every fix applied so the final report includes a "what went wrong and how it was handled" section
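The failure catalog above might be represented as an ordered list of log patterns, as in this sketch. The fixes are taken from the list; the regular expressions are illustrative guesses at typical log lines, and the rule that any unmatched line containing "Error" is escalated is an assumption. A real supervisor would also need to apply the fixes and restart the worker.

```python
import re

# Ordered so that specific patterns win: CUDA OOM must be checked before generic OOM.
FAILURE_CATALOG = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.I),
     ["offload layers to CPU", "enable gradient checkpointing"]),
    ("oom", re.compile(r"\bMemoryError\b|out of memory", re.I),
     ["reduce batch size"]),
    ("nan_loss", re.compile(r"loss\b.*\bnan\b", re.I),
     ["enable gradient clipping", "reduce learning rate", "check for bad inputs"]),
    ("import_error", re.compile(r"(?:ModuleNotFoundError|ImportError): No module named"),
     ["pip install missing package"]),
    ("cuda_mismatch", re.compile(r"CUDA driver version is insufficient", re.I),
     ["fall back to CPU", "note the constraint"]),
]


def diagnose(line):
    """Return (failure_name, fixes) for the first matching pattern, else None."""
    for name, pattern, fixes in FAILURE_CATALOG:
        if pattern.search(line):
            return name, fixes
    return None


def supervise(log_lines):
    """Split log lines into a fix log (known failures) and escalations (unknown errors)."""
    fix_log, escalations = [], []
    for line in log_lines:
        match = diagnose(line)
        if match is not None:
            name, fixes = match
            fix_log.append({"line": line, "failure": name, "fixes": fixes})
        elif "Error" in line:
            escalations.append(line)  # unknown failure: needs human intervention
    return fix_log, escalations
```

The `fix_log` entries are what would feed the "what went wrong and how it was handled" section of the final report.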

Deliverables:

  • New references/subagent-experiment-supervision.md — pattern description, failure catalog with fixes, Telegram alert protocol
  • Updated SKILL.md — reference this pattern in the "Campaign" workflow path
  • Note: This depends on the agent framework supporting subagent delegation (Hermes delegate_task, OpenCode subagents, etc.) — should be framed as a harness-specific pattern
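For the Telegram alert protocol, one option is the Bot API `sendMessage` method. The sketch below only builds the request so it can be inspected or tested offline. The function name, the message layout, and the idea of attaching the last few log lines are assumptions. The bot token and chat ID would have to come from the user's own configuration.

```python
import json
import urllib.request

TELEGRAM_MAX_CHARS = 4096  # Telegram's per-message text limit


def build_telegram_alert(bot_token, chat_id, experiment, error, context_lines):
    """Build (but do not send) a sendMessage request describing an unfixable failure."""
    text = "\n".join(
        [f"Experiment '{experiment}' needs intervention", f"Error: {error}", "Recent log:"]
        + list(context_lines)
    )
    payload = json.dumps({"chat_id": chat_id, "text": text[:TELEGRAM_MAX_CHARS]})
    return urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would then be a single `urllib.request.urlopen(request, timeout=10)` call, after which the supervisor pauses the experiment and waits for the user.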

Related: Docker Isolation

neopabo: "Run experiments isolated in docker containers so they don't crash the entire PC."

A companion concern. Add guidance for Docker-based experiment isolation, container resource limits (memory, CPU, GPU device reservation), and log collection from containers.
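As a rough illustration of the resource limits mentioned above, the helpers below assemble `docker run` and `docker logs` command lines. The function names and default limits are hypothetical. The `--gpus` flag assumes the NVIDIA Container Toolkit is installed on the host.

```python
def docker_run_argv(image, command, name, memory="16g", cpus=4.0, gpu_device=None):
    """Build a detached `docker run` command with memory, CPU, and optional GPU limits."""
    argv = [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,   # hard cap on container RAM
        "--cpus", str(cpus),  # CPU quota, in cores
    ]
    if gpu_device is not None:
        argv += ["--gpus", f"device={gpu_device}"]  # reserve one specific GPU
    argv.append(image)
    argv += list(command)
    return argv


def docker_logs_argv(name):
    """Build the command a supervisor could use to stream the container's logs."""
    return ["docker", "logs", "--follow", name]
```

The argument lists can be passed to `subprocess.run` directly. The container is not started with `--rm`, so its logs remain readable after a crash until it is removed explicitly.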

Deliverable:

  • Included in the experimental-campaign protocol or as a standalone reference on containerized experiment runners

Implementation Order

  1. scripts/detect-compute.py — standalone, no dependencies on other features
  2. references/experimental-campaign-protocol.md — the workflow layer
  3. references/subagent-experiment-supervision.md — the agentic automation layer
  4. Wire everything together in SKILL.md decision framework
  5. (Optional) Docker isolation reference

Open Questions

  • Should the experimental campaign protocol be its own skill (e.g. research-campaign) that the data-scientist skill references, or live inside the data-scientist skill?
  • The subagent supervision pattern is harness-specific (Hermes delegate_task vs OpenCode vs Claude Code). Should we define an abstraction layer or document per-harness?
  • How much of the "bleeding-edge survey" step should be automated vs directed by the user?
Contributor

Triage by Jasper (automated)

Assigned to @magnus. No labels configured on this repo yet — consider adding a label taxonomy (e.g. enhancement, skill, data-science, discussion) for future triage workflows.

Assessment: Three well-scoped features with clear deliverables. The implementation order in the issue body is sound — detect-compute.py is standalone and delivers immediate value independent of the other two.

Recommendations on the open questions:

  1. Separate skill vs. within data-scientist? — The campaign protocol could live as its own research-campaign skill that the data-scientist skill references, mirroring how grokto-crawl is separate from web-search. This keeps the data-scientist skill focused on methodology (statistical decision tree) while the campaign skill owns workflow orchestration. They reference each other via SKILL.md cross-links.

  2. Harness abstraction for subagent supervision? — Given the user runs Hermes, start with a Hermes-native implementation using delegate_task. Document that pattern concretely in the reference. If interest emerges from other harnesses, add a companion doc at that point. Premature abstraction is more costly than porting later.

  3. Bleeding-edge survey automation level? — The survey should auto-run as a step in the campaign protocol but present its findings to the user for approval before proceeding to moonshot experiments. Automated discovery, human-directed selection. The agent does the reading; the human does the prioritizing.

Docker isolation — Worth including as a section in the campaign protocol rather than a standalone reference, since it ties directly to the experimental runner workflow.

No blocking issues found. Ready for feature work.

Contributor

Triage by Jasper (automated)

Assessment: Well-scoped feature suite with clear implementation order. All three features are self-consistent with the existing skill architecture.

Labels Applied

  • enhancement — new feature suite
  • skill-upgrade — targeted skill evolution
  • community-feedback — originated from neopabo Discord review

Implementation Order Confirmed

  1. scripts/detect-compute.py — standalone, zero-dependency, highest value per line count
  2. references/experimental-campaign-protocol.md — the workflow layer that ties methodology into process
  3. references/subagent-experiment-supervision.md — harness agentic pattern (Hermes delegate_task)
  4. Wire into SKILL.md decision framework
  5. Docker isolation reference (optional)

Open Questions for Magnus

  1. Skill separation: The campaign protocol is opinionated enough that it could be a standalone research-campaign skill that the data-scientist skill references. Worth splitting?
  2. Harness abstraction: The subagent supervision pattern is Hermes-specific (delegate_task). Should we define a protocol interface so the pattern ports to OpenCode/Codex, or document per-harness?
  3. Bleeding-edge survey automation: How much of the arXiv search/paper ingestion pipeline should be automated vs user-directed? The protocol could include an optional automated survey step that feeds LightRAG.

Prioritization suggestion: detect-compute.py can land as a standalone PR immediately — zero-risk and unlocks downstream features.

Author
Owner

All features implemented across 4 PRs:

| PR | Phase | Description |
|----|-------|-------------|
| #24 | P1 | scripts/detect-compute.py: hardware probing + Docker tests |
| #25 | P2 | references/experimental-campaign-protocol.md: 8-phase campaign workflow |
| #26 | P2b | references/pytorch-integration.md, sklearn-integration.md, data-science-coding-workflow.md |
| #27 | P3 | subagent-experiment-supervision.md, docker-experiment-isolation.md, SKILL.md wiring |

Closing — all deliverables complete.

Reference
magnus/agent-skills#22