feat: add data-scientist skill — PhD-level data science expertise for agents #21

Merged
magnus merged 1 commit from feature/data-scientist-skill into main 2026-05-22 16:36:38 -04:00

Summary

New skill: data-scientist — PhD-level expertise in data science, statistics, and machine learning, packaged as an Agent Skills compatible skill.

Files

  • data-scientist/SKILL.md — Decision framework, core competencies, statistical philosophy, 7-question classifier, communication standards
  • data-scientist/references/*.md — 5 reference documents: statistical methodology, experimental design, causal inference (DAGs + potential outcomes), regression modeling, Bayesian workflow
  • data-scientist/scripts/*.py — 5 automation scripts with Python default and --engine r flag for R output
  • data-scientist/assets/*.md — Analysis report and experimental plan templates
  • AGENTS.md — Added trigger table entry
  • README.md — Added skill index entry

Key design decisions

  • Decision framework classifies questions into 7 types (advice, analysis, research, design, review, methodology, clarify) with appropriate rigor per type
  • Causal inference covers both graphical (DAGs/Pearl) and potential outcomes (Rubin) traditions
  • Scripts support Python + R via --engine r flag
  • No personal infrastructure references — ready for global distribution
  • SKILL.md is 220 lines / ~3092 tokens (under spec limits)
PhD-level data science expertise with decision framework, five reference
documents (statistical methodology, experimental design, causal inference,
regression modeling, Bayesian workflow), five automation scripts (power
analysis, assumption diagnostics, model comparison, effect size calculator,
experimental design generator), and two report templates.

Python default with --engine r flag for R output. Dual language support.
magnus merged commit 975f295c94 into main 2026-05-22 16:36:38 -04:00
jasper left a comment

First-Pass Review: data-scientist skill

Overall, this is a well-constructed skill. The structure follows Agent Skills conventions, the SKILL.md is comprehensive at ~3K tokens, and the 5 scripts cover real needs for a data science analysis workflow. A few things to clean up before merge:

1. AGENTS.md table formatting (lines 60-62)

The three new/changed rows use || (double pipe) instead of | (single pipe) at the start. All other rows in the table — including the header and the surrounding unchanged rows — use single |. The double pipe creates an extra empty column in the rendered markdown table.

Fix: Change || to | on lines 60-62.
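A minimal sketch of that fix, assuming the affected rows start with `||` and nothing else in the row needs to change. The helper name and the sample rows are hypothetical, not taken from AGENTS.md:

```python
import re


def fix_leading_double_pipe(line: str) -> str:
    """Collapse a leading '||' into a single '|' on a markdown table row."""
    # Only the start of the row is touched; pipes elsewhere are left alone.
    return re.sub(r"^(\s*)\|\|", r"\1|", line)


# Hypothetical rows for illustration only.
rows = [
    "| skill | trigger |",
    "|| data-scientist | statistics questions |",
]
fixed_rows = [fix_leading_double_pipe(row) for row in rows]
```

Rows that already start with a single `|` pass through unchanged, so the helper is safe to run over the whole table.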

2. power-analysis.py: Dead code in solve_power_prop (line 88)

p_bar = (p1 + ratio * p2) / (1 + ratio) is computed but never used in the function body. If the intent was a pooled-variance formula, consider using it; otherwise remove the dead variable.

3. power-analysis.py: / 1 no-op (line 91)

p1 * (1 - p1) / 1 — dividing by 1 is a no-op that makes the formula harder to read. The ratio adjustment already handles unequal groups. Recommend simplifying to p1 * (1 - p1).
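For reference on items 2 and 3, here is a sketch of the two textbook normal-approximation formulas for a two-sided, two-proportion sample size. It is not the body of `solve_power_prop`: the function names, the signature, and the convention `ratio = n2 / n1` are assumptions for illustration. The unpooled version never needs `p_bar`; the pooled version uses it under the null hypothesis.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n1_unpooled(p1: float, p2: float, alpha: float = 0.05,
                power: float = 0.80, ratio: float = 1.0) -> int:
    """Group-1 sample size using unpooled variances (ratio assumed to be n2 / n1)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # No '/ 1' needed: the ratio only scales the second group's variance term.
    variance = p1 * (1 - p1) + p2 * (1 - p2) / ratio
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


def n1_pooled(p1: float, p2: float, alpha: float = 0.05,
              power: float = 0.80, ratio: float = 1.0) -> int:
    """Group-1 sample size using the pooled proportion under the null."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # This is where p_bar belongs if the pooled formula was the intent.
    p_bar = (p1 + ratio * p2) / (1 + ratio)
    null_sd = sqrt(p_bar * (1 - p_bar) * (1 + 1 / ratio))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2) / ratio)
    return ceil((z_alpha * null_sd + z_beta * alt_sd) ** 2 / (p1 - p2) ** 2)
```

Either choice is defensible; the point is that the function should commit to one of them rather than compute `p_bar` and then ignore it.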

4. Scripts: missing executable permissions

All 5 scripts have #!/usr/bin/env python3 shebangs but lack executable bits (chmod +x). Consider adding and committing.
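A small sketch for checking and setting the bits from Python, assuming a POSIX filesystem. The helper names are hypothetical, and the directory in the usage comment is taken from the PR's file list:

```python
import stat
from pathlib import Path


def missing_exec(scripts_dir: Path) -> list[Path]:
    """Return the *.py files in scripts_dir that lack the owner-executable bit."""
    return sorted(
        path for path in scripts_dir.glob("*.py")
        if not path.stat().st_mode & stat.S_IXUSR
    )


def add_exec_bits(path: Path) -> None:
    """Add execute permission for user, group, and others (like chmod +x)."""
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Usage (not executed here):
#   for script in missing_exec(Path("data-scientist/scripts")):
#       add_exec_bits(script)
```

Git records the executable bit, so the mode change still has to be committed; `git update-index --chmod=+x <file>` is an alternative on filesystems that do not preserve it.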

5. (Minor) supports-metadata naming

The metadata.skills list uses domain labels (statistics, machine-learning) that are not actual skill names in this repo. This is valid per spec (they are node labels, not cross-references), but worth noting for consistency with other skills in this repo.


Review by Jasper (automated review)

jasper left a comment

First-Pass Review: data-scientist skill

Overall, this is a well-constructed skill. The structure follows Agent Skills conventions, the SKILL.md is comprehensive at ~3K tokens, and the 5 scripts cover real needs. A few things to address:

1. AGENTS.md table formatting: Three rows use || instead of |.
2. power-analysis.py: Dead p_bar variable + confusing / 1 no-op.
3. Scripts: Missing executable permissions on all 5 scripts.
4. metadata.skills: Domain labels (statistics, machine-learning) not actual skill names.

See inline comments for line-level details.

Author
Contributor

Formatting: Double pipe || creates an extra empty column in the rendered markdown table. All other rows use single |. Change || to | on this and the two similar rows below.
Author
Contributor

Same issue — || should be | for consistent table rendering.

Author
Contributor

Same issue — || should be |.

jasper left a comment

Power analysis script review — two minor issues.

Author
Contributor

Dead code: p_bar is computed here but never used in the function body. If the intent was a pooled-variance formula, consider using it instead of the unpooled version. Otherwise remove this line.

Author
Contributor

No-op: p1 * (1 - p1) / 1 — dividing by 1 has no effect and makes the formula harder to read. The ratio parameter already handles unequal group sizes. Recommend simplifying to p1 * (1-p1).


Reference
magnus/agent-skills!21