magnus/agent-skills

Data-Scientist Skill: Researched PyTorch + Scikit-Learn + DS Coding Workflow References #23

Closed

opened 2026-05-23 16:59:52 -04:00 by magnus · 3 comments

magnus commented

2026-05-23 16:59:52 -04:00

Owner

Researched References: PyTorch + Scikit-Learn + Data Science Coding Workflow

The data-scientist skill currently provides strong methodology guidance (which test to use, how to structure an analysis) but lacks integrated code-level expertise — the agent has no researched reference to reach for when it needs to write a PyTorch training loop, compose an sklearn pipeline, or set up an experiment directory.

Deliverables

Three new reference documents, researched from current documentation and best practices (not generated from parametric knowledge):

1. `references/pytorch-integration.md`

Researched from pytorch.org/docs/stable and current best practices. Covers:

- Device management: `torch.device`, `.to(device)`, `accelerator` pattern, MPS support detection
- Training loop patterns: canonical supervised loop, epoch iteration, batch processing, gradient accumulation
- Dataset & DataLoader: custom `Dataset`, collate functions, worker configuration, shuffling, `pin_memory`
- Model saving/loading: `state_dict`, full model save, `torch.save`/`torch.load`, checkpointing with optimizer state
- Mixed precision: `torch.amp.autocast` + `torch.amp.GradScaler` (the older `torch.cuda.amp` spellings are deprecated in recent PyTorch 2.x releases)
- Distributed training: `DistributedDataParallel` overview (when needed)
- Loss functions: common choices mapped to problem types
- Optimizers & schedulers: `AdamW`, `SGD`, `OneCycleLR`, `ReduceLROnPlateau`
- Transfer learning: `torchvision.models` feature extraction vs fine-tuning, layer freezing
- Knowledge distillation: teacher-student pattern, temperature scaling, KL divergence loss
- Debugging: gradient checking, NaN detection, overfitting on a single batch

Source validation: Verify each pattern against current PyTorch 2.x docs. Include version compatibility notes.
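To make the device-management and training-loop items above concrete, here is a minimal sketch, assuming a toy linear-regression problem. The helper names `pick_device` and `train`, the synthetic data, and the hyperparameters are illustrative assumptions and are not taken from the reference document itself.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def train(model, loader, optimizer, loss_fn, device, epochs=1, accum_steps=1):
    """Supervised loop with gradient accumulation; returns mean loss per epoch."""
    model.to(device)
    history = []
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad(set_to_none=True)
        total, n_batches = 0.0, 0
        for step, (x, y) in enumerate(loader, start=1):
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            # Divide by accum_steps so accumulated gradients are averaged, not summed.
            (loss / accum_steps).backward()
            # Step every accum_steps batches, and once more on the final batch.
            if step % accum_steps == 0 or step == len(loader):
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
            total += loss.item()
            n_batches += 1
        history.append(total / n_batches)
    return history


# Hypothetical toy data: y is a fixed linear function of x.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = X @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)
history = train(model, loader, optimizer, nn.MSELoss(), pick_device(),
                epochs=5, accum_steps=2)
```

With `accum_steps=2` and a batch size of 32, each optimizer step sees an effective batch of 64 samples. Mixed precision, checkpointing, and schedulers are deliberately left out to keep the sketch short.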

2. `references/sklearn-integration.md`

Researched from scikit-learn.org/stable and current best practices. Covers:

- Pipeline composition: `Pipeline`, `make_pipeline`, named steps, `set_params` for grid search
- ColumnTransformer: heterogeneous data (numeric vs categorical), `make_column_selector` (a module-level function in `sklearn.compose`, not a `ColumnTransformer` method)
- Preprocessing: `StandardScaler`, `OneHotEncoder`, `OrdinalEncoder`, `SimpleImputer`, iterative imputation
- Model selection: `GridSearchCV`, `RandomizedSearchCV`, custom `ParameterGrid`, nested CV
- Cross-validation strategies: `KFold`, `StratifiedKFold`, `GroupKFold`, `TimeSeriesSplit`
- Ensemble patterns: stacking, voting, bagging, boosting (XGBoost/LightGBM integration note)
- Custom estimators: `BaseEstimator` + `TransformerMixin` protocol
- Persistence: `joblib.dump`/`load`, version compatibility between sklearn versions
- Imbalanced data: `class_weight`, SMOTE (imblearn), `stratify`
- Feature engineering: polynomial features, `KBinsDiscretizer`, `FunctionTransformer`
- Calibration: `CalibratedClassifierCV` for probability calibration

Source validation: Verify against current sklearn 1.6+ API. Note deprecations and removals.
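The pipeline, ColumnTransformer, and model-selection items above fit together roughly as in the following sketch. It assumes pandas is available and uses a small synthetic table; the column names, the label rule, and the parameter grid are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with a few missing numeric values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[::17, "age"] = np.nan
y = (df["income"] > 50_000).astype(int)

# Numeric columns: impute then scale. Categorical columns: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step names become parameter prefixes: "clf__C" targets LogisticRegression.C.
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(df, y)
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are refit on each training fold, which avoids leaking validation-fold statistics into the search.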

3. `references/data-science-coding-workflow.md`

Researched from DS project conventions and reproducibility best practices. Covers:

- Project directory structure: standard layout (`data/`, `notebooks/`, `src/`, `models/`, `reports/`, `config/`)
- Configuration management: YAML/OmegaConf/Hydra patterns for experiment config
- Experiment logging: MLflow tracking, TensorBoard, WandB (comparison and when to use what)
- Result serialization: standard file formats (parquet for data, json for metrics, onnx/pickle for models)
- Reproducibility: random seed management, environment pinning, Docker for full reproducibility
- Data versioning: DVC patterns, hash-based caching, data integrity checks
- Unit testing for DS: test data generation, assert on small known-output cases, model invariance tests
- Documentation: auto-generated experiment reports, decision logging (ADRs for data science)

Source validation: Reference established case studies and DS project templates (Cookiecutter Data Science, MLflow examples, DVC documentation).
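Several of the workflow items above (directory layout, seed management, JSON result serialization, hash-based integrity checks) can be sketched with the standard library alone. The function names `set_seed`, `sha256_of`, and `init_run`, the `run.json` file name, and the config keys are hypothetical choices for this example, not conventions taken from the cited templates.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def set_seed(seed: int) -> None:
    """Seed the stdlib RNG; a real project would also seed NumPy and PyTorch."""
    random.seed(seed)


def sha256_of(path: Path) -> str:
    """Content hash of a file, read in chunks, for a data-integrity check."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def init_run(root: Path, name: str, config: dict, data_file: Path) -> Path:
    """Create reports/<name>/ and record the config plus the data hash as JSON."""
    run_dir = root / "reports" / name
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"config": config, "data_sha256": sha256_of(data_file)}
    (run_dir / "run.json").write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Layout from the list above.
    for sub in ("data", "notebooks", "src", "models", "reports", "config"):
        (root / sub).mkdir()

    data_file = root / "data" / "train.csv"
    data_file.write_bytes(b"x,y\n1,2\n3,4\n")

    config = {"seed": 42, "lr": 0.01}

    # Re-seeding with the same value reproduces the same random draw.
    set_seed(config["seed"])
    first = random.random()
    set_seed(config["seed"])
    second = random.random()

    run_dir = init_run(root, "baseline", config, data_file)
    record = json.loads((run_dir / "run.json").read_text())
```

Storing the data hash next to the config lets a later run detect that its input file changed, which is the same idea DVC applies at larger scale.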

Files Modified

- `data-scientist/SKILL.md`: add all three references to Available Resources section; update compatibility to explicitly call out PyTorch/sklearn integration
- `data-scientist/README.md`: add reference summaries if it's being kept in sync

Out of Scope

- Writing complete tutorials or notebooks (these are condensed references for agent consumption)
- Covering every PyTorch API (focused on patterns an agent will likely need during a research campaign)
- JAX/TensorFlow integration (separate follow-up if needed)

Researched, Not Generated

Each reference must be validated against current documentation (pytorch.org, scikit-learn.org, established blog posts from authoritative sources). The goal is to provide the agent with information that is more current than its training cutoff and more specific to the DS campaign workflow than general knowledge.

magnus was assigned by jasper

2026-05-23 17:00:18 -04:00

jasper added the skill-upgrade label

2026-05-23 17:00:39 -04:00

jasper commented

2026-05-23 17:00:48 -04:00

Contributor

Triage — Jasper (automated)

Label: skill-upgrade — adds researched reference documents to the existing data-scientist skill.

Assessment: Well-scoped issue. The three proposed references address a genuine gap: the skill currently has strong methodology guidance but no code-level references for PyTorch training loops, sklearn pipelines, or DS project workflows.

Notes:

- All three references are well-defined with specific topics and source validation requirements.
- The "researched, not generated" constraint is the key quality bar: these need to be validated against current docs (PyTorch 2.x, scikit-learn 1.6+), not written from parametric knowledge.
- Out-of-scope items (JAX/TensorFlow, full tutorials) are clearly called out, keeping this focused.
- This is one of the first issues in the repo getting proper triage with the new label system.

Suggested approach:

1. Research phase: crawl pytorch.org, scikit-learn.org, and DS workflow references (Cookiecutter DS, MLflow docs, DVC docs)
2. Write phase: create each reference document, validate patterns against actual doc content
3. Integration: update SKILL.md's Available Resources section + README if maintained

Size: Medium (~3 reference docs + minor SKILL.md updates)

jasper commented

2026-05-23 17:00:58 -04:00

Contributor

Triage

Label: skill-upgrade — well-scoped enhancement to the data-scientist skill adding three researched reference documents for code-level expertise.

Assessment: Well-structured issue with clear deliverables, source validation requirements, and explicit out-of-scope boundaries. The "researched, not generated" constraint is correctly emphasized.

Recommendation: Ready for implementation. Each reference should be validated against live documentation (pytorch.org, scikit-learn.org, DS project templates) rather than generated from parametric knowledge.

— Jasper (automated)

magnus referenced this issue from a commit

2026-05-23 17:10:13 -04:00

feat: add researched code integration references — PyTorch, sklearn, DS workflow

magnus referenced this issue from a pull request that will close it

2026-05-23 17:10:22 -04:00

feat: add researched code integration references — PyTorch, sklearn, DS workflow #26

magnus commented

2026-05-23 17:13:07 -04:00

Author

Owner

Delivered via PR #26:

- `references/pytorch-integration.md`: validated against PyTorch 2.12 docs. Covers device management, training loops, AMP, `torch.compile`, transfer learning, LoRA, distillation, pruning, DDP, debugging.
- `references/sklearn-integration.md`: validated against scikit-learn 1.8.0 docs. Covers pipelines, ColumnTransformer, model selection, ensembles, calibration, custom estimators, feature selection, PCA.
- `references/data-science-coding-workflow.md`: project structure, config management, experiment logging (JSON/MLflow/TensorBoard/WandB), reproducibility, DVC, unit testing.

66/66 validation tests passing. Every API name cross-checked against current docs.
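The thread does not say how the API names were cross-checked. One hypothetical way to automate part of such a check is to resolve each dotted name against the installed packages, as in the sketch below. The helpers `resolve` and `missing_names` are invented for this example, and the demo uses standard-library names only so that it runs anywhere.

```python
import importlib


def resolve(dotted: str):
    """Import the longest importable module prefix, then walk the remaining attributes."""
    parts = dotted.split(".")
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue  # Not a module at this depth; try a shorter prefix.
        for attr in parts[i:]:
            obj = getattr(obj, attr)  # Raises AttributeError if the name is absent.
        return obj
    raise ImportError(f"no importable prefix in {dotted!r}")


def missing_names(names):
    """Return the names that cannot be resolved in the current environment."""
    missing = []
    for name in names:
        try:
            resolve(name)
        except (ImportError, AttributeError):
            missing.append(name)
    return missing


# Demo with stdlib names; the last one is deliberately wrong.
missing = missing_names([
    "json.dumps",
    "collections.OrderedDict",
    "json.no_such_function",
])
```

A check like this only confirms that a name exists in the installed version. It does not confirm signatures or deprecation status, so it complements reading the documentation and does not replace it.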

Closing.

magnus closed this issue

2026-05-23 17:13:07 -04:00


2 participants
