Data-Scientist Skill: Researched PyTorch + Scikit-Learn + DS Coding Workflow References #23

Closed
opened 2026-05-23 16:59:52 -04:00 by magnus · 3 comments
Owner

Researched References: PyTorch + Scikit-Learn + Data Science Coding Workflow

The data-scientist skill currently provides strong methodology guidance (which test to use, how to structure an analysis) but lacks integrated code-level expertise — the agent has no researched reference to reach for when it needs to write a PyTorch training loop, compose an sklearn pipeline, or set up an experiment directory.

Deliverables

Three new reference documents, researched from current documentation and best practices (not generated from parametric knowledge):

1. references/pytorch-integration.md

Researched from pytorch.org/docs/stable and current best practices. Covers:

  • Device management — torch.device, .to(device), accelerator pattern, MPS support detection
  • Training loop patterns — canonical supervised loop, epoch iteration, batch processing, gradient accumulation
  • Dataset & DataLoader — custom Dataset, collate functions, worker configuration, shuffling, pin_memory
  • Model saving/loading — state_dict, full model save, torch.save/torch.load, checkpointing with optimizer state
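  • Mixed precision — torch.cuda.amp.autocast + GradScaler (recent PyTorch 2.x releases deprecate the torch.cuda.amp spellings in favor of torch.amp.autocast and torch.amp.GradScaler with a device argument)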
  • Distributed training — DistributedDataParallel overview (when needed)
  • Loss functions — common choices mapped to problem types
  • Optimizers & schedulers — AdamW, SGD, OneCycleLR, ReduceLROnPlateau
  • Transfer learning — torchvision.models feature extraction vs fine-tuning, layer freezing
  • Knowledge distillation — teacher-student pattern, temperature scaling, KL divergence loss
  • Debugging — gradient checking, NaN detection, overfitting on a single batch

Source validation: Verify each pattern against current PyTorch 2.x docs. Include version compatibility notes.
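
As an illustration of the first two bullets (device management and a training loop with gradient accumulation), a minimal sketch might look like the following. It is not taken from the delivered reference; the helper names `pick_device` and `train`, the synthetic data, and the hyperparameters are all assumptions made for the example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def train(model, loader, optimizer, loss_fn, device, epochs=2, accum_steps=2):
    """Supervised loop with gradient accumulation; returns the mean loss per epoch."""
    model.to(device)
    history = []
    for _ in range(epochs):
        model.train()
        running = 0.0
        optimizer.zero_grad()
        for step, (x, y) in enumerate(loader, start=1):
            x, y = x.to(device), y.to(device)
            loss = loss_fn(model(x), y)
            # Divide by accum_steps so the summed gradients average over the group.
            (loss / accum_steps).backward()
            running += loss.item()
            # Step after every accum_steps batches, and once more at the end of the epoch.
            if step % accum_steps == 0 or step == len(loader):
                optimizer.step()
                optimizer.zero_grad()
        history.append(running / len(loader))
    return history


torch.manual_seed(0)
features = torch.randn(64, 4)
targets = features.sum(dim=1, keepdim=True)  # synthetic regression target (assumption)
loader = DataLoader(TensorDataset(features, targets), batch_size=8, shuffle=True)
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
device = pick_device()
history = train(model, loader, optimizer, nn.MSELoss(), device)
```

With `batch_size=8` and `accum_steps=2`, each optimizer step sees gradients averaged over 16 samples, which is the usual reason to accumulate when a larger batch does not fit in memory.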

2. references/sklearn-integration.md

Researched from scikit-learn.org/stable and current best practices. Covers:

  • Pipeline composition — Pipeline, make_pipeline, named steps, set_params for grid search
  • ColumnTransformer — heterogeneous data (numeric vs categorical), sklearn.compose.make_column_selector
  • Preprocessing — StandardScaler, OneHotEncoder, OrdinalEncoder, SimpleImputer, iterative imputation
  • Model selection — GridSearchCV, RandomizedSearchCV, custom ParameterGrid, nested CV
  • Cross-validation strategies — KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit
  • Ensemble patterns — stacking, voting, bagging, boosting (XGBoost/LightGBM integration note)
  • Custom estimators — BaseEstimator + TransformerMixin protocol
  • Persistence — joblib.dump/load, version compatibility between sklearn versions
  • Imbalanced data — class_weight, SMOTE (imblearn), stratify
  • Feature engineering — polynomial features, KBinsDiscretizer, FunctionTransformer
  • Calibration — CalibratedClassifierCV for probability calibration

Source validation: Verify against current sklearn 1.6+ API. Note deprecations and removals.
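
To make the pipeline, ColumnTransformer, and model-selection bullets concrete, here is a small sketch under stated assumptions: the column names, the synthetic data frame, and the parameter grid are invented for illustration and do not come from the delivered reference. Note that `make_column_selector` is imported from `sklearn.compose`, not accessed through `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic mixed-type data (assumption): two numeric columns, one categorical.
rng = np.random.default_rng(0)
n = 80
frame = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 8_000, n),
    "city": rng.choice(["north", "south", "east"], n),
})
frame.loc[::10, "age"] = np.nan  # inject missing values for the imputer
labels = (frame["income"] > 50_000).astype(int)

# Numeric columns: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Route columns by dtype: numeric ones to the numeric pipeline, the rest to one-hot.
preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step names become parameter prefixes: "clf__C" targets LogisticRegression.C.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(frame, labels)
best_c = search.best_params_["clf__C"]
```

Because preprocessing lives inside the pipeline, each cross-validation fold fits the imputer and scaler on its own training split only, which avoids leaking validation statistics into the fit.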

3. references/data-science-coding-workflow.md

Researched from DS project conventions and reproducibility best practices. Covers:

  • Project directory structure — standard layout (data/, notebooks/, src/, models/, reports/, config/)
  • Configuration management — YAML/OmegaConf/Hydra patterns for experiment config
  • Experiment logging — MLflow tracking, TensorBoard, WandB (comparison and when to use what)
  • Result serialization — standard file formats (parquet for data, json for metrics, onnx/pickle for models)
  • Reproducibility — random seed management, environment pinning, Docker for full reproducibility
  • Data versioning — DVC patterns, hash-based caching, data integrity checks
  • Unit testing for DS — test data generation, assert on small known-output cases, model invariance tests
  • Documentation — auto-generated experiment reports, decision logging (ADRs for data science)

Source validation: Reference established case studies and DS project templates (Cookiecutter Data Science, MLflow examples, DVC documentation).
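
The reproducibility, result-serialization, and data-integrity bullets can be sketched with the standard library alone. The function names (`seed_everything`, `file_sha256`, `write_run_record`), the directory names, and the placeholder metric value below are assumptions for illustration; a real project would also seed NumPy and PyTorch and would likely delegate hashing to a tool such as DVC.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG; NumPy/PyTorch seeding would be added where those are used."""
    random.seed(seed)


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(run_dir: Path, seed: int, data_path: Path, metrics: dict) -> Path:
    """Store the seed, the input data hash, and the metrics as one JSON file."""
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"seed": seed, "data_sha256": file_sha256(data_path), "metrics": metrics}
    out = run_dir / "metrics.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_path = root / "data" / "sample.csv"
    data_path.parent.mkdir(parents=True)
    data_path.write_bytes(b"x,y\n1,2\n")

    # Re-seeding with the same value reproduces the same draw.
    seed_everything(42)
    first_draw = random.random()
    seed_everything(42)
    second_draw = random.random()

    # The metric value is a placeholder, not a real result.
    record_path = write_run_record(
        root / "reports" / "run-001", 42, data_path, {"accuracy": 0.9}
    )
    record = json.loads(record_path.read_text())
```

Recording the data hash next to the metrics lets a later run detect that its input file changed before comparing results.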

Files Modified

  • data-scientist/SKILL.md — add all three references to Available Resources section; update compatibility to explicitly call out PyTorch/sklearn integration
  • data-scientist/README.md — add reference summaries if it's being kept in sync

Out of Scope

  • Writing complete tutorials or notebooks (these are condensed references for agent consumption)
  • Covering every PyTorch API (focused on patterns an agent will likely need during a research campaign)
  • JAX/TensorFlow integration (separate follow-up if needed)

Researched, Not Generated

Each reference must be validated against current documentation (pytorch.org, scikit-learn.org, established blog posts from authoritative sources). The goal is to provide the agent with information that is more current than its training cutoff and more specific to the DS campaign workflow than general knowledge.

Contributor

Triage — Jasper (automated)

Label: skill-upgrade — adds researched reference documents to the existing data-scientist skill.

Assessment: Well-scoped issue. The three proposed references address a genuine gap: the skill currently has strong methodology guidance but no code-level references for PyTorch training loops, sklearn pipelines, or DS project workflows.

Notes:

  • All three references are well-defined with specific topics and source validation requirements.
  • The "researched, not generated" constraint is the key quality bar — these need to be validated against current docs (pytorch.org 2.x, scikit-learn.org 1.6+), not written from parametric knowledge.
  • Out-of-scope items (JAX/TensorFlow, full tutorials) are clearly called out, keeping this focused.
  • This is one of the first issues in the repo getting proper triage with the new label system.

Suggested approach:

  1. Research phase: crawl pytorch.org, scikit-learn.org, and DS workflow references (Cookiecutter DS, MLflow docs, DVC docs)
  2. Write phase: create each reference document, validate patterns against actual doc content
  3. Integration: update SKILL.md's Available Resources section + README if maintained

Size: Medium (~3 reference docs + minor SKILL.md updates)

Contributor

Triage

Label: skill-upgrade — well-scoped enhancement to the data-scientist skill adding three researched reference documents for code-level expertise.

Assessment: Well-structured issue with clear deliverables, source validation requirements, and explicit out-of-scope boundaries. The "researched, not generated" constraint is correctly emphasized.

Recommendation: Ready for implementation. Each reference should be validated against live documentation (pytorch.org, scikit-learn.org, DS project templates) rather than generated from parametric knowledge.

— Jasper (automated)

Author
Owner

Delivered via PR #26:

  • references/pytorch-integration.md — validated against PyTorch 2.12 docs. Covers device management, training loops, AMP, torch.compile, transfer learning, LoRA, distillation, pruning, DDP, debugging.
  • references/sklearn-integration.md — validated against scikit-learn 1.8.0 docs. Covers pipelines, ColumnTransformer, model selection, ensembles, calibration, custom estimators, feature selection, PCA.
  • references/data-science-coding-workflow.md — project structure, config management, experiment logging (JSON/MLflow/TensorBoard/WandB), reproducibility, DVC, unit testing.

66/66 validation tests passing. Every API name cross-checked against current docs.

Closing.
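
The validation tests from PR #26 are not shown in this thread. As a purely hypothetical sketch, one way an API-name check could be automated is to resolve each dotted name against the installed package (which verifies the installed version, not the online docs). The helpers `resolve` and `missing_names` are invented for this example, and standard-library names stand in for PyTorch and scikit-learn names so the block runs anywhere.

```python
import importlib


def resolve(dotted: str):
    """Import the longest importable module prefix, then walk the remaining attributes."""
    parts = dotted.split(".")
    for cut in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:cut]))
        except ImportError:
            continue  # this prefix is not a module; try a shorter one
        for attr in parts[cut:]:
            obj = getattr(obj, attr)  # AttributeError here means the name is stale
        return obj
    raise ImportError(f"no importable prefix in {dotted!r}")


def missing_names(names):
    """Return the names that do not resolve in the current environment."""
    missing = []
    for name in names:
        try:
            resolve(name)
        except (ImportError, AttributeError):
            missing.append(name)
    return missing


# The last entry is deliberately bogus to show what a failure looks like.
checked = missing_names(["json.dumps", "pathlib.Path.read_text", "json.no_such_function"])
```

In a reference-validation suite, the input list would be the API names extracted from each reference document, and a non-empty result would fail the test.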

Reference
magnus/agent-skills#23