# Maintainers

Essential commands and paths for day‑to‑day maintenance.
## Do This

### Review Dataset Submissions

Contributors now submit complete packages with documentation. Your role:

- Review PR contents:
    - ✅ Dataset parquet file in `converted_datasets/`
    - ✅ Documentation in `docs/datasets/`
    - ✅ Conversion script in `contributor_tools/conversion_scripts/`
- Check validation results:
    - Review validation pass rates in documentation and call out large drops between raw vs. clean strides
    - Confirm contributors explain persistent violations or intentional exclusions
- Check for appropriate task coverage
- Verify metadata:
    - Short code is unique
    - Institution and citation provided
    - Subject count and tasks documented
- Merge if complete:
    - All files present
    - Validation acceptable
    - Documentation complete
### Quick Validation Tools

- Test dataset: `python contributor_tools/quick_validation_check.py <dataset_phase.parquet>`
- Filter strides: `python contributor_tools/create_clean_dataset.py <dataset_phase.parquet>`
- Serve docs locally: `mkdocs serve`
## Validation Checks

### Validation Range Schema

All phase-indexed validators consume YAML files that mirror this structure:
version: "2.0"
generated: "YYYY-MM-DD HH:MM:SS"
description: Optional free-form text
tasks:
task_name:
metadata: {} # optional, reserved for future keys
phases:
0: # phase percentage as integer 0–100
hip_flexion_angle_ipsi_rad:
min: -0.25
max: 1.05
hip_flexion_angle_contra_rad:
min: -0.30
max: 0.95
50:
grf_vertical_ipsi_BW:
min: 0.0
max: 1.4
Key points:

- Phase keys are integers; loaders coerce numeric strings but reject anything outside 0–100.
- Variables are stored exactly as used downstream (both `_ipsi` and `_contra` entries are explicit).
- Ranges are simple `min`/`max` floats; missing values should be omitted entirely instead of set to `null`.
- `metadata` is optional and currently unused, but preserved verbatim so future per-task flags can be added without code changes.

`ValidationConfigManager` simply deep-copies this schema; there is no automatic contralateral generation anymore. Update both limbs directly whenever ranges need to diverge.
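For orientation, here is a minimal sketch of loading and sanity-checking a range file with PyYAML. The real loading logic lives in the shared `Validator` backend, so treat this as illustrative only:

```python
# Minimal sketch, assuming PyYAML is available; the shipped loader in the
# shared Validator backend may differ in detail.
import yaml

def load_ranges(path):
    """Load a validation range YAML, coercing phase keys to ints in 0-100."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    for task, spec in config.get("tasks", {}).items():
        phases = {}
        for key, variables in spec.get("phases", {}).items():
            phase = int(key)  # coerce numeric strings such as "50"
            if not 0 <= phase <= 100:
                raise ValueError(f"{task}: phase {phase} outside 0-100")
            phases[phase] = variables
        spec["phases"] = phases
    return config
```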
### Phase-Indexed Data

Maintainers expect every 150-sample stride to satisfy the phase-based validator:

- Envelope Match: Each feature must stay inside the normative range YAML (mean ± tolerance) across the full 0–100% phase. Outliers mark stride/feature failures.
- Event Alignment: Heel-strike / toe-off event timestamps should align with the standard template (within ±5% phase). Large shifts imply bad segmentation or mislabeled leading limb.
- Cycle Sanity: Basic stats (peak knee flexion, ankle dorsiflexion timing, stride duration) are cross-checked against dataset metadata. Values outside physiological windows flag either unit errors or incorrect phase normalization.
- Symmetry Spot-Checks: For bilateral features, ipsi/contra differences beyond configured thresholds highlight swapped limbs or sign inversions before envelope comparisons even run.
Use the validator report (and `interactive_validation_tuner.py` when needed) to adjust ranges or request converter fixes.
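The envelope check boils down to comparing each stride sample against the per-phase bounds from the range YAML. A hedged sketch, assuming strides arrive as 150-sample NumPy arrays (names are illustrative, not the production code):

```python
# Illustrative only: the production check lives in the Validator backend.
import numpy as np

def stride_passes_envelope(stride: np.ndarray, phase_ranges: dict) -> bool:
    """stride: (150,) samples for one feature.
    phase_ranges: {phase_pct: {"min": lo, "max": hi}} for that feature."""
    for phase_pct, bounds in phase_ranges.items():
        idx = round(phase_pct / 100 * 149)  # map 0-100% onto sample indices
        if not bounds["min"] <= stride[idx] <= bounds["max"]:
            return False  # one out-of-range phase fails the stride/feature
    return True
```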
### Time-Indexed Data

Episodes that remain in time space (non-cyclic tasks) run a lightweight structural suite aimed at catching systemic issues:

- Baseline Offset Audit: Identify quasi-static frames (e.g., first 0.5 s) and require velocities/accelerations to average ~0, vertical acceleration near 1 g. Flags zeroing and bias problems.
- Derivative/Product Consistency: Integrate angular velocity back to the recorded angle (after high-pass filtering) and verify `power ≈ moment × angular_velocity`. Divergence indicates sign flips or scaling errors (see the sketch after this list).
- Cross-Limb Correlation: Normalize episode duration to [0,1] and cross-correlate ipsi vs. contra channels; the peak must appear within a small lag window. Large shifts expose segmentation offsets or swapped sides.
- Physiologic Guardrails: Enforce simple min/max bounds per joint, moment, and GRF so inverted channels fail quickly (e.g., knee flexion −20° to 160°, ankle moment ±3 Nm/kg).
Failing these checks should prompt maintainers to request converter fixes before accepting a time-indexed dataset.
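The power consistency idea reduces to an elementwise comparison; a minimal NumPy sketch, with the tolerance and names as assumptions rather than the shipped values:

```python
# Sketch of the derivative/product consistency test; the shipped tolerance
# and signal names may differ.
import numpy as np

def power_consistent(moment, angular_velocity, power, rel_tol=0.05):
    """Flag sign flips or scaling errors: power should track moment * omega."""
    expected = np.asarray(moment) * np.asarray(angular_velocity)
    scale = max(np.max(np.abs(expected)), 1e-9)  # guard against all-zero input
    return np.max(np.abs(np.asarray(power) - expected)) <= rel_tol * scale
```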
Time-series thresholds live in `contributor_tools/validation_ranges/time_structural.yaml`. Maintainers can tune tolerances (baseline window, correlation targets, guardrail ranges) without touching code. The quick validation CLI now reports whether a dataset was validated in phase or time mode and summarises structural issues per task. Plot generation remains phase-only; time datasets surface textual diagnostics instead.
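To make the tunables concrete, a hypothetical `time_structural.yaml` might group them as follows; every key name here is illustrative, so check the shipped file for the actual schema:

```yaml
# Hypothetical layout; the real time_structural.yaml is authoritative.
baseline:
  window_s: 0.5            # quasi-static window audited at episode start
  max_mean_velocity: 0.05  # average velocity allowed in that window
cross_limb:
  max_lag_fraction: 0.05   # lag window for the ipsi/contra correlation peak
guardrails:
  knee_flexion_angle_rad: {min: -0.35, max: 2.79}  # roughly -20 to 160 deg
  ankle_moment_Nm_kg: {min: -3.0, max: 3.0}
```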
## Where Things Are

- Converters: `contributor_tools/conversion_scripts/`
- Outputs: `converted_datasets/`
- Validation engine: `internal/validation_engine/validator.py`
- Validation ranges: `contributor_tools/validation_ranges/`
- Maintainer note: all tooling (`interactive_validation_tuner.py`, `quick_validation_check.py`, and `manage_dataset_documentation.py`) now calls the shared `Validator` backend. Adjust validation rules or range loading logic in one place and every workflow (GUI, CLI, doc generation) stays in sync.
- Python API: `src/locohub/locomotion_data.py`
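For quick spot checks, the Python API can load a converted parquet directly. A minimal sketch, assuming `LocomotionData` is exported at the package level and accepts a parquet path (verify against `locomotion_data.py`):

```python
# Minimal sketch; the constructor signature is an assumption.
from locohub import LocomotionData

data = LocomotionData("converted_datasets/umich_2021_phase_dirty.parquet")
print(data)  # inspect tasks/subjects before digging into validation output
```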
## Workflows

### Standard PR Review Flow

1. Contributor submits PR with dataset + documentation
2. Review submission - check files, validation, metadata
3. Request changes if needed (missing info, low validation)
4. Merge when ready - documentation is already complete!
### Maintenance Tasks

- Update validation ranges: Edit YAML → have contributors re-run validation
- Add new variables: Update `feature_constants.py` → update converters
- Fix documentation: Direct edits to `docs/datasets/*.md` files
- Archive datasets: Move old docs to `archived/` subdirectory
### PyPI Release Checklist

1. Bump version in both `pyproject.toml` and `src/locohub/__init__.py`.
2. Update changelog/notes (e.g., `docs/maintainers/index.md` or release draft).
3. Clean previous builds: `rm -rf dist src/locohub.egg-info`.
4. Build artifacts: `python -m build` (make sure `wheel` is installed; add `--no-isolation` if you already have the build dependencies locally).
5. Verify metadata: `python -m twine check dist/*`.
6. Smoke test in a fresh environment:

    ```bash
    python -m venv .venv-release
    . .venv-release/bin/activate
    pip install --upgrade pip
    pip install dist/locohub-<version>.whl
    python -c "import locohub; print(locohub.__version__)"
    deactivate
    ```

7. TestPyPI dry run:

    ```bash
    python -m twine upload --repository testpypi dist/*
    pip install --index-url https://test.pypi.org/simple/ locohub==<version>
    ```

8. Publish to PyPI once the smoke test passes: `python -m twine upload dist/*`.
9. Tag and announce: push the git tag, update documentation (`README.md`, release notes), and notify contributors.
## Contributor Tools at a Glance

Quick references for the contributor-facing scripts maintainers should recognize, including the unified submission workflow.
`create_clean_dataset.py` — Filters stride data using the validation engine and writes a cleaned parquet copy.
```mermaid
flowchart TD
    A[Start CLI] --> B[Parse dataset/ranges/exclusions]
    B --> C{Dataset file exists?}
    C -- No --> Z[Exit with error]
    C -- Yes --> D[Derive output name]
    D --> E[Load dataset with LocomotionData]
    E --> F[Validate requested exclude columns]
    F --> G{Output exists?}
    G -- No --> H[Init Validator with ranges]
    G -- Yes --> I{Overwrite confirmed?}
    I -- No --> Z
    I -- Yes --> H
    H --> J[Filter each task: remove failing strides]
    J --> K[Drop excluded columns and save parquet]
    K --> L[Report pass rate + output]
    L --> M[Return exit code]
```
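A typical invocation against the example dataset used later on this page; flags for custom ranges and column exclusions exist per the flowchart above, so check `--help` for their exact spellings:

```bash
python contributor_tools/create_clean_dataset.py converted_datasets/umich_2021_phase_dirty.parquet
```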
`manage_dataset_documentation.py` — Unified contributor workflow for validation, plots, and documentation.
Generates or refreshes everything a contributor needs for a dataset page. The script derives a dataset slug from the parquet file name, stores metadata in `docs/datasets/_metadata/`, and writes the tab wrapper (`docs/datasets/<short_code>.md`) along with the generated snippet bodies.

**`add-dataset` subcommand**
Primary entry point today. Collects metadata (prompts or file), runs validation, writes dataset docs, persists metadata YAML, regenerates tables, and outputs the submission checklist. Use `python contributor_tools/manage_tasks.py` when a contributor proposes a brand-new base family. Pathology suffixes (e.g., `_stroke`, `_pd`) rely on naming convention instead—have them clone the base ranges in the interactive tuner, save a cohort-specific YAML, and remind them that validators never fall back to the able-bodied envelopes.

```mermaid
flowchart TD
    A[Start CLI] --> B[Parse dataset and options]
    B --> C{Metadata file supplied?}
    C -- Yes --> D[Load YAML or JSON metadata]
    C -- No --> E[Prompt contributor for fields]
    D --> F
    E --> F[Assemble metadata payload]
    F --> F1["Ensure base families exist in registry (manage_tasks.py)"]
    F1 --> G[Run validator on parquet]
    G --> H{Validation passed?}
    H -- No --> I[Capture issues but continue]
    H -- Yes --> J[Store pass statistics]
    I --> K
    J --> K[Embed validation summary]
    K --> L[Render overview & validation markdown]
    L --> M[Write metadata YAML and checklist]
    M --> N[Regenerate dataset tables via markers]
    N --> O[Exit with status]
```
**`update-documentation` subcommand**
Fast path to refresh the overview markdown and metadata from the latest parquet without rerunning validation. Loads the stored YAML (or an override), re-extracts tasks/subjects, prompts you with the current values (press Enter to keep), rewrites the overview markdown under `docs/datasets/`, and refreshes the dataset tables.

```mermaid
flowchart TD
    A[Start CLI] --> B[Resolve short code]
    B --> C[Load existing metadata YAML]
    C --> D["Resolve dataset path (flag or last_dataset_path)"]
    D --> E[Extract tasks & subjects]
    E --> F[Regenerate overview markdown]
    F --> G[Write metadata YAML]
    G --> H[Refresh dataset tables]
    H --> I[[Done]]
```
**Command:**
```bash
python contributor_tools/manage_dataset_documentation.py update-documentation \
    --short-code UM21 [--dataset converted_datasets/umich_2021_phase_dirty.parquet]
```
**`update-validation` subcommand**
Runs validation again, snapshots the active ranges into the dataset folder, rebuilds the validation report, and refreshes plots. Also rewrites the overview page so pass-rate badges stay current.

```mermaid
flowchart TD
    A[Start CLI] --> B[Resolve short code]
    B --> C[Load metadata]
    C --> D[Resolve dataset + ranges]
    D --> E[Run validator]
    E --> F[Snapshot ranges YAML]
    F --> G[Regenerate validation report]
    G --> H[Regenerate overview + tables]
    H --> I[Refresh plots gallery]
    I --> J[[Done]]
```
**Command:**
```bash
python contributor_tools/manage_dataset_documentation.py update-validation \
    --short-code UM21 [--dataset converted_datasets/umich_2021_phase_dirty.parquet]
```
**`remove-dataset` subcommand**
Destructive cleanup. Deletes overview + validation markdown, metadata YAML, validation plots, the ranges snapshot, and the submission checklist so the dataset can be rebuilt from scratch. An optional flag also removes converted parquet files.

```mermaid
flowchart TD
    start([Start CLI]) --> resolve[Resolve short code]
    resolve --> purgeDocs[Delete docs/metadata/ranges]
    purgeDocs --> purgePlots[Delete validation plots + checklist]
    purgePlots --> deleteParquet{--remove-parquet?}
    deleteParquet -- No --> refreshTables[Refresh dataset tables]
    deleteParquet -- Yes --> rmParquet[Delete converted parquet]
    rmParquet --> refreshTables
    refreshTables --> report[Print removed paths]
    report --> done([Exit])
```
**Command:**
```bash
python contributor_tools/manage_dataset_documentation.py remove-dataset \
    --short-code UM21 [--remove-parquet]
```
`interactive_validation_tuner.py` — GUI tool for hands-on validation range tuning.
Helps contributors diagnose failing variables and author custom range YAMLs. Requires tkinter/display support; useful when datasets target special populations and need bespoke envelopes before re-running `add-dataset`.

```mermaid
flowchart TD
    A[Start CLI] --> B[Check tkinter and display availability]
    B -- Missing --> C[Print setup instructions and exit]
    B -- Available --> D[Launch tuner window]
    D --> E[Load validation YAML and dataset]
    E --> F[Render draggable range boxes]
    F --> G[Contributor adjusts ranges / toggles options]
    G --> H[Preview pass/fail changes]
    H --> I{Save ranges?}
    I -- Yes --> J[Export updated YAML]
    I -- No --> K[Keep editing]
    J --> L[Continue editing or close]
    K --> L
    L --> M[Exit application]
```
`quick_validation_check.py` — Fast validator that prints stride pass rates with optional plot rendering.
```mermaid
flowchart TD
    A[Start CLI] --> B[Parse CLI options]
    B --> C{Dataset file and ranges file exist?}
    C -- No --> Z[Exit with error]
    C -- Yes --> D[Initialize Validator]
    D --> E[Run validation]
    E --> F[Print pass summary]
    F --> G{Plot flag enabled?}
    G -- No --> H[Exit with status code]
    G -- Yes --> I[Render interactive or saved plots]
    I --> H
```
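A matching invocation; the plot flag shown in the flowchart is optional and its exact spelling should be confirmed via `--help`:

```bash
python contributor_tools/quick_validation_check.py converted_datasets/umich_2021_phase_dirty.parquet
```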
## Dataset Documentation Pipeline

Maintainers never hand-edit the generated dataset pages. Everything flows from a small set of source files that the CLI assembles into the published Markdown and assets.
```mermaid
flowchart TD
    parquet["converted_datasets/<dataset_name>_phase.parquet<br>or <dataset_name>_time.parquet"] --> tool[manage_dataset_documentation.py add-dataset/update-*]
    metadata_src["docs/datasets/_metadata/<short_code>.yaml<br>existing snapshot or CLI prompts"] --> tool
    shortcode["--short-code <short_code><br>CLI flag or inferred slug"] --> tool
    ranges[contributor_tools/validation_ranges/*.yaml] --> tool
    tool --> metadata_out["docs/datasets/_metadata/<short_code>.yaml<br>authoritative snapshot"]
    tool --> wrapper["docs/datasets/<short_code>.md<br>tab wrapper"]
    tool --> docbody["docs/datasets/_generated/<short_code>_documentation.md"]
    tool --> valbody["docs/datasets/_generated/<short_code>_validation.md"]
    tool --> plots["docs/datasets/validation_plots/<short_code>/*"]
    metadata_out --> tables["Dataset tables regenerated in README.md,<br>docs/index.md, docs/datasets/index.md"]
```
Key touch points:
- Dataset content comes from the parquet the CLI reads. Point `--dataset` at the clean phase file when possible; the tool falls back to `_clean`, `_raw`, or `_dirty` variants when needed.
- `docs/datasets/_metadata/<short_code>.yaml` is the single source of truth for display text, validation stats, download links, and range references. Edit this file (or pass `--metadata-file`) rather than changing the generated Markdown.
- The `_generated` Markdown files are regenerated on every run. To change wording, update the metadata values or adjust the rendering helpers inside the CLI.
- `docs/datasets/<short_code>.md` only embeds the generated snippets. If it diverges from the template it will be rewritten the next time the CLI runs.
- `update-documentation` refreshes metadata-driven text without re-running validation; `update-validation` triggers validation, snapshots ranges, and overwrites plots under `docs/datasets/validation_plots/<short_code>/`.
- Use `remove-dataset` when you need to wipe the generated files before re-adding a dataset; this leaves the parquet in place unless `--remove-parquet` is supplied.
- The `--short-code` flag feeds the `<short_code>` placeholders; when omitted, the CLI infers it from the parquet stem or existing metadata snapshot.
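Putting those flags together, a typical `add-dataset` run might look like this (flag names as referenced above; confirm with `--help`):

```bash
python contributor_tools/manage_dataset_documentation.py add-dataset \
    --dataset converted_datasets/umich_2021_phase_dirty.parquet \
    --short-code UM21
```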
## Documentation Website Architecture
Everything on the public site is generated from a small collection of source folders:
```
docs/
├── datasets/
│   ├── _generated/          # Snippet bodies for docs + validation tabs
│   ├── _metadata/           # YAML snapshots driving tables & cards
│   ├── validation_plots/    # Latest validation images + index.md per dataset
│   └── <dataset>.md         # Tab wrapper embedding documentation & validation snippets
├── maintainers/             # Maintainer handbook (this page)
├── reference/               # Data standard spec and units
├── contributing/            # Contributor step-by-step guide
└── index.md                 # Homepage (contains dataset table markers)
```
Key mechanics to remember:
- MkDocs reads `mkdocs.yml`, which pulls in `docs/` and enables the mermaid2 plugin for diagrams.
- `manage_dataset_documentation.py add-dataset` is the authoritative writer. It:
    1. Loads or prompts for metadata and writes `docs/datasets/_metadata/<slug>.yaml`.
    2. Runs validation, storing summary text and stats in the metadata dict.
    3. Renders `docs/datasets/<slug>.md` (tab wrapper) plus `_generated/<slug>_documentation.md` and `_generated/<slug>_validation.md` (snippet bodies consumed by the wrapper tabs).
    4. Regenerates the dataset tables inside the marker pairs (`<!-- DATASET_TABLE_START -->` / `<!-- DATASET_TABLE_END -->`) in `README.md`, `docs/index.md`, and `docs/datasets/index.md`.
    5. Writes `docs/datasets/validation_plots/<slug>/` (images plus `index.md`). Only the most recent plots are kept; git history provides older versions.
- The dataset tables now expose a single documentation link (covering both tabs) alongside clean/full dataset download links.
- Running `mkdocs serve` or `mkdocs build` does not invoke regeneration—it only renders the already-generated Markdown.
- If you hand-edit generated Markdown, mirror the change in the metadata or template; the next `add-dataset` run will otherwise overwrite it.
## Environment

```bash
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```