# Validation API Reference
Comprehensive API documentation for dataset validation, quality assessment, and compliance checking.
## Overview
The validation API provides tools for:
- Dataset quality assessment against biomechanical expectations
- Variable naming convention validation
- Phase-indexed data structure verification
- Automated validation report generation with visualizations
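For orientation, the shortest end-to-end workflow is sketched below (a minimal sketch, assuming a phase-indexed parquet file already exists on disk; the classes and methods it uses are documented in the sections that follow):

```python
from lib.validation.dataset_validator_phase import DatasetValidator

# Validate the dataset and write a report; run_validation() returns the report path
validator = DatasetValidator('dataset_phase.parquet')
report_path = validator.run_validation()
print(f"Validation report: {report_path}")
```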
## Core Validation Classes
### DatasetValidator
Main class for validating phase-indexed locomotion datasets.
```python
from lib.validation.dataset_validator_phase import DatasetValidator

# Initialize validator
validator = DatasetValidator('dataset_phase.parquet')
```
#### Constructor
```python
DatasetValidator(dataset_path, output_dir=None, generate_plots=True)
```
Parameters:
- `dataset_path` (str): Path to a phase-indexed dataset parquet file (the filename must end in `_phase.parquet`)
- `output_dir` (str, optional): Directory in which to save validation reports
- `generate_plots` (bool): Whether to generate validation plots (default: `True`)
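For example, to write reports to a dedicated directory and skip plot generation (a sketch using only the parameters documented above; the file and directory names are placeholders):

```python
# Save reports to ./validation_reports and skip plot generation
validator = DatasetValidator(
    'gait_data_phase.parquet',
    output_dir='./validation_reports',
    generate_plots=False
)
```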
#### Core Methods
##### `run_validation() -> str`
Run complete dataset validation pipeline.
Returns:
- `str`: Path to the generated validation report
Example:
```python
validator = DatasetValidator('gait_data_phase.parquet')
report_path = validator.run_validation()
print(f"Validation complete: {report_path}")
```
##### `load_dataset() -> LocomotionData`

Load and validate the dataset structure using the LocomotionData library.
Returns:
- `LocomotionData`: Loaded dataset object
Raises:
- `ValueError`: If the dataset format is invalid or required columns are missing
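Example (a minimal sketch that catches the documented `ValueError`; the file name is a placeholder):

```python
validator = DatasetValidator('gait_data_phase.parquet')
try:
    locomotion_data = validator.load_dataset()
except ValueError as e:
    # Raised when the file is not a valid phase-indexed dataset
    print(f"Dataset structure check failed: {e}")
```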
##### `validate_dataset(locomotion_data) -> Dict`
Validate entire dataset against kinematic and kinetic expectations.
Parameters:
- `locomotion_data` (LocomotionData): Loaded dataset object
Returns:
- `dict`: Validation results with the following structure:

```python
{
    'total_steps': int,
    'valid_steps': int,
    'failed_steps': int,
    'kinematic_failures': List[Dict],
    'kinetic_failures': List[Dict],
    'tasks_validated': List[str],
    'task_step_counts': Dict[str, Dict]
}
```
Example:
```python
# Detailed validation workflow
validator = DatasetValidator('dataset_phase.parquet')
locomotion_data = validator.load_dataset()
results = validator.validate_dataset(locomotion_data)

print(f"Total steps: {results['total_steps']}")
print(f"Success rate: {results['valid_steps']/results['total_steps']:.1%}")
print(f"Kinematic failures: {len(results['kinematic_failures'])}")
print(f"Kinetic failures: {len(results['kinetic_failures'])}")
```
### StepClassifier

Low-level engine for validating individual steps against specification ranges.
```python
from lib.validation.step_classifier import StepClassifier

classifier = StepClassifier()
```
#### Key Methods
##### `validate_data_against_specs(data_array, task, step_task_mapping, validation_type)`
Validate step data against specification ranges.
Parameters:
- `data_array` (np.ndarray): 3D array of shape `(n_steps, 150, n_features)`
- `task` (str): Task name for validation
- `step_task_mapping` (Dict): Mapping from step index to task name
- `validation_type` (str): `'kinematic'` or `'kinetic'`
Returns:
- `List[Dict]`: List of validation failure dictionaries
Example:
```python
classifier = StepClassifier()

# Validate kinematic data
failures = classifier.validate_data_against_specs(
    data_array=kinematic_data_3d,
    task='level_walking',
    step_task_mapping={0: 'level_walking', 1: 'level_walking'},
    validation_type='kinematic'
)

for failure in failures:
    print(f"Step {failure['step']}: {failure['variable']} at phase {failure['phase']}%")
    print(f"  Value: {failure['value']:.3f}, "
          f"Expected: {failure['expected_min']:.3f}-{failure['expected_max']:.3f}")
```
##### `load_validation_ranges_from_specs(validation_type)`
Load validation ranges from specification files.
Parameters:
- `validation_type` (str): `'kinematic'` or `'kinetic'`
Returns:
- `Dict`: Validation ranges organized by task and variable
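Example (a sketch; the nested task-then-variable layout mirrors the range structure shown elsewhere on this page, but the exact keys depend on the specification files):

```python
classifier = StepClassifier()
kinematic_ranges = classifier.load_validation_ranges_from_specs('kinematic')

# Inspect the ranges loaded for one task
walking_ranges = kinematic_ranges.get('level_walking', {})
print(f"Variables with ranges: {list(walking_ranges.keys())}")
```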
## Validation Utilities
### ValidationExpectationsParser
Parse validation ranges from markdown specification files.
```python
from lib.validation.validation_expectations_parser import ValidationExpectationsParser

parser = ValidationExpectationsParser()
```
#### `parse_validation_file(file_path) -> Dict`

Parse validation expectations from a markdown file.
Parameters:
- `file_path` (str): Path to a validation specification markdown file
Returns:
- `Dict`: Parsed validation ranges
Example:
```python
parser = ValidationExpectationsParser()
kinematic_ranges = parser.parse_validation_file(
    'docs/standard_spec/validation_expectations_kinematic.md'
)

# Access specific task and variable ranges
walking_ranges = kinematic_ranges['level_walking']
knee_range = walking_ranges['knee_flexion_angle_contra_rad']
print(f"Knee flexion range at phase 0: {knee_range['phase_0']}")
```
### AutomatedFineTuning
Automatically optimize validation ranges based on dataset statistics.
```python
from lib.validation.automated_fine_tuning import AutomatedFineTuning

tuner = AutomatedFineTuning()
```
#### `tune_validation_ranges(dataset_path, validation_type, percentile_range)`

Generate optimized validation ranges from a dataset.
Parameters:
- `dataset_path` (str): Path to the reference dataset
- `validation_type` (str): `'kinematic'` or `'kinetic'`
- `percentile_range` (Tuple[float, float]): Percentile range for bounds (e.g., `(5, 95)`)
Returns:
- `Dict`: Optimized validation ranges
Example:
```python
tuner = AutomatedFineTuning()

# Generate ranges from a high-quality reference dataset
optimized_ranges = tuner.tune_validation_ranges(
    dataset_path='reference_dataset_phase.parquet',
    validation_type='kinematic',
    percentile_range=(2.5, 97.5)  # Conservative bounds
)

# Apply to existing specifications
tuner.apply_tuned_ranges(optimized_ranges, 'kinematic')
```
## Advanced Validation Patterns
### Custom Validation Pipeline
```python
from typing import Dict

from lib.validation.dataset_validator_phase import DatasetValidator
from lib.validation.step_classifier import StepClassifier


def custom_validation_pipeline(dataset_path: str) -> Dict:
    """Custom validation with specific requirements."""
    # Initialize components
    validator = DatasetValidator(dataset_path, generate_plots=False)
    classifier = StepClassifier()

    # Load dataset
    locomotion_data = validator.load_dataset()

    # Dataset-level summary and naming compliance
    results = {
        'dataset_info': {
            'subjects': len(locomotion_data.subjects),
            'tasks': len(locomotion_data.tasks),
            'features': len(locomotion_data.features)
        },
        'naming_compliance': locomotion_data.get_validation_report(),
        'quality_by_task': {}
    }

    # Task-specific validation
    for task in locomotion_data.tasks:
        task_results = {'subjects': {}}
        for subject in locomotion_data.subjects:
            # Get kinematic data
            data_3d, features = locomotion_data.get_cycles(
                subject, task, locomotion_data.ANGLE_FEATURES)
            if data_3d is None:
                continue

            # Validate each step, counting steps with at least one failure
            # (a step can produce several failure records, so counting records
            # directly would let the quality score drop below zero)
            failed_steps = 0
            for step_idx in range(data_3d.shape[0]):
                step_data = data_3d[step_idx:step_idx + 1, :, :]  # Keep 3D shape
                failures = classifier.validate_data_against_specs(
                    step_data, task, {0: task}, 'kinematic'
                )
                if failures:
                    failed_steps += 1

            task_results['subjects'][subject] = {
                'total_steps': data_3d.shape[0],
                'failed_steps': failed_steps,
                'quality_score': 1.0 - failed_steps / data_3d.shape[0]
            }
        results['quality_by_task'][task] = task_results

    return results


# Run custom validation
results = custom_validation_pipeline('my_dataset_phase.parquet')
```
### Batch Dataset Validation
```python
from pathlib import Path
from typing import Dict

from lib.validation.dataset_validator_phase import DatasetValidator


def validate_multiple_datasets(dataset_directory: str) -> Dict:
    """Validate all datasets in a directory."""
    dataset_paths = list(Path(dataset_directory).glob('*_phase.parquet'))
    validation_results = {}

    for dataset_path in dataset_paths:
        dataset_name = dataset_path.stem
        print(f"Validating {dataset_name}...")
        try:
            validator = DatasetValidator(str(dataset_path), generate_plots=False)
            locomotion_data = validator.load_dataset()
            results = validator.validate_dataset(locomotion_data)

            # Calculate overall quality metrics
            quality_score = (results['valid_steps'] / results['total_steps']
                             if results['total_steps'] > 0 else 0.0)

            validation_results[dataset_name] = {
                'status': 'SUCCESS',
                'quality_score': quality_score,
                'total_steps': results['total_steps'],
                'failure_count': (len(results['kinematic_failures'])
                                  + len(results['kinetic_failures'])),
                'tasks': results['tasks_validated']
            }
        except Exception as e:
            validation_results[dataset_name] = {
                'status': 'ERROR',
                'error': str(e),
                'quality_score': 0.0
            }

    return validation_results


# Validate all datasets
results = validate_multiple_datasets('./converted_datasets/')

# Generate summary
for dataset, result in results.items():
    status = result['status']
    quality = result.get('quality_score', 0)
    print(f"{dataset}: {status} (Quality: {quality:.1%})")
```
### Real-time Validation Monitoring
```python
from datetime import datetime
from typing import Dict

import numpy as np

from lib.validation.step_classifier import StepClassifier


class ValidationMonitor:
    """Real-time validation monitoring for data streams."""

    def __init__(self):
        self.classifier = StepClassifier()
        self.validation_history = []
        self.alert_threshold = 0.8  # Alert if quality drops below 80%

    def validate_incoming_step(self, step_data: np.ndarray, task: str) -> Dict:
        """Validate a single incoming step."""
        # Ensure step_data is 3D: (1, 150, n_features)
        if step_data.ndim == 2:
            step_data = step_data.reshape(1, 150, -1)

        # Validate step
        failures = self.classifier.validate_data_against_specs(
            step_data, task, {0: task}, 'kinematic'
        )

        # Score by failures per feature, clamped so repeated failures
        # on one feature cannot push the score below zero
        quality_score = max(0.0, 1.0 - len(failures) / step_data.shape[2])

        result = {
            'timestamp': datetime.now(),
            'task': task,
            'quality_score': quality_score,
            'failure_count': len(failures),
            'failures': failures,
            'alert': quality_score < self.alert_threshold
        }
        self.validation_history.append(result)

        # Trigger alert if needed
        if result['alert']:
            self._trigger_quality_alert(result)

        return result

    def _trigger_quality_alert(self, result: Dict):
        """Handle quality alerts."""
        print(f"🚨 QUALITY ALERT: {result['task']} quality at {result['quality_score']:.1%}")
        print(f"   Failures: {result['failure_count']}")

    def get_quality_trends(self, window_size: int = 10) -> Dict:
        """Get recent quality trends."""
        if len(self.validation_history) < window_size:
            return {'insufficient_data': True}

        recent_results = self.validation_history[-window_size:]
        qualities = [r['quality_score'] for r in recent_results]

        return {
            'current_quality': qualities[-1],
            'mean_quality': np.mean(qualities),
            'quality_trend': 'improving' if qualities[-1] > qualities[0] else 'declining',
            'alert_rate': sum(1 for r in recent_results if r['alert']) / len(recent_results)
        }


# Usage
monitor = ValidationMonitor()

# Simulate an incoming data stream
for i in range(20):
    # Generate sample step data (150 points, 6 features)
    step_data = np.random.randn(150, 6) * 0.1  # Small random values
    result = monitor.validate_incoming_step(step_data, 'level_walking')

    if i % 5 == 0:  # Check trends every 5 steps
        trends = monitor.get_quality_trends()
        if not trends.get('insufficient_data'):
            print(f"Quality trend: {trends['quality_trend']} "
                  f"(current: {trends['current_quality']:.1%})")
```
## Validation Configuration
### Custom Validation Ranges
```python
# Define custom validation ranges
custom_ranges = {
    'level_walking': {
        'knee_flexion_angle_contra_rad': {
            'phase_0': {'min': -0.1, 'max': 0.3},   # Heel strike
            'phase_25': {'min': 0.0, 'max': 0.8},   # Mid-stance
            'phase_50': {'min': 0.5, 'max': 1.2},   # Pre-swing / toe-off
            'phase_75': {'min': 0.2, 'max': 0.9}    # Mid-swing
        }
    }
}

# Apply custom ranges
classifier = StepClassifier()
classifier.kinematic_expectations = custom_ranges

# Use in validation
failures = classifier.validate_data_against_specs(
    data_array, 'level_walking', step_mapping, 'kinematic'
)
```
### Validation Report Customization
```python
from typing import Dict


def generate_custom_report(validation_results: Dict, output_path: str):
    """Generate a custom validation report in markdown."""
    with open(output_path, 'w') as f:
        f.write("# Custom Validation Report\n\n")

        # Executive summary
        total_steps = validation_results['total_steps']
        valid_steps = validation_results['valid_steps']
        success_rate = valid_steps / total_steps if total_steps > 0 else 0

        f.write(f"**Dataset Quality**: {success_rate:.1%}\n")
        f.write(f"**Total Steps**: {total_steps}\n")
        f.write(f"**Valid Steps**: {valid_steps}\n\n")

        # Task breakdown
        f.write("## Task Analysis\n\n")
        for task in validation_results['tasks_validated']:
            task_counts = validation_results['task_step_counts'].get(task, {})
            task_total = task_counts.get('total', 0)
            task_valid = task_counts.get('valid', 0)
            task_rate = task_valid / task_total if task_total > 0 else 0

            f.write(f"### {task.replace('_', ' ').title()}\n")
            f.write(f"- Success Rate: {task_rate:.1%}\n")
            f.write(f"- Total Steps: {task_total}\n")
            f.write(f"- Valid Steps: {task_valid}\n\n")

        # Failure analysis
        kinematic_failures = validation_results.get('kinematic_failures', [])
        kinetic_failures = validation_results.get('kinetic_failures', [])
        if kinematic_failures or kinetic_failures:
            f.write("## Failure Summary\n\n")

            # Group failures by variable
            failure_counts = {}
            for failure in kinematic_failures + kinetic_failures:
                var = failure['variable']
                failure_counts[var] = failure_counts.get(var, 0) + 1

            f.write("| Variable | Failure Count |\n")
            f.write("|----------|---------------|\n")
            for var, count in sorted(failure_counts.items(),
                                     key=lambda x: x[1], reverse=True):
                f.write(f"| {var} | {count} |\n")


# Usage
validator = DatasetValidator('dataset_phase.parquet')
locomotion_data = validator.load_dataset()
results = validator.validate_dataset(locomotion_data)
generate_custom_report(results, 'custom_validation_report.md')
```
## Error Handling and Debugging
```python
import logging
from typing import Dict

from lib.validation.dataset_validator_phase import DatasetValidator

# Configure validation logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('validation')


def robust_validation(dataset_path: str) -> Dict:
    """Validation with comprehensive error handling."""
    try:
        validator = DatasetValidator(dataset_path)

        # Load with error handling
        try:
            locomotion_data = validator.load_dataset()
            logger.info(f"Successfully loaded dataset: {dataset_path}")
        except ValueError as e:
            logger.error(f"Dataset loading failed: {e}")
            return {'status': 'LOAD_ERROR', 'error': str(e)}

        # Validate with error handling
        try:
            results = validator.validate_dataset(locomotion_data)
            logger.info(f"Validation completed: "
                        f"{results['valid_steps']}/{results['total_steps']} steps valid")
            return {'status': 'SUCCESS', 'results': results}
        except Exception as e:
            logger.error(f"Validation failed: {e}")
            return {'status': 'VALIDATION_ERROR', 'error': str(e)}

    except Exception as e:
        logger.critical(f"Unexpected error: {e}")
        return {'status': 'CRITICAL_ERROR', 'error': str(e)}


# Usage with error handling
result = robust_validation('problematic_dataset.parquet')

if result['status'] == 'SUCCESS':
    validation_results = result['results']
    # Process successful validation
elif result['status'] == 'LOAD_ERROR':
    print(f"Cannot load dataset: {result['error']}")
elif result['status'] == 'VALIDATION_ERROR':
    print(f"Validation failed: {result['error']}")
else:
    print(f"Critical error: {result['error']}")
```
## Performance Optimization
```python
import numpy as np

from lib.validation.dataset_validator_phase import DatasetValidator


# Efficient validation for large datasets using subject sampling
def optimized_large_dataset_validation(dataset_path: str, sample_rate: float = 0.1):
    """Validate large datasets efficiently by sampling subjects."""
    validator = DatasetValidator(dataset_path, generate_plots=False)
    locomotion_data = validator.load_dataset()

    # Sample subjects for faster validation
    total_subjects = len(locomotion_data.subjects)
    sample_size = max(1, int(total_subjects * sample_rate))
    sampled_subjects = np.random.choice(locomotion_data.subjects, sample_size, replace=False)

    print(f"Validating {sample_size}/{total_subjects} subjects ({sample_rate:.1%} sample)")

    # Validate each sampled subject/task combination
    sample_validation_results = {}
    for subject in sampled_subjects:
        for task in locomotion_data.tasks:
            data_3d, features = locomotion_data.get_cycles(subject, task)
            if data_3d is not None:
                valid_mask = locomotion_data.validate_cycles(subject, task)
                sample_validation_results[(subject, task)] = {
                    'quality_score': np.sum(valid_mask) / len(valid_mask)
                }

    # Estimate full-dataset quality from the sample
    quality_scores = [r['quality_score'] for r in sample_validation_results.values()]
    estimated_quality = np.mean(quality_scores) if quality_scores else 0.0

    return {
        'estimated_quality': estimated_quality,
        'sample_size': sample_size,
        'total_subjects': total_subjects,
        'sample_results': sample_validation_results
    }


# Usage for large datasets
result = optimized_large_dataset_validation('large_dataset_phase.parquet', sample_rate=0.2)
print(f"Estimated dataset quality: {result['estimated_quality']:.1%}")
```
## Next Steps
- Integration Examples: See Integration Guides
- LocomotionData API: Check LocomotionData API
- Developer Tools: Review Developer Workflows