Example: Gtech 2023 Dataset Conversion¶
A complete walkthrough of converting the Georgia Tech 2023 AddBiomechanics dataset from per-subject B3D files to the standardized parquet format.
Dataset Overview¶
- Source: AddBiomechanics processed data
- Format: Multiple B3D files per subject
- Subjects: 10+ (AB01-AB13)
- Tasks: Walking, running, stairs, sit-to-stand
- Challenge: Large files, complex structure
Step 1: Obtain the Data¶
The Gtech 2023 dataset comes from AddBiomechanics processing:
# Download structure
Gtech_2023/
├── AB01_data.b3d
├── AB02_data.b3d
├── ...
└── AB13_data.b3d
Each B3D file contains:
- Multiple trials per subject
- Mixed tasks and conditions
- Time-series biomechanical data
Step 2: Understand the Data Structure¶
B3D files have a hierarchical structure:
# B3D file structure
Subject_Data
├── trials
│   ├── trial_001
│   │   ├── markers      # 3D marker positions
│   │   ├── forces       # Ground reaction forces
│   │   ├── kinematics   # Joint angles
│   │   └── kinetics     # Joint moments
│   ├── trial_002
│   └── ...
└── metadata
    ├── subject_info
    └── trial_labels
Key challenges:
- Large file sizes (>1GB per subject)
- Mixed sampling frequencies
- Task labels that need parsing
- Memory-intensive processing
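Before writing a converter, it helps to print the actual hierarchy of one file and confirm the group names. A minimal inspection sketch, assuming the B3D file can be opened with h5py the same way the conversion script below does (group names may differ in your download):
import h5py

def print_structure(path):
    """Print every group and dataset in one subject file, with dataset shapes."""
    with h5py.File(path, 'r') as f:
        f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))

print_structure('AB01_data.b3d')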
Step 3: Conversion Script¶
The conversion uses Python with optimized chunking:
convert_gtech_all_to_parquet.py¶
#!/usr/bin/env python3
"""
Convert Gtech 2023 AddBiomechanics data to standardized format.
Handles large B3D files with memory-efficient processing.
"""
import numpy as np
import pandas as pd
from pathlib import Path
import h5py
from tqdm import tqdm
import gc
def convert_gtech_to_parquet():
    """
    Main conversion function with memory optimization.
    """
    # Get all B3D files
    b3d_files = sorted(Path('.').glob('AB*.b3d'))  # Sort for a deterministic subject order
    # Process each subject separately to manage memory
    all_subjects_data = []
    for b3d_file in tqdm(b3d_files, desc="Processing subjects"):
        subject_id = f"Gtech_2023_{b3d_file.stem}"
        # Process with chunking
        subject_data = process_subject_chunked(b3d_file, subject_id)
        # Save intermediate file to disk
        temp_file = f"temp_{subject_id}.parquet"
        subject_data.to_parquet(temp_file)
        all_subjects_data.append(temp_file)
        # Clear memory
        del subject_data
        gc.collect()
    # Combine all subjects
    combine_subjects_efficient(all_subjects_data)
def process_subject_chunked(b3d_file, subject_id):
    """
    Process a single subject with memory-efficient chunking.
    """
    with h5py.File(b3d_file, 'r') as f:
        trials = f['trials']
        processed_trials = []
        for trial_name in trials.keys():
            trial_data = trials[trial_name]
            # Parse task from trial name
            task = parse_task_name(trial_name)
            if task is None:
                continue  # Skip non-standard tasks
            # Extract kinematics (process in chunks)
            kinematics = process_kinematics_chunked(trial_data['kinematics'])
            # Add metadata
            kinematics['subject'] = subject_id
            kinematics['task'] = task
            kinematics['trial_id'] = trial_name
            processed_trials.append(kinematics)
    return pd.concat(processed_trials, ignore_index=True)
def process_kinematics_chunked(kinematics_group, chunk_size=10000):
    """
    Process kinematics data in chunks to manage memory.
    """
    # Get data dimensions
    n_frames = kinematics_group['hip_flexion'].shape[0]
    chunks = []
    for start_idx in range(0, n_frames, chunk_size):
        end_idx = min(start_idx + chunk_size, n_frames)
        chunk_data = {
            'time': np.arange(start_idx, end_idx) / 100.0,  # 100 Hz sampling
            'knee_flexion_angle_ipsi_rad': 
                kinematics_group['knee_flexion'][start_idx:end_idx, 0],
            'knee_flexion_angle_contra_rad': 
                kinematics_group['knee_flexion'][start_idx:end_idx, 1],
            'hip_flexion_angle_ipsi_rad': 
                kinematics_group['hip_flexion'][start_idx:end_idx, 0],
            'hip_flexion_angle_contra_rad': 
                kinematics_group['hip_flexion'][start_idx:end_idx, 1],
        }
        chunks.append(pd.DataFrame(chunk_data))
    return pd.concat(chunks, ignore_index=True)
def parse_task_name(trial_name):
    """
    Parse standardized task name from trial label.
    """
    trial_lower = trial_name.lower()
    # Task mapping based on trial names
    if 'walk' in trial_lower and 'incline' not in trial_lower:
        return 'level_walking'
    elif 'incline' in trial_lower:
        return 'incline_walking'
    elif 'decline' in trial_lower:
        return 'decline_walking'
    elif 'run' in trial_lower:
        return 'run'
    elif 'stair' in trial_lower and 'up' in trial_lower:
        return 'up_stairs'
    elif 'stair' in trial_lower and 'down' in trial_lower:
        return 'down_stairs'
    elif 'sit' in trial_lower and 'stand' in trial_lower:
        return 'sit_to_stand'
    elif 'squat' in trial_lower:
        return 'squats'
    else:
        return None  # Unknown task
def combine_subjects_efficient(temp_files):
    """
    Combine subject files efficiently using chunked reading.
    """
    # Read and concatenate in chunks
    chunk_size = 50000
    output_file = '../../converted_datasets/gtech_2023_time.parquet'
    first_file = True
    for temp_file in tqdm(temp_files, desc="Combining subjects"):
        df = pd.read_parquet(temp_file)
        if first_file:
            df.to_parquet(output_file, index=False)
            first_file = False
        else:
            # Re-read the existing output and rewrite it with this subject appended
            existing = pd.read_parquet(output_file)
            combined = pd.concat([existing, df], ignore_index=True)
            combined.to_parquet(output_file, index=False)
        # Clean up temp file
        Path(temp_file).unlink()
if __name__ == "__main__":
    convert_gtech_to_parquet()
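The append loop in combine_subjects_efficient re-reads the growing output file on every pass, which is simple but gets slower and more memory-hungry as subjects accumulate. A sketch of an incremental alternative using pyarrow's ParquetWriter, assuming every per-subject file shares the same columns:
import pyarrow.parquet as pq
from pathlib import Path

def combine_subjects_pyarrow(temp_files, output_file):
    """Append each subject's table as new row groups without re-reading the output."""
    writer = None
    for temp_file in temp_files:
        table = pq.read_table(temp_file)  # Assumes identical schema across subjects
        if writer is None:
            writer = pq.ParquetWriter(output_file, table.schema)
        writer.write_table(table)
        Path(temp_file).unlink()  # Clean up temp file
    if writer is not None:
        writer.close()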
Step 4: Convert to Phase-Indexed¶
Since the Gtech data is time-indexed, convert to phase:
# Convert time to phase (150 points per cycle)
python conversion_generate_phase_dataset.py \
    converted_datasets/gtech_2023_time.parquet
# Creates: gtech_2023_phase.parquet
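A quick sanity check on the result, assuming the phase file lands next to the time file, keeps the subject and task columns written above, and stores 150 samples per gait cycle as reported by the validator:
import pandas as pd

df = pd.read_parquet('converted_datasets/gtech_2023_phase.parquet')
counts = df.groupby(['subject', 'task']).size()
# Every (subject, task) group should contain whole cycles of 150 points
assert (counts % 150 == 0).all(), "Found partial cycles"
print(df['task'].value_counts())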
Step 5: Handle Memory Issues¶
For large datasets, use these optimization strategies:
Strategy 1: Process subjects individually¶
# Instead of loading all at once
for subject_file in subject_files:
    process_and_save_individually(subject_file)
# Then combine
combine_saved_files()
Strategy 2: Use Dask for parallel processing¶
import dask.dataframe as dd
# Read parquet in parallel
ddf = dd.read_parquet('gtech_2023_time.parquet')
# Process in parallel
result = ddf.groupby(['subject', 'task']).apply(
    process_function, meta=output_schema
)
# Save
result.to_parquet('gtech_2023_processed.parquet')
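Here `meta` stands in for an empty DataFrame (or column-to-dtype mapping) describing what `process_function` returns; Dask uses it to infer the output schema without eagerly executing the function.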
Strategy 3: Stream processing¶
import pyarrow as pa
import pyarrow.parquet as pq

def stream_process_large_file(input_file, output_file):
    """
    Process a parquet file in streaming fashion by iterating over row-group batches.
    """
    writer = None
    for batch in pq.ParquetFile(input_file).iter_batches(batch_size=10000):
        processed = process_chunk(batch.to_pandas())
        table = pa.Table.from_pandas(processed, preserve_index=False)
        if writer is None:
            # The first batch defines the output schema
            writer = pq.ParquetWriter(output_file, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
Step 6: Validate¶
# Validate the phase-indexed dataset
python contributor_tools/create_dataset_validation_report.py \
    --dataset converted_datasets/gtech_2023_phase.parquet
# Expected output
Validation Report: gtech_2023_phase
====================================
Overall Status: PASSED ✓ (91.2%)
Phase Structure: Valid (150 points per cycle)
Tasks Validated: 8/8
Minor violations in stair tasks (expected for this dataset).
Key Lessons from This Example¶
Challenges Faced¶
- Large file sizes: B3D files >1GB each
- Memory limitations: Can't load all subjects at once
- Complex structure: Nested HDF5/B3D format
- Task parsing: Trial names need interpretation
Solutions Applied¶
- Chunked processing: Process data in manageable chunks
- Temporary files: Save intermediate results to disk
- Garbage collection: Explicitly free memory
- Efficient combining: Append mode for final merge
Code Patterns to Reuse¶
Memory-efficient loading:
with h5py.File(large_file, 'r') as f:
    # Process without loading everything into memory
    for key in f.keys():
        process_subset(f[key])
Chunked processing:
for start in range(0, total_size, chunk_size):
    end = min(start + chunk_size, total_size)
    chunk = data[start:end]
    process_chunk(chunk)
Task name parsing:
task_patterns = {
    'walk': 'level_walking',
    'stair.*up': 'up_stairs',
    'stair.*down': 'down_stairs',
}
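The pattern dictionary only becomes useful once it is applied with regular expressions; a minimal sketch of that lookup (the dictionary contents and their precedence here are illustrative, not the exact mapping used above):
import re

task_patterns = {
    'stair.*up': 'up_stairs',      # More specific patterns first
    'stair.*down': 'down_stairs',
    'walk': 'level_walking',
}

def match_task(trial_name):
    """Return the first task whose pattern matches the lowercased trial name."""
    name = trial_name.lower()
    for pattern, task in task_patterns.items():
        if re.search(pattern, name):
            return task
    return None  # Unknown task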
Performance Metrics¶
- Processing time: ~45 minutes for 13 subjects
- Memory usage: Peak 8GB (with chunking)
- Output size: 2.1GB (time), 1.8GB (phase)
- Validation pass rate: 91.2%
Common Issues & Solutions¶
Issue: Memory overflow¶
# Solution: Reduce chunk size
chunk_size = 5000  # Instead of 10000
Issue: Slow processing¶
# Solution: Use parallel processing
from multiprocessing import Pool
with Pool(4) as p:
    results = p.map(process_subject, subject_files)
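Keep the Pool call under an if __name__ == "__main__": guard so worker processes can import the module cleanly, and expect peak memory to scale with the number of workers.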
Issue: Incomplete gait cycles¶
# Solution: Use phase conversion tool
python conversion_generate_phase_dataset.py gtech_2023_time.parquet
# Automatically handles cycle detection
Summary¶
The Gtech 2023 conversion demonstrates:
- ✅ Handling large, complex datasets
- ✅ Memory-efficient processing strategies
- ✅ Converting time to phase indexing
- ✅ Robust task name parsing
- ✅ Achieving >90% validation pass rate
This example provides patterns for converting large-scale biomechanical datasets with memory constraints.