Accuracy & Confidence
Most AI coaching apps publish marketing claims. We publish numbers. This page documents our pose estimation accuracy rates, form score confidence intervals, scoring weight breakdowns, and validation methodology — the same data we use internally to evaluate model performance.
These figures represent performance on our held-out test set, not training data. All accuracy measurements use standard computer vision benchmarks (PCKh@0.5 for pose estimation). Repeatability and inter-rater agreement are measured independently.
Validation datasets include proprietary sport-specific footage collected under both controlled and real-world conditions.
| Metric | Value | Measurement Method |
|---|---|---|
| Landmark Detection Accuracy (avg) | 94.4% | PCKh@0.5 on held-out test set (n=12,400 frames) |
| Form Score Repeatability | ±3.0 pts | Same video analyzed 10× across 500 clips; SD of scores |
| Inter-Rater Agreement (AI vs. expert coach) | 87.3% | Cohen's κ = 0.81 across 1,200 scored movements |
| False Positive Rate (injury risk flags) | 6.2% | Validated against certified PT assessments (n=340) |
| False Negative Rate (injury risk flags) | 8.9% | Validated against certified PT assessments (n=340) |
| Latency (live analysis mode) | 41ms avg | iPhone 14 Pro, 30fps, measured over 10,000 frames |
| Minimum Recommended Resolution | 720p (1280×720) | Below this threshold, wrist/ankle accuracy drops >8% |
| Minimum Recommended Frame Rate | 30fps (60fps for fast sports) | Boxing, tennis serve, golf swing require 60fps for full accuracy |
| Model Parameters | ~6.2M | Optimized MobileNet-based architecture for on-device inference |
| Training Dataset Size | ~480K annotated frames | Combination of public datasets (COCO, MPII) and proprietary sport-specific data |
Accuracy varies by exercise due to differences in movement speed, joint occlusion, and body position complexity. Fast movements (boxing, tennis serve) require higher frame rates for full accuracy. The following table reports landmark detection accuracy (PCKh@0.5) and form score variance for each exercise category.
| Exercise / Movement | Min. FPS | Landmark Accuracy | Score Variance | Joints Tracked |
|---|---|---|---|---|
| Bench Press | 30 | 94.2% | ±2.8 pts | 17 |
| Back Squat | 30 | 96.1% | ±2.1 pts | 19 |
| Deadlift | 30 | 95.4% | ±2.4 pts | 18 |
| Golf Swing | 60 | 93.7% | ±3.1 pts | 21 |
| Tennis Serve | 60 | 92.9% | ±3.4 pts | 22 |
| Basketball Free Throw | 30 | 95.8% | ±2.2 pts | 16 |
| Push-Up | 30 | 97.3% | ±1.6 pts | 14 |
| Overhead Press | 30 | 94.9% | ±2.5 pts | 18 |
| Pull-Up | 30 | 93.1% | ±3.0 pts | 15 |
| Boxing Jab | 60 | 91.8% | ±3.7 pts | 20 |
| Running Gait | 60 | 94.6% | ±2.6 pts | 23 |
| Pickleball Dink | 60 | 92.4% | ±3.2 pts | 19 |
PCKh@0.5: Percentage of Correct Keypoints, where a predicted keypoint counts as correct if it lies within 50% of the head segment length of its ground-truth position. It is an industry-standard metric for human pose estimation benchmarking.
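As a rough illustration of the metric defined above, the following minimal Python sketch computes PCKh@0.5 for a single frame. The function name `pckh`, the pixel coordinates, and the 40 px head segment length are illustrative assumptions, not part of the SportsReflector pipeline.

```python
import math

def pckh(predicted, ground_truth, head_length, alpha=0.5):
    """Fraction of keypoints predicted within alpha * head_length of ground truth.

    Hypothetical helper for illustration only.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one keypoint is required")
    threshold = alpha * head_length
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(predicted)

# Toy frame (made-up pixel coordinates): head segment is 40 px, so the
# correctness threshold is 20 px.
truth = [(100, 100), (150, 200), (200, 300), (250, 400)]
pred = [(103, 104), (150, 215), (230, 300), (250, 401)]
score = pckh(pred, truth, head_length=40)
print(f"PCKh@0.5 = {score:.2f}")  # 3 of 4 keypoints fall within 20 px
```

A dataset-level figure would pool correct keypoints across all annotated frames rather than averaging per-frame scores; which of the two the published numbers use is not stated here.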
Each sport's technique score is a weighted composite of multiple biomechanical parameters. Weights are derived from a combination of coaching literature, biomechanics research, and empirical correlation with expert coach ratings. The following breakdowns show exactly what contributes to each sport's score.
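To make the idea of a weighted composite concrete, here is a minimal Python sketch. The parameter names (`depth`, `knee_tracking`, `torso_angle`), their weights, and the 0–100 sub-scores are placeholders invented for illustration; they are not SportsReflector's actual scoring weights.

```python
def composite_score(parameter_scores, weights):
    """Weighted average of per-parameter scores, normalized by total weight."""
    if set(parameter_scores) != set(weights):
        raise ValueError("scores and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(parameter_scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight

# Hypothetical squat-style breakdown (placeholder names and values).
example_weights = {"depth": 0.40, "knee_tracking": 0.35, "torso_angle": 0.25}
example_scores = {"depth": 90, "knee_tracking": 80, "torso_angle": 70}
technique = composite_score(example_scores, example_weights)
print(f"Technique score: {technique:.1f}")
```

Normalizing by the total weight keeps the result on the same scale as the inputs even if a breakdown's weights do not sum exactly to 1.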
We document limitations openly. Understanding where the model performs below average helps athletes get the most accurate results.
Accuracy drops approximately 8–12% in environments below 200 lux. Outdoor night training and poorly lit gyms are the primary affected contexts.
Clothing that obscures joint landmarks (e.g., very baggy shorts covering the knee) reduces knee and hip tracking accuracy by up to 9%.
Angles beyond 45° from the sagittal or frontal plane reduce accuracy for the obscured side. Overhead and worm's-eye views are not supported.
The model analyzes the primary subject (largest bounding box). Accuracy degrades if a second person occupies more than 25% of the frame.
Movements exceeding ~4 m/s limb velocity (e.g., a fastball pitch) require 120fps+ for full accuracy. At 60fps, score variance increases to ±5–6 points.
Injury risk assessment identifies movement patterns associated with elevated risk. It is not a medical device and does not diagnose injuries or medical conditions.
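The capture-related thresholds documented on this page (720p minimum resolution, 200 lux, 25% second-person frame occupancy, and the 30/60/120 fps tiers) can be sketched as a simple pre-recording check. The `CaptureConditions` class, its field names, and the `capture_warnings` function are hypothetical and shown only to illustrate how the thresholds combine; they do not describe the app's actual behavior.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CaptureConditions:
    """Hypothetical description of a recording setup."""
    width: int
    height: int
    fps: float
    lux: float
    peak_limb_speed_mps: float = 0.0
    fast_sport: bool = False
    second_person_fraction: float = 0.0

def capture_warnings(c: CaptureConditions) -> List[str]:
    """Return warnings for conditions below the documented thresholds."""
    warnings = []
    # Orientation-agnostic 720p check (1280x720 or 720x1280).
    if min(c.width, c.height) < 720 or max(c.width, c.height) < 1280:
        warnings.append("resolution below 720p")
    if c.lux < 200:
        warnings.append("lighting below 200 lux")
    if c.second_person_fraction > 0.25:
        warnings.append("second person occupies more than 25% of the frame")
    # Frame-rate tiers: 120 fps above ~4 m/s, 60 fps for fast sports, else 30.
    if c.peak_limb_speed_mps > 4.0:
        required_fps = 120
    elif c.fast_sport:
        required_fps = 60
    else:
        required_fps = 30
    if c.fps < required_fps:
        warnings.append(f"frame rate below {required_fps} fps")
    return warnings

dim_fast_clip = CaptureConditions(width=1280, height=720, fps=60, lux=150,
                                  peak_limb_speed_mps=5.0)
print(capture_warnings(dim_fast_clip))
```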
Pose Estimation Validation: Landmark accuracy is measured using the PCKh@0.5 metric on a held-out test set of 12,400 annotated frames across 22 sports and exercise categories. Ground truth annotations were produced by two independent annotators with disagreements resolved by a third. The test set was not used during model training or hyperparameter tuning.
Form Score Repeatability: Each of 500 clips, spanning all supported movements, was analyzed 10 times. The reported figure (±3.0 points on average) is the standard deviation of scores across repeated analyses of identical input. This measures model determinism, not accuracy.
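A minimal sketch of the repeatability calculation, assuming the per-clip standard deviation is computed first and then averaged across clips. Whether the sample or population standard deviation is used is not stated above; this sketch uses the sample form (`statistics.stdev`), and the scores are made up.

```python
import statistics

def repeatability(runs_per_clip):
    """Mean of per-clip standard deviations across repeated analyses.

    runs_per_clip: list of lists, one inner list of scores per clip.
    Assumption: sample standard deviation, averaged over clips.
    """
    per_clip_sd = [statistics.stdev(scores) for scores in runs_per_clip]
    return statistics.mean(per_clip_sd)

# Two toy clips, four repeated analyses each (illustrative values).
clips = [
    [82, 84, 80, 82],
    [70, 70, 70, 70],  # identical scores on every run: SD of 0
]
print(f"Repeatability: ±{repeatability(clips):.2f} pts")
```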
Inter-Rater Agreement: 1,200 movements were independently scored by the SportsReflector AI and by certified coaches (NSCA-CSCS, USPTA, PGA-certified instructors). Agreement was calculated using Cohen's κ (weighted), yielding κ = 0.81 — classified as "almost perfect agreement" under the Landis & Koch scale. Disagreements were most common in borderline cases (scores within 5 points of a threshold).
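For readers unfamiliar with weighted Cohen's κ, the following sketch computes it with linear disagreement weights over ordinal score bands. The choice of linear (rather than quadratic) weights, the three-band binning, and the sample ratings are assumptions for illustration; the weighting scheme used in the study is not specified above.

```python
def weighted_kappa(rater_a, rater_b, num_categories):
    """Linear-weighted Cohen's kappa for ordinal categories 0..num_categories-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty set of items")
    k = num_categories
    if k < 2:
        raise ValueError("at least two categories are required")
    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted disagreement: observed vs. expected under independence.
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1 - disagree_obs / disagree_exp

# Toy example: score bands 0 = poor, 1 = fair, 2 = good (hypothetical binning).
ai_bands = [0, 1, 2, 2]
coach_bands = [0, 1, 2, 1]
kappa = weighted_kappa(ai_bands, coach_bands, num_categories=3)
print(f"weighted kappa = {kappa:.3f}")
```

Weighting penalizes a two-band disagreement more heavily than a one-band disagreement, which suits ordinal scores better than unweighted κ.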
Injury Risk Validation: Injury risk flags were validated against assessments by certified physical therapists (DPT) on a dataset of 340 movement samples. False positive rate (flagged as risk when PT assessed as safe) was 6.2%. False negative rate (not flagged when PT identified a risk pattern) was 8.9%. These figures are consistent with published accuracy rates for clinical movement screening tools.
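The two error rates can be illustrated with a short sketch that treats the physical therapist's assessment as ground truth. It assumes the conventional denominators (false positives over all PT-safe samples, false negatives over all PT-risk samples), which the text above does not state explicitly. The function name and sample data are illustrative.

```python
def flag_error_rates(ai_flags, pt_flags):
    """Return (false_positive_rate, false_negative_rate) with PT as reference.

    Assumption: FPR = FP / (FP + TN), FNR = FN / (FN + TP).
    """
    if len(ai_flags) != len(pt_flags):
        raise ValueError("ai_flags and pt_flags must have the same length")
    tp = fp = tn = fn = 0
    for ai, pt in zip(ai_flags, pt_flags):
        if ai and pt:
            tp += 1
        elif ai and not pt:
            fp += 1
        elif not ai and pt:
            fn += 1
        else:
            tn += 1
    if fp + tn == 0 or fn + tp == 0:
        raise ValueError("need at least one PT-safe and one PT-risk sample")
    return fp / (fp + tn), fn / (fn + tp)

# Eight toy samples (True = risk flagged / risk identified).
ai = [True, True, False, False, False, True, False, False]
pt = [True, False, False, False, True, True, False, False]
fpr, fnr = flag_error_rates(ai, pt)
print(f"FPR = {fpr:.1%}, FNR = {fnr:.1%}")
```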
Researchers, journalists, and developers are welcome to cite this accuracy data. Please reference the source URL and date accessed.
SportsReflector. (2026). AI Accuracy & Confidence Intervals — How We Score Athletic Technique. https://sportsreflector.com/how-we-score. Accessed March 2026.