Performance Benchmarks¶
Measured on RTX 5090 (32 GB VRAM), replaying an iPhone ARKit recording (162 frames, 458s indoor scene) at original rate. Both backends process the same deterministic input through the identical 10-stage pipeline.
Benchmark script: `scripts/benchmark_backends.py`
Backend Comparison¶
| Metric | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Mean latency | 210 ms | 510 ms |
| P50 (median) | 170 ms | 502 ms |
| P95 | 509 ms | 721 ms |
| Max | 539 ms | 830 ms |
| Masks/frame | 28.8 | 13.4 |
| Objects confirmed | 60 | 35 |
| Confirmation rate | 52.2% | 45.5% |
| License | AGPL-3.0 | Apache-2.0 |
The dual backend is 2.4x faster, largely because its two CNN-based models are cheaper to run than the transformer-based GDINO + SAM2 stack, and it confirms 71% more objects because it operates prompt-free (1200+ LVIS categories versus a 30-class text prompt).
Per-Stage Breakdown¶
Mean Latency (ms)¶
| Stage | dual | grounded_sam2 | Notes |
|---|---|---|---|
| Segmentation | 116 (55%) | 222 (44%) | Model inference — CNN vs transformer |
| Heuristics | 60 (29%) | 239 (47%) | Depth, border, planarity filtering |
| Scoring | 0.2 (<1%) | 0.2 (<1%) | Priority ranking |
| CLIP encode | 23 (11%) | 24 (5%) | ViT-B/32, top-15 candidates |
| Association | 6 (3%) | 4 (1%) | Proximity + cosine matching |
| Total | 210 | 510 | |
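Per-stage numbers like these can be collected with a small timing accumulator around each pipeline stage. The sketch below is illustrative (the class and stage names are not the pipeline's actual API) and uses `time.perf_counter` for millisecond wall-clock latency:

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates wall-clock latency samples per pipeline stage."""
    def __init__(self):
        self.samples = defaultdict(list)

    def time(self, stage):
        timer = self

        class _Ctx:
            def __enter__(self):
                self.t0 = time.perf_counter()

            def __exit__(self, *exc):
                timer.samples[stage].append((time.perf_counter() - self.t0) * 1000.0)

        return _Ctx()

    def summary(self, stage):
        xs = sorted(self.samples[stage])
        return {
            "mean_ms": sum(xs) / len(xs),
            "p50_ms": xs[len(xs) // 2],
            "p95_ms": xs[min(int(len(xs) * 0.95), len(xs) - 1)],
            "max_ms": xs[-1],
        }

timer = StageTimer()
with timer.time("segmentation"):
    time.sleep(0.01)  # stand-in for model inference
print(timer.summary("segmentation"))
```

Reporting p50/p95 alongside the mean, as the tables here do, matters because segmentation latency is heavy-tailed (compare the dual backend's 170 ms median to its 509 ms P95).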
P95 Latency (ms)¶
| Stage | dual | grounded_sam2 |
|---|---|---|
| Segmentation | 394 | 433 |
| Heuristics | 85 | 367 |
| Scoring | 0.4 | 0.3 |
| CLIP encode | 30 | 29 |
| Association | 8 | 7 |
| Total | 509 | 721 |
**Heuristics bottleneck**
For grounded_sam2, the heuristics stage (239 ms) is the largest cost — not the model itself. SAM2 produces higher-fidelity masks that are more expensive to filter through depth validation and planarity checks. Optimizing heuristics for SAM2 mask characteristics is the fastest path to reducing total latency.
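The three filters named in the breakdown (depth, border, planarity) can be sketched roughly as below. This is a simplified stand-in, not the pipeline's implementation: all thresholds are invented, and the planarity check here uses depth spread as a cheap proxy where a real check would likely fit a plane to the back-projected points.

```python
import numpy as np

def passes_heuristics(mask, depth, border_margin=4, min_valid_depth=0.6):
    """Hypothetical mask filter mirroring the three checks described above.

    mask:  (H, W) boolean segmentation mask
    depth: (H, W) metric depth in meters (0 where invalid)
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return False

    # Border filter: reject masks clipped by the frame edge.
    if ys.min() < border_margin or xs.min() < border_margin \
            or ys.max() >= h - border_margin or xs.max() >= w - border_margin:
        return False

    # Depth validation: require mostly valid depth inside the mask.
    d = depth[mask]
    valid = d > 0
    if valid.mean() < min_valid_depth:
        return False

    # Planarity proxy: near-zero depth relief suggests a wall/floor patch.
    dv = d[valid]
    if dv.size and (np.percentile(dv, 95) - np.percentile(dv, 5)) < 0.01:
        return False

    return True
```

Each check touches every mask pixel, which is why larger, higher-fidelity SAM2 masks make this stage disproportionately expensive.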
Segmentation Analysis¶
Mask Generation¶
| Metric | dual | grounded_sam2 |
|---|---|---|
| Raw masks/frame | 28.8 | 13.4 |
| Top-K candidates (CLIP input) | 13.6 | 9.0 |
| Overall survival (candidate/raw) | 47.0% | 66.9% |
The dual backend generates more masks but with lower individual survival rate. Grounded SAM2 produces fewer but higher-quality masks that pass heuristics more reliably.
Dual Confirmation Breakdown¶
In dual mode, FastSAM (class-agnostic) and YOLOE (open-vocab) masks are cross-validated via IoU (threshold 0.40):
| Source | % of all masks | % selected into top-K |
|---|---|---|
| Dual-confirmed (both agree) | 22.4% | 32.8% |
| FastSAM-only | 61.4% | 58.1% |
| YOLOE-only | 16.1% | 9.1% |
Dual-confirmed masks are preferentially promoted into top-K (1.5x selection rate vs presence rate), indicating the pipeline correctly prioritizes cross-validated detections.
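The cross-validation step can be sketched as a greedy IoU match between the two mask sets. Function and label names here are illustrative; only the 0.40 threshold comes from the configuration above:

```python
import numpy as np

IOU_THRESHOLD = 0.40  # cross-validation threshold from the dual backend

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def cross_validate(fastsam_masks, yoloe_masks, thr=IOU_THRESHOLD):
    """Label each FastSAM mask 'dual-confirmed' when its best-overlapping
    YOLOE mask clears the IoU threshold; leftovers keep their source label."""
    results, matched = [], set()
    for fm in fastsam_masks:
        best_j, best_iou = -1, 0.0
        for j, ym in enumerate(yoloe_masks):
            iou = mask_iou(fm, ym)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_iou >= thr:
            matched.add(best_j)
            results.append(("dual-confirmed", fm))
        else:
            results.append(("fastsam-only", fm))
    for j, ym in enumerate(yoloe_masks):
        if j not in matched:
            results.append(("yoloe-only", ym))
    return results
```

Because FastSAM proposes far more masks than YOLOE (61.4% vs 16.1% single-source share), most masks land in the `fastsam-only` bucket, while the agreement set earns the boosted top-K selection rate.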
Object Discovery¶
Final Working Memory State¶
| Metric | dual | grounded_sam2 |
|---|---|---|
| Total objects | 115 | 77 |
| Confirmed | 60 | 35 |
| Proto (unconfirmed) | 55 | 42 |
| Avg hits/object | 3.0 | 2.8 |
| Confirmation rate | 52.2% | 45.5% |
The dual backend discovers 49% more total objects and 71% more confirmed objects from the same input, driven by higher mask-per-frame count and prompt-free vocabulary coverage.
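The proto/confirmed split follows from the promotion rule in the test configuration (promote at 2 hits, expire unconfirmed protos after a 10 s TTL). A minimal sketch of that working-memory update, with invented structure around the two configured values:

```python
from dataclasses import dataclass

PROMOTE_HITS = 2    # hits required to confirm (from Test Configuration)
PROTO_TTL_S = 10.0  # proto objects expire if unseen this long

@dataclass
class TrackedObject:
    hits: int = 1
    last_seen: float = 0.0
    confirmed: bool = False

def update(objects, associated_ids, now):
    """Hypothetical update: bump hit counts for re-observed objects,
    promote protos at PROMOTE_HITS, and expire stale protos."""
    for oid in associated_ids:
        obj = objects[oid]
        obj.hits += 1
        obj.last_seen = now
        if not obj.confirmed and obj.hits >= PROMOTE_HITS:
            obj.confirmed = True
    # Drop proto objects that outlived their TTL without a re-observation.
    for oid in list(objects):
        obj = objects[oid]
        if not obj.confirmed and now - obj.last_seen > PROTO_TTL_S:
            del objects[oid]
    return objects
```

Under this rule, the ~52% confirmation rate means roughly half of all discovered objects were re-associated at least once before their TTL expired.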
Throughput¶
| Metric | dual | grounded_sam2 |
|---|---|---|
| Frames processed | 38 | 38 |
| Processing Hz | 1.13 | 1.13 |
| Mean queue depth | 0.1 | 0.3 |
| Warmup frames skipped | 5 | 5 |
Both backends process the same 38 frames (out of 162 total) — the remainder are filtered by the ingest gate (keyframe gating, non-keyframe throttle, sweep policy). Processing Hz is identical because the replay feeds frames at real-time rate with natural gaps between keyframes.
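The keyframe-gating plus non-keyframe-throttle behavior can be sketched as below. This is a simplified model of the ingest gate (the throttle period is an invented parameter, and the sweep policy is omitted):

```python
def ingest_gate(frames, non_key_period=4.0):
    """Illustrative ingest gate: always pass keyframes, and throttle
    non-keyframes to at most one per `non_key_period` seconds.

    frames: iterable of (timestamp_s, is_keyframe) tuples.
    """
    accepted = []
    last_non_key = float("-inf")
    for t, is_keyframe in frames:
        if is_keyframe:
            accepted.append(t)
        elif t - last_non_key >= non_key_period:
            accepted.append(t)
            last_non_key = t
    return accepted
```

Because the gate, not the backend, decides which frames enter the pipeline, both backends see an identical 38-frame workload and report the same effective processing rate.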
Architecture Trade-offs¶
| Dimension | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Architecture | CNN (YOLOv8 backbone) | Transformer (Swin + Hiera ViT) |
| License | AGPL-3.0 | Apache-2.0 |
| Vocabulary | 1200+ LVIS (prompt-free) | Configurable text prompt |
| Mask quality | Two-model IoU consensus | SAM2 high-quality masks |
| Latency | 210 ms mean | 510 ms mean |
| Detection strategy | Class-agnostic + open-vocab merge | Text-prompted detection |
| Best for | Research, prototyping, max recall | Commercial deployment, permissive licensing |
Test Configuration¶
| Parameter | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32 GB) |
| Python | 3.12 |
| Recording | recordings/session1 (iPhone ARKit, 162 frames, 458.9s) |
| RGB resolution | 640x480 |
| Mask resolution | 640x640 (retina_masks: false) |
| CLIP model | ViT-B/32 (OpenAI) |
| Top-K pre-CLIP | 15 |
| Association cos_min | 0.90 |
| Promote hits | 2 |
| Proto TTL | 10.0s |
| Replay rate | Real-time (original recording rate) |
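The association stage's "proximity + cosine matching" with the configured `cos_min` of 0.90 can be sketched as follows. Only the 0.90 threshold is from the configuration; the function shape, the 0.5 m proximity radius, and the gating order are assumptions for illustration:

```python
import numpy as np

COS_MIN = 0.90  # association threshold from the configuration above

def associate(embedding, position, objects, max_dist_m=0.5):
    """Hypothetical association step: match a new detection to the stored
    object whose CLIP-embedding cosine similarity clears COS_MIN and whose
    3D position lies within max_dist_m (an assumed radius)."""
    e = embedding / np.linalg.norm(embedding)
    best_id, best_cos = None, COS_MIN
    for oid, (obj_emb, obj_pos) in objects.items():
        if np.linalg.norm(position - obj_pos) > max_dist_m:
            continue  # cheap proximity gate before the dot product
        cos = float(e @ (obj_emb / np.linalg.norm(obj_emb)))
        if cos >= best_cos:
            best_id, best_cos = oid, cos
    return best_id  # None -> spawn a new proto object
```

Gating on proximity first keeps the stage cheap (4-6 ms mean in the tables above), since most stored objects are rejected before any embedding comparison.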
Reproducing¶
```shell
# Run the benchmark (requires both backend dependencies)
pip install "rtsm[all]" --extra-index-url https://download.pytorch.org/whl/cu128
python scripts/benchmark_backends.py
```
Results are saved to:
- `reports/backend_comparison.md` — formatted comparison report
- `reports/raw_dual.json` — raw metrics (dual backend)
- `reports/raw_grounded_sam2.json` — raw metrics (grounded_sam2 backend)
Accuracy Evaluation¶
Coming soon
Precision, recall, and spatial position accuracy metrics will be added when the ArUco marker ground-truth evaluation framework is complete. See Roadmap.