
# Performance Benchmarks

Measured on an RTX 5090 (32 GB VRAM), replaying an iPhone ARKit recording (162 frames, 458.9 s indoor scene) at its original rate. Both backends process the same deterministic input through the identical 10-stage pipeline.

Benchmark script: `scripts/benchmark_backends.py`


## Backend Comparison

| Metric | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Mean latency | 210 ms | 510 ms |
| P50 (median) | 170 ms | 502 ms |
| P95 | 509 ms | 721 ms |
| Max | 539 ms | 830 ms |
| Masks/frame | 28.8 | 13.4 |
| Objects confirmed | 60 | 35 |
| Confirmation rate | 52.2% | 45.5% |
| License | AGPL-3.0 | Apache-2.0 |

The dual backend is 2.4x faster and discovers 71% more confirmed objects, primarily because it runs two lightweight CNN-based models rather than transformer-based ones, and because it operates prompt-free (1,200+ LVIS categories versus a 30-class text prompt).
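These headline ratios follow directly from the comparison table above; as a quick arithmetic check:

```python
# Sanity-check the headline ratios from the comparison table above.
dual_ms, gsam2_ms = 210, 510      # mean latency per frame
dual_conf, gsam2_conf = 60, 35    # confirmed objects

print(f"{gsam2_ms / dual_ms:.1f}x faster")                 # → 2.4x faster
print(f"{dual_conf / gsam2_conf - 1:.0%} more confirmed")  # → 71% more confirmed
```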


## Per-Stage Breakdown

### Mean Latency (ms)

| Stage | dual | grounded_sam2 | Notes |
|---|---|---|---|
| Segmentation | 116 (55%) | 222 (44%) | Model inference (CNN vs transformer) |
| Heuristics | 60 (29%) | 239 (47%) | Depth, border, planarity filtering |
| Scoring | 0.2 (<1%) | 0.2 (<1%) | Priority ranking |
| CLIP encode | 23 (11%) | 24 (5%) | ViT-B/32, top-15 candidates |
| Association | 6 (3%) | 4 (1%) | Proximity + cosine matching |
| **Total** | **210** | **510** | |

### P95 Latency (ms)

| Stage | dual | grounded_sam2 |
|---|---|---|
| Segmentation | 394 | 433 |
| Heuristics | 85 | 367 |
| Scoring | 0.4 | 0.3 |
| CLIP encode | 30 | 29 |
| Association | 8 | 7 |
| **Total** | **509** | **721** |
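Per-stage statistics like the ones above can be collected with a simple timing wrapper around each pipeline stage. This is a minimal sketch, not the actual instrumentation in `scripts/benchmark_backends.py`; the stage names and the `timed` helper are illustrative:

```python
import time
from collections import defaultdict
from statistics import mean, quantiles

stage_times = defaultdict(list)  # stage name -> per-frame latencies (ms)

def timed(stage, fn, *args, **kwargs):
    """Run one pipeline stage and record its wall-clock latency."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    stage_times[stage].append((time.perf_counter() - t0) * 1000.0)
    return result

def report():
    """Print mean and approximate P95 for every recorded stage."""
    for stage, samples in stage_times.items():
        p95 = quantiles(samples, n=20)[-1]  # last of 19 cut points ~ P95
        print(f"{stage:<12} mean={mean(samples):7.1f} ms  p95={p95:7.1f} ms")
```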

### Heuristics bottleneck

For grounded_sam2, the heuristics stage (239 ms mean) is the largest cost, exceeding model inference itself. SAM2 produces higher-fidelity masks that are more expensive to filter through depth validation and planarity checks, so optimizing the heuristics for SAM2 mask characteristics is the fastest path to reducing total latency.
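To make that cost concrete, here is a hedged sketch of the kind of per-mask depth-spread filter a heuristics stage might apply; the function, criterion, and threshold are illustrative assumptions, not rtsm's actual checks. The work scales with mask area, which is one reason higher-fidelity masks cost more to filter:

```python
import numpy as np

def passes_depth_check(mask: np.ndarray, depth: np.ndarray,
                       max_rel_spread: float = 0.5) -> bool:
    """Reject masks whose depth values are too scattered to be one object.

    Illustrative only: the criterion and threshold are assumptions,
    not the actual rtsm depth-validation heuristic.
    """
    d = depth[mask]                      # depth samples under the mask
    d = d[np.isfinite(d) & (d > 0)]      # drop invalid readings
    if d.size < 50:                      # too few valid pixels to judge
        return False
    spread = (np.percentile(d, 95) - np.percentile(d, 5)) / np.median(d)
    return spread <= max_rel_spread
```

Subsampling mask pixels before computing the statistics is one obvious way to trade a little accuracy for a large constant-factor speedup on big masks.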


## Segmentation Analysis

### Mask Generation

| Metric | dual | grounded_sam2 |
|---|---|---|
| Raw masks/frame | 28.8 | 13.4 |
| Top-K candidates (CLIP input) | 13.6 | 9.0 |
| Overall survival (candidate/raw) | 47.0% | 66.9% |

The dual backend generates more masks but with lower individual survival rate. Grounded SAM2 produces fewer but higher-quality masks that pass heuristics more reliably.

### Dual Confirmation Breakdown

In dual mode, FastSAM (class-agnostic) and YOLOE (open-vocab) masks are cross-validated via IoU (threshold 0.40):

| Source | % of all masks | % selected into top-K |
|---|---|---|
| Dual-confirmed (both agree) | 22.4% | 32.8% |
| FastSAM-only | 61.4% | 58.1% |
| YOLOE-only | 16.1% | 9.1% |

Dual-confirmed masks are preferentially promoted into top-K (1.5x selection rate vs presence rate), indicating the pipeline correctly prioritizes cross-validated detections.
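The cross-validation step can be sketched as follows; `mask_iou` and `tag_sources` are illustrative names, and the real merge logic may deduplicate or weight masks differently:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def tag_sources(fastsam_masks, yoloe_masks, iou_thresh=0.40):
    """Label each FastSAM mask as dual-confirmed when any YOLOE mask
    overlaps it at IoU >= threshold (0.40, per the text above)."""
    return [
        "dual-confirmed"
        if any(mask_iou(fm, ym) >= iou_thresh for ym in yoloe_masks)
        else "fastsam-only"
        for fm in fastsam_masks
    ]
```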


## Object Discovery

### Final Working Memory State

| Metric | dual | grounded_sam2 |
|---|---|---|
| Total objects | 115 | 77 |
| Confirmed | 60 | 35 |
| Proto (unconfirmed) | 55 | 42 |
| Avg hits/object | 3.0 | 2.8 |
| Confirmation rate | 52.2% | 45.5% |

The dual backend discovers 49% more total objects and 71% more confirmed objects from the same input, driven by higher mask-per-frame count and prompt-free vocabulary coverage.
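The proto→confirmed lifecycle implied by the `Promote hits` (2) and `Proto TTL` (10.0 s) parameters in the Test Configuration can be sketched as below; the class and helper names are assumptions about the shape of the working-memory logic, not rtsm's actual implementation:

```python
from dataclasses import dataclass

PROMOTE_HITS = 2    # observations needed before a proto-object is confirmed
PROTO_TTL_S = 10.0  # unconfirmed proto-objects expire after this long unseen

@dataclass
class TrackedObject:
    hits: int = 1          # the first detection counts as one hit
    last_seen: float = 0.0

    @property
    def confirmed(self) -> bool:
        return self.hits >= PROMOTE_HITS

    def observe(self, now: float) -> None:
        """Record a re-detection (association hit) at time `now`."""
        self.hits += 1
        self.last_seen = now

    def expired(self, now: float) -> bool:
        """Only unconfirmed proto-objects are evicted by the TTL."""
        return not self.confirmed and now - self.last_seen > PROTO_TTL_S
```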


## Throughput

| Metric | dual | grounded_sam2 |
|---|---|---|
| Frames processed | 38 | 38 |
| Processing Hz | 1.13 | 1.13 |
| Mean queue depth | 0.1 | 0.3 |
| Warmup frames skipped | 5 | 5 |

Both backends process the same 38 frames (out of 162 total); the remainder are filtered by the ingest gate (keyframe gating, non-keyframe throttle, sweep policy). Processing Hz is identical because the replay feeds frames at real-time rate with natural gaps between keyframes.
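A minimal sketch of how a motion/time-based keyframe gate can work; the thresholds and class name here are illustrative assumptions, and the actual rtsm ingest gate also applies the non-keyframe throttle and sweep policy not modeled below:

```python
import math

class IngestGate:
    """Accept a frame when the camera has moved enough or enough time
    has passed since the last accepted keyframe (thresholds assumed)."""

    def __init__(self, min_translation_m: float = 0.10,
                 min_interval_s: float = 1.0):
        self.min_translation_m = min_translation_m
        self.min_interval_s = min_interval_s
        self._last_pos = None
        self._last_t = 0.0

    def accept(self, t: float, pos: tuple) -> bool:
        if self._last_pos is None:
            keep = True  # always keep the first frame
        else:
            moved = math.dist(pos, self._last_pos) >= self.min_translation_m
            waited = (t - self._last_t) >= self.min_interval_s
            keep = moved or waited
        if keep:
            self._last_pos, self._last_t = pos, t
        return keep
```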


## Architecture Trade-offs

| Dimension | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Architecture | CNN (YOLOv8 backbone) | Transformer (Swin + Hiera ViT) |
| License | AGPL-3.0 | Apache-2.0 |
| Vocabulary | 1200+ LVIS (prompt-free) | Configurable text prompt |
| Mask quality | Two-model IoU consensus | SAM2 high-quality masks |
| Latency | 210 ms mean | 510 ms mean |
| Detection strategy | Class-agnostic + open-vocab merge | Text-prompted detection |
| Best for | Research, prototyping, max recall | Commercial deployment, permissive licensing |

## Test Configuration

| Parameter | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32 GB) |
| Python | 3.12 |
| Recording | `recordings/session1` (iPhone ARKit, 162 frames, 458.9 s) |
| RGB resolution | 640x480 |
| Mask resolution | 640x640 (`retina_masks: false`) |
| CLIP model | ViT-B/32 (OpenAI) |
| Top-K pre-CLIP | 15 |
| Association cos_min | 0.90 |
| Promote hits | 2 |
| Proto TTL | 10.0 s |
| Replay rate | Real-time (original recording rate) |

## Reproducing

```shell
# Run the benchmark (requires both backend dependencies)
pip install "rtsm[all]" --extra-index-url https://download.pytorch.org/whl/cu128
python scripts/benchmark_backends.py
```

Results are saved to:

- `reports/backend_comparison.md`: formatted comparison report
- `reports/raw_dual.json`: raw metrics (dual backend)
- `reports/raw_grounded_sam2.json`: raw metrics (grounded_sam2 backend)

## Accuracy Evaluation

**Coming soon.** Precision, recall, and spatial position accuracy metrics will be added when the ArUco marker ground-truth evaluation framework is complete. See Roadmap.