Configuration¶
RTSM is configured via config/rtsm.yaml. This page covers the main settings grouped by section.
Minimal Configuration¶
A minimal config to get started — most defaults are sensible:
camera:
width: 640
height: 480
fx: 604.6
fy: 604.9
cx: 318.8
cy: 259.2
segmentation:
backend: grounded_sam2 # Apache-2.0 default
io:
receiver: websocket # or zeromq for RealSense + RTABMap
Camera Intrinsics¶
Match these to your RGB-D camera. Values below are for RealSense D435i:
camera:
model: D435i
width: 640
height: 480
aligned_to: color
fx: 604.634705
fy: 604.906738
cx: 318.806030
cy: 259.239777
Segmentation Backend¶
The segmentation backend determines which models extract object masks from each frame. This is the most impactful setting for both performance and licensing.
segmentation:
backend: grounded_sam2 # default (Apache-2.0)
retina_masks: false # false = 640x640 (fast), true = original resolution
| Backend | Description | License | Mean Latency |
|---|---|---|---|
grounded_sam2 |
Grounding DINO + SAM2 (text-prompted) | Apache-2.0 | ~510 ms |
sam2 |
SAM2 auto-mask (class-agnostic, no labels) | Apache-2.0 | ~860 ms |
dual |
FastSAM + YOLOE IoU merge (prompt-free, 1200+ categories) | AGPL-3.0 | ~210 ms |
fastsam |
FastSAM only (class-agnostic) | AGPL-3.0 | ~50 ms |
yoloe |
YOLOE only (open-vocab, prompt-free) | AGPL-3.0 | ~60 ms |
AGPL backends require ultralytics
Backends using FastSAM or YOLOE require the ultralytics package (AGPL-3.0). Install with: pip install "rtsm[gpu-ultralytics]"
Backend-Specific Settings¶
Each backend has its own configuration block:
segmentation:
# Grounding DINO + SAM2 (default)
grounded_sam2:
gdino_model_id: IDEA-Research/grounding-dino-tiny
sam2_model_id: null # null = use sam2.model_id
device: cuda
box_threshold: 0.25 # detection confidence
text_threshold: 0.2 # text matching threshold
vocab: null # null = default 30-class indoor vocab
# SAM2 auto-mask
sam2:
model_id: facebook/sam2.1-hiera-small
device: cuda
points_per_side: 32
pred_iou_thresh: 0.7
stability_score_thresh: 0.92
# FastSAM (AGPL)
fastsam:
model_path: model_store/fastsam/FastSAM-x.pt
device: cuda
imgsz: 640
conf: 0.6
iou: 0.7
# YOLOE (AGPL)
yoloe:
model_path: model_store/yolo/yoloe-26s-seg-pf.pt
device: cuda
imgsz: 640
conf: 0.25
iou: 0.5
# Dual-confirmation settings (backend: dual)
dual:
iou_confirm_threshold: 0.40 # IoU for cross-validation
priority_boost_dual: 0.3 # priority boost for dual-confirmed masks
prefer_mask: yoloe26 # which mask to keep for dual-confirmed
I/O & Receiver¶
RTSM supports two input receiver backends:
WebSocket Receiver (Calabi Lens / ARKit)¶
io:
receiver: websocket
websocket:
host: "0.0.0.0"
port: 8765
require_tracking_normal: true # drop frames with bad tracking
keyframe_every_n: 30 # mark every Nth frame as keyframe
nonkf_min_interval_s: 0.5 # throttle non-keyframes (~2/s)
confidence_threshold: 2 # 0=all, 1=medium+high, 2=high only
ZeroMQ Receiver (RealSense + RTABMap)¶
io:
receiver: zeromq
camera_endpoint: tcp://172.27.240.1:5555 # D435i RGB-D frames
rtabmap_endpoint: tcp://127.0.0.1:6000 # RTABMap pose topics
Unit Conversion¶
If your depth source uses millimeters (e.g., RealSense D435i):
units:
depth_m_per_unit: 0.001 # mm → meters
pose_m_per_unit: 1.0 # RTABMap poses are already in meters
Mask Filtering & Heuristics¶
Controls which masks pass through the perception pipeline:
gates:
min_brightness: 5
min_std: 5
min_depth_valid: 0.35 # min fraction of valid depth pixels
masks:
min_coverage: 0.005
max_coverage: 0.8 # reject wall/floor-sized masks
max_border_fraction: 0.15
filters:
min_area_px: 500 # minimum mask area in pixels
aspect_ratio: [0.2, 5.0]
border_touch_max_pct: 0.15 # reject masks with >15% border contact
depth:
z_min_m: 0.2 # minimum depth (meters)
z_max_m: 8.0 # maximum depth (meters)
valid_min_pct: 0.10 # min valid depth pixel fraction
sigma_max_m: 0.50 # max depth spread
Staging & Top-K¶
Controls how masks are prioritized before CLIP encoding:
staging:
topk_preclip: 15 # max masks sent to CLIP per frame
crop_pad_px: 6 # padding around mask crops
clip_input: 224 # CLIP input resolution
# Priority weights
w_coverage: 1.0
w_border_fraction: -1.0
w_depth_valid: 1.0
w_depth_spread: -0.5
Association¶
Controls how new observations are matched to existing objects:
assoc:
cos_min: 0.90 # minimum cosine similarity for match
gate_dist_base_m: 0.50 # 3D distance gate (meters)
gate_reproj_px: 60 # reprojection distance gate (pixels)
nearest_m_for_cos: 8 # compare cosine against K nearest by distance
use_embeddings: true # use CLIP embeddings for matching
fallback_all_when_empty: true # fallback to all WM objects if no nearby ones
Object Promotion¶
Controls when proto-objects become confirmed:
object:
proto_ttl_s: 10.0 # seconds before unconfirmed proto expires
promote_hits: 2 # observations needed to promote
stability_promote: 0.55 # minimum stability score
require_view_bins: 1 # minimum viewpoint directions
stab_k: 0.45 # stability update factor
miss_decay: 0.5 # stability decay on missed frames
| Criterion | Config Key | Default |
|---|---|---|
| Observation count | promote_hits |
≥ 2 |
| Stability score | stability_promote |
≥ 0.55 |
| View diversity | require_view_bins |
≥ 1 viewpoint |
Sweep Cache & Spatial Grid¶
Controls the spatial indexing used for efficient neighbor lookups:
sweep_cache:
grid_size_m: 0.25 # cell size (meters)
per_cell_cap: 64 # max object IDs per cell
two_d: true # drop Z axis for indoor scenes
yaw_bins: 12 # 360° / 12 = 30° per bin
pitch_bins: 5
Vector Storage¶
Configure the embedding store for semantic search:
vectors:
enable: true
backend: faiss # faiss | milvus
dim: 512 # CLIP ViT-B/32 embedding size
faiss:
index_path: model_store/faiss/index.flatip
flush_period_s: 3.0
API Server¶
The REST API is available at http://localhost:8002. See REST API Reference.
MCP (Model Context Protocol)¶
Enable the embedded MCP server to expose spatial memory to AI agents:
When enabled, the MCP SSE endpoint is available at http://localhost:8002/mcp/sse. See REST API — MCP for available tools.
Visualization¶
visualization:
enable: true
host: 0.0.0.0
port: 8083 # WebSocket port for 3D frontend
depth:
min_m: 0.1
max_m: 3.5 # max depth for point cloud
tsdf:
enable: true
voxel_size: 0.01 # 1cm voxels
sdf_trunc: 0.04 # truncation distance
max_depth_m: 3.5
extract_every_n: 30 # extract mesh every N frames
objects:
push_interval_ms: 200 # push WM objects to clients
include_proto: true # include unconfirmed objects
analytics:
push_interval_ms: 1000 # analytics push cadence
full_sync_interval_s: 30 # full history re-send interval
The 3D visualization frontend connects via WebSocket at ws://localhost:8083/ws. See WebSocket API.
Analytics¶
analytics:
enable: true # enable runtime analytics buffers
retention_s: 3600 # Tier 2 history retention (1 hour)
buffer_frames: 300 # Tier 1 per-frame ring buffer size
When enabled, analytics are available via GET /stats/analytics and pushed to visualization clients.
SLAM (On-Device)¶
Settings for handling SLAM-corrected poses from Calabi Lens (on-device RTABMap):
slam:
log_pose_source: true
accept_pose_corrections: true # accept loop closure corrections
correct_working_memory: true # update WM positions on correction
Logging¶
Next Steps¶
- Architecture — Understand the system design
- REST API — API reference
- Benchmarks — Performance data per backend