Perception Pipeline

The perception pipeline extracts object instances from RGB-D frames and encodes them for matching and search. The segmentation stage is swappable — the rest of the pipeline is shared across all backends.

Pipeline Stages

RGB Frame → Segmentation → Heuristics → Top-K Scoring → CLIP Encode → Vocab Classify
              (swappable)   (filtering)   (priority)     (embedding)    (labeling)
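One way to picture the swappable stage is a small interface that every backend satisfies, with the downstream stages written once against it. The sketch below is illustrative only: `SegmentationBackend`, `Mask`, `StubBackend`, `run_pipeline`, and the two toy stages are hypothetical names chosen for this example, not RTSM's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass
class Mask:
    """Hypothetical mask record: pixel area plus the backend that produced it."""
    area_px: int
    source: str


class SegmentationBackend(Protocol):
    """Anything with a matching segment() method can be plugged in."""

    def segment(self, rgb: Sequence[Sequence[int]]) -> List[Mask]: ...


class StubBackend:
    """Stand-in backend that returns fixed masks (for illustration only)."""

    def __init__(self, name: str, areas: Sequence[int]) -> None:
        self.name = name
        self.areas = list(areas)

    def segment(self, rgb: Sequence[Sequence[int]]) -> List[Mask]:
        return [Mask(area_px=a, source=self.name) for a in self.areas]


def run_pipeline(
    backend: SegmentationBackend,
    rgb: Sequence[Sequence[int]],
    shared_stages: Sequence[Callable[[List[Mask]], List[Mask]]],
) -> List[Mask]:
    """Run the swappable stage, then the backend-independent stages in order."""
    masks = backend.segment(rgb)
    for stage in shared_stages:
        masks = stage(masks)
    return masks


def drop_tiny(masks: List[Mask]) -> List[Mask]:
    # Toy shared stage: a minimum-area filter with a 500 px threshold.
    return [m for m in masks if m.area_px >= 500]


def keep_largest_two(masks: List[Mask]) -> List[Mask]:
    # Toy shared stage: top-K with K=2, ranked by area only.
    return sorted(masks, key=lambda m: m.area_px, reverse=True)[:2]


frame = [[0] * 4 for _ in range(3)]  # placeholder image
stages = [drop_tiny, keep_largest_two]
out_a = run_pipeline(StubBackend("dual", [100, 900, 2500, 700]), frame, stages)
out_b = run_pipeline(StubBackend("grounded_sam2", [4000, 300]), frame, stages)
```

Because the shared stages only see a list of masks, swapping the backend changes nothing downstream.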

1. Segmentation

The first stage produces instance masks from the RGB image. RTSM supports multiple backends:

grounded_sam2 (default, Apache-2.0)

Grounding DINO detects objects via text prompt, then SAM2 generates high-quality masks from the detected bounding boxes.

Architecture: Transformer (Swin backbone + Hiera ViT)
Input: 640x480 RGB + text prompt (30-class indoor vocabulary)
Output: ~13 masks/frame (text-prompted detection)
Seg time: 222 ms mean

dual (FastSAM + YOLOE, AGPL-3.0)

Two CNN models run independently, then masks are cross-validated via IoU:

FastSAM: Class-agnostic segmentation (~24 masks/frame)
YOLOE: Open-vocabulary detection + segmentation (~11 masks/frame, 1200+ LVIS categories)
Merge: IoU > 0.40 = dual-confirmed, remainder kept as single-source
Output: ~29 masks/frame (merged)
Seg time: 116 ms mean

Dual-confirmed masks receive a priority boost and are 1.5x more likely to be selected during Top-K selection.
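The IoU cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: masks are represented as sets of `(row, col)` pixels for simplicity (a real pipeline would more likely use boolean arrays), matching is greedy one-to-one, and a dual-confirmed entry keeps the FastSAM pixels with the YOLOE label. `MergedMask`, `mask_iou`, `merge_dual`, and `rect` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
PixelSet = FrozenSet[Pixel]


@dataclass
class MergedMask:
    pixels: PixelSet
    source: str                  # "dual", "fastsam", or "yoloe"
    label: Optional[str] = None  # only YOLOE provides a label


def mask_iou(a: PixelSet, b: PixelSet) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_dual(
    fastsam: List[PixelSet],
    yoloe: List[Tuple[PixelSet, str]],
    iou_thresh: float = 0.40,
) -> List[MergedMask]:
    """Mark FastSAM/YOLOE pairs with IoU above the threshold as dual-confirmed."""
    merged: List[MergedMask] = []
    used: Set[int] = set()
    for fs in fastsam:
        best_j, best_iou = None, 0.0
        for j, (yo, _) in enumerate(yoloe):
            if j in used:
                continue
            iou = mask_iou(fs, yo)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou > iou_thresh:
            used.add(best_j)
            merged.append(MergedMask(fs, "dual", yoloe[best_j][1]))
        else:
            merged.append(MergedMask(fs, "fastsam"))
    # Unmatched YOLOE masks are kept as single-source.
    for j, (yo, label) in enumerate(yoloe):
        if j not in used:
            merged.append(MergedMask(yo, "yoloe", label))
    return merged


def rect(r0: int, c0: int, r1: int, c1: int) -> PixelSet:
    """Helper: filled rectangle, half-open in both axes."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


fs_masks = [rect(0, 0, 10, 10), rect(20, 20, 30, 30)]
yo_masks = [(rect(0, 0, 10, 8), "mug"), (rect(50, 50, 60, 60), "chair")]
merged = merge_dual(fs_masks, yo_masks)
```

In this toy input, the first FastSAM mask overlaps the "mug" detection well above the threshold and becomes dual-confirmed; the other two masks survive as single-source entries.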

Other backends

| Backend | Description | Seg time |
|---|---|---|
| `sam2` | SAM2 auto-mask (segment everything, no labels) | ~860 ms |
| `fastsam` | FastSAM only (class-agnostic, fast) | ~50 ms |
| `yoloe` | YOLOE only (open-vocab, prompt-free) | ~60 ms |

2. Mask Heuristics

Heuristic filters remove unsuitable masks using depth and geometric information:

| Filter | Purpose | Config key (default) |
|---|---|---|
| Min area | Remove noise/tiny fragments | `filters.min_area_px` (500) |
| Max coverage | Remove walls/floors/background | `masks.max_coverage` (0.8) |
| Aspect ratio | Remove extreme shapes | `filters.aspect_ratio` ([0.2, 5.0]) |
| Border contact | Reject masks with too much contact with the frame edges | `filters.border_touch_max_pct` (0.15) |
| Depth validity | Require minimum valid depth pixels | `filters.depth.valid_min_pct` (0.10) |
| Depth range | Reject too close/far objects | `filters.depth.z_min_m` / `z_max_m` |
| Depth spread | Reject noisy depth regions | `filters.depth.sigma_max_m` (0.50) |
| Planarity | Detect and score planar surfaces | `planarity.*` |
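The filters in the table above can be read as a chain of early-exit checks. The sketch below assumes the per-mask statistics have already been computed and uses the defaults listed in the table. Several details are assumptions made for illustration: the aspect ratio is taken as bounding-box width over height, the depth range is checked against the median depth, the `z_min_m` / `z_max_m` values are placeholders because no defaults are stated, and planarity is omitted. `FilterConfig`, `MaskStats`, and `reject_reason` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FilterConfig:
    z_min_m: float                    # no default stated in the docs
    z_max_m: float                    # no default stated in the docs
    min_area_px: int = 500
    max_coverage: float = 0.8
    aspect_ratio: Tuple[float, float] = (0.2, 5.0)
    border_touch_max_pct: float = 0.15
    depth_valid_min_pct: float = 0.10
    depth_sigma_max_m: float = 0.50


@dataclass
class MaskStats:
    area_px: int
    bbox_w: int              # assumed >= 1
    bbox_h: int              # assumed >= 1
    border_touch_pct: float  # fraction of the mask in contact with frame edges
    depth_valid_pct: float   # fraction of mask pixels with valid depth
    depth_median_m: float
    depth_sigma_m: float


def reject_reason(
    m: MaskStats, cfg: FilterConfig, frame_w: int = 640, frame_h: int = 480
) -> Optional[str]:
    """Return the name of the first failed filter, or None if the mask passes."""
    if m.area_px < cfg.min_area_px:
        return "min_area"
    if m.area_px / (frame_w * frame_h) > cfg.max_coverage:
        return "max_coverage"
    lo, hi = cfg.aspect_ratio
    if not lo <= m.bbox_w / m.bbox_h <= hi:
        return "aspect_ratio"
    if m.border_touch_pct > cfg.border_touch_max_pct:
        return "border_contact"
    if m.depth_valid_pct < cfg.depth_valid_min_pct:
        return "depth_validity"
    if not cfg.z_min_m <= m.depth_median_m <= cfg.z_max_m:
        return "depth_range"
    if m.depth_sigma_m > cfg.depth_sigma_max_m:
        return "depth_spread"
    return None


cfg = FilterConfig(z_min_m=0.3, z_max_m=5.0)  # placeholder depth range
good = MaskStats(4000, 80, 60, 0.0, 0.9, 1.5, 0.05)
tiny = MaskStats(120, 12, 10, 0.0, 0.9, 1.5, 0.05)
wall = MaskStats(280000, 640, 470, 0.6, 0.9, 3.0, 0.1)
sliver = MaskStats(900, 300, 3, 0.0, 0.9, 1.5, 0.05)
far = MaskStats(4000, 80, 60, 0.0, 0.9, 9.0, 0.05)
```

Ordering the cheap geometric checks before the depth checks is a design choice of this sketch; it lets most rejected masks exit before any depth statistics are consulted.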

Heuristics cost varies by backend

With the `grounded_sam2` backend, the heuristics stage takes about 4x longer than with `dual` (239 ms vs 60 ms), likely because higher-fidelity mask boundaries require more compute in depth validation and planarity checks.

3. Top-K Selection

After filtering, masks are scored and the top K (default: 15) are kept to bound CLIP compute.

Priority scoring considers:

Coverage (mask area relative to frame)
Depth validity (% of mask with valid depth)
Border fraction (penalty for edge-touching)
Depth spread (penalty for noisy depth)
Structure score (planarity + geometry)
Dual-confirmation boost (if applicable)

Same-frame deduplication removes overlapping candidates before CLIP.
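A minimal sketch of scoring, deduplication, and top-K selection is shown below. The scoring terms follow the list above, but the weights, the additive dual-confirmation boost, the bounding-box IoU used for deduplication, and its 0.8 threshold are all assumptions made for illustration; the source does not specify them, nor whether deduplication runs before or during selection. `Candidate`, `priority`, `box_iou`, and `select_top_k` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


@dataclass
class Candidate:
    box: Box
    coverage: float         # mask area relative to frame
    depth_valid_pct: float  # fraction of mask with valid depth
    border_pct: float       # fraction touching the frame edges
    depth_sigma_m: float    # depth spread
    structure: float        # planarity + geometry score
    dual_confirmed: bool = False


def priority(c: Candidate) -> float:
    """Weighted sum of rewards and penalties (weights are illustrative)."""
    score = (
        1.0 * c.coverage
        + 1.0 * c.depth_valid_pct
        + 0.5 * c.structure
        - 1.0 * c.border_pct
        - 1.0 * c.depth_sigma_m
    )
    if c.dual_confirmed:
        score += 0.25  # assumed additive boost
    return score


def box_iou(a: Box, b: Box) -> float:
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def select_top_k(cands: List[Candidate], k: int = 15, dedup_iou: float = 0.8) -> List[Candidate]:
    """Walk candidates in priority order, skipping near-duplicates, until K are kept."""
    kept: List[Candidate] = []
    for c in sorted(cands, key=priority, reverse=True):
        if any(box_iou(c.box, other.box) > dedup_iou for other in kept):
            continue
        kept.append(c)
        if len(kept) == k:
            break
    return kept


a = Candidate((0, 0, 100, 100), 0.03, 0.9, 0.0, 0.05, 0.5, dual_confirmed=True)
b = Candidate((2, 2, 100, 100), 0.03, 0.9, 0.0, 0.05, 0.5)   # near-duplicate of a
c = Candidate((200, 200, 260, 260), 0.01, 0.5, 0.2, 0.3, 0.1)
picked = select_top_k([b, c, a], k=2)
```

Here `b` overlaps `a` almost entirely and scores lower, so it is dropped as a duplicate and `c` takes the second slot.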

4. CLIP Encoding

Each selected mask is:

Cropped from the RGB image (with 6px padding)
Resized to 224x224
Encoded via CLIP ViT-B/32 (OpenAI)

Output: 512-dimensional L2-normalized embedding vector
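The cropping step can be illustrated with a small helper that finds the mask's bounding box, grows it by the padding, and clamps it to the frame. This is a sketch under stated assumptions: the mask is a nested list of booleans, the padding is applied to the bounding box rather than to the mask outline, and `padded_crop_box` and `crop` are hypothetical names. Resizing to 224x224 and the CLIP forward pass are left out.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def padded_crop_box(mask: Sequence[Sequence[bool]], pad: int = 6) -> Optional[Box]:
    """Bounding box of the True pixels, grown by `pad` and clamped to the frame."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # empty mask: nothing to crop
    cols = [x for x in range(w) if any(mask[y][x] for y in rows)]
    x0 = max(0, cols[0] - pad)
    y0 = max(0, rows[0] - pad)
    x1 = min(w, cols[-1] + 1 + pad)
    y1 = min(h, rows[-1] + 1 + pad)
    return x0, y0, x1, y1


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Slice the image rows and columns covered by the box."""
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


# Toy 60x40 frame with a mask covering rows 10..19 and columns 2..11.
frame_w, frame_h = 60, 40
mask = [[10 <= y < 20 and 2 <= x < 12 for x in range(frame_w)] for y in range(frame_h)]
image = [[y * frame_w + x for x in range(frame_w)] for y in range(frame_h)]

box = padded_crop_box(mask)
patch = crop(image, box)
```

The mask in this example sits two pixels from the left edge, so the padding is clipped on that side while the other three sides receive the full 6 px.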

These embeddings enable:

Matching observations across frames (cosine similarity)
Semantic search via text queries
Object identity tracking over time
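Because the embeddings are L2-normalized, cosine similarity reduces to a dot product. The sketch below shows cross-frame matching on toy 3-dimensional vectors standing in for the 512-dimensional embeddings. The `min_sim` threshold of 0.8 and the names `l2_normalize`, `cosine`, and `best_match` are assumptions for illustration only.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def best_match(
    query: Sequence[float],
    tracked: Dict[str, List[float]],
    min_sim: float = 0.8,  # assumed acceptance threshold
) -> Tuple[Optional[str], float]:
    """Return the most similar tracked object, or None if nothing is close enough."""
    best_id: Optional[str] = None
    best_sim = -1.0
    for obj_id, emb in tracked.items():
        sim = cosine(query, emb)
        if sim > best_sim:
            best_id, best_sim = obj_id, sim
    if best_sim < min_sim:
        return None, best_sim
    return best_id, best_sim


tracked = {
    "obj_1": l2_normalize([1.0, 0.0, 0.0]),
    "obj_2": l2_normalize([0.0, 1.0, 1.0]),
}
seen_again = l2_normalize([0.9, 0.1, 0.0])  # close to obj_1
novel = l2_normalize([0.0, 0.0, 1.0])       # not close enough to anything

match_id, match_sim = best_match(seen_again, tracked)
novel_id, novel_sim = best_match(novel, tracked)
```

The same comparison works for text queries: embed the query text into the same space and rank the stored object embeddings by dot product.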

Speed: ~23 ms for 15 candidates (batch encode on GPU).

5. Vocabulary Classification

Object labels are assigned by comparing the CLIP embedding to pre-computed text embeddings from a configurable vocabulary (`config/clip/vocab.yaml`):

similarities = cosine_similarity(image_embedding, text_embeddings)
label = vocab[argmax(similarities)]
confidence = max(similarities)

Labels are tracked as EWMA (exponentially weighted moving average) scores across observations, so transient misclassifications don't persist.
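A runnable sketch of the classification step and the EWMA label tracking follows. It assumes unit-length embeddings (so the dot product equals cosine similarity), a smoothing factor `alpha` of 0.3, and an update rule in which every known label decays each observation while the observed label is credited with its confidence. `classify` and `LabelTracker` are hypothetical names, and the similarity values are made up for the example.

```python
from typing import Dict, List, Sequence, Tuple


def classify(
    image_emb: Sequence[float], text_embs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Pick the vocabulary label whose text embedding is most similar."""
    sims = {
        label: sum(a * b for a, b in zip(image_emb, emb))
        for label, emb in text_embs.items()
    }
    label = max(sims, key=sims.get)
    return label, sims[label]


class LabelTracker:
    """Per-object EWMA of label scores: s = (1 - alpha) * s + alpha * x."""

    def __init__(self, alpha: float = 0.3) -> None:
        self.alpha = alpha
        self.scores: Dict[str, float] = {}

    def update(self, label: str, confidence: float) -> str:
        # Labels not observed this frame receive x = 0, so they only decay.
        for name in self.scores:
            self.scores[name] *= 1.0 - self.alpha
        self.scores[label] = self.scores.get(label, 0.0) + self.alpha * confidence
        return self.best()

    def best(self) -> str:
        return max(self.scores, key=self.scores.get)


# Toy 2-D vocabulary embeddings (unit length).
text_embs = {"mug": [1.0, 0.0], "bowl": [0.0, 1.0]}
label, confidence = classify([0.8, 0.6], text_embs)

# One transient "bowl" observation among "mug" observations.
tracker = LabelTracker(alpha=0.3)
observations = [("mug", 0.30), ("mug", 0.31), ("bowl", 0.33), ("mug", 0.29)]
history = [tracker.update(name, conf) for name, conf in observations]
```

Even though the single "bowl" observation has the highest raw confidence, the accumulated "mug" score stays ahead, so the tracked label does not flip.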

Performance Summary

Measured on RTX 5090, 640x480 input. See Benchmarks for full data.

| Stage | dual (ms) | grounded_sam2 (ms) |
|---|---|---|
| Segmentation | 116 | 222 |
| Heuristics | 60 | 239 |
| Scoring | 0.2 | 0.2 |
| CLIP encode | 23 | 24 |
| Association | 6 | 4 |
| Total | 210 | 510 |
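The figures in the table can be turned into a rough per-frame budget. The sketch below only does arithmetic on the reported numbers. It rests on two assumptions: frames are processed one at a time (so throughput is the reciprocal of the total), and any gap between the reported total and the sum of the listed stages is time spent outside those five stages.

```python
# Per-stage means copied from the table (milliseconds).
stage_ms = {
    "dual": {
        "Segmentation": 116, "Heuristics": 60, "Scoring": 0.2,
        "CLIP encode": 23, "Association": 6,
    },
    "grounded_sam2": {
        "Segmentation": 222, "Heuristics": 239, "Scoring": 0.2,
        "CLIP encode": 24, "Association": 4,
    },
}
total_ms = {"dual": 210, "grounded_sam2": 510}

summary = {}
for backend, stages in stage_ms.items():
    listed = sum(stages.values())
    summary[backend] = {
        "listed_ms": round(listed, 1),
        # Part of the reported total not covered by the listed stages.
        "unlisted_ms": round(total_ms[backend] - listed, 1),
        # Assumes strictly sequential, one-frame-at-a-time processing.
        "fps": round(1000.0 / total_ms[backend], 2),
        "heaviest_stage": max(stages, key=stages.get),
    }

for backend, row in summary.items():
    print(backend, row)
```

Under these assumptions, segmentation dominates the `dual` budget, while heuristics is the largest single stage for `grounded_sam2`.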

Next Steps

Memory Model — How observations become persistent objects
Architecture — Full system overview
Benchmarks — Detailed performance data