Perception Pipeline

The perception pipeline extracts object instances from RGB-D frames and encodes them for matching and search.


Pipeline Stages

RGB Frame → FastSAM → Mask Filter → CLIP Encode → Vocab Classify

1. Segmentation (FastSAM)

FastSAM generates instance masks from the RGB image.

  • Input: 640×480 RGB image
  • Output: Variable number of binary masks
  • Speed: ~10-15ms per frame

FastSAM is a CNN-based approximation of SAM (Segment Anything), trading some accuracy for 50× faster inference.
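
A minimal sketch of this stage, assuming the FastSAM wrapper from the ultralytics package; the checkpoint name and threshold values below are illustrative, not part of the pipeline's configuration:

import numpy as np
from ultralytics import FastSAM   # assumed wrapper; any FastSAM build that returns instance masks works

model = FastSAM("FastSAM-s.pt")   # hypothetical checkpoint path

def segment(rgb_image: np.ndarray):
    """Run FastSAM on a 640x480 RGB frame; return binary masks and per-mask confidences."""
    results = model(rgb_image, retina_masks=True, imgsz=640, conf=0.4, iou=0.9)
    if results[0].masks is None:                        # nothing detected in this frame
        return [], []
    masks = results[0].masks.data.cpu().numpy()         # (N, H, W) soft masks
    confs = results[0].boxes.conf.cpu().numpy()         # per-instance confidence scores
    return [(m > 0.5).astype(np.uint8) for m in masks], confs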


2. Mask Filtering

Heuristic filters remove unsuitable masks:

Filter                    Purpose
Min area (0.1% of frame)  Remove noise and tiny fragments
Max area (50% of frame)   Remove walls, floors, and other background
Aspect ratio              Remove extreme shapes
Edge touching             Optionally filter partially visible objects at the frame boundary

These filters typically reject 10-15% of the candidate masks; a minimal sketch of the filter is shown below.
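
The sketch uses the thresholds from the table above; the aspect-ratio cap is an illustrative value:

import numpy as np

def keep_mask(mask: np.ndarray,
              min_area_frac: float = 0.001,     # 0.1% of the frame
              max_area_frac: float = 0.5,       # 50% of the frame
              max_aspect: float = 10.0,         # illustrative aspect-ratio cap
              reject_edge_touching: bool = False) -> bool:
    """Return True if a binary mask passes the heuristic filters."""
    h, w = mask.shape
    area_frac = mask.sum() / (h * w)
    if not (min_area_frac <= area_frac <= max_area_frac):
        return False

    ys, xs = np.nonzero(mask)
    bw, bh = xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
    if max(bw, bh) / min(bw, bh) > max_aspect:
        return False

    if reject_edge_touching and (xs.min() == 0 or ys.min() == 0
                                 or xs.max() == w - 1 or ys.max() == h - 1):
        return False
    return True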


3. Top-K Selection

After filtering, we keep only the top K masks (default: 20) per frame to bound compute cost. Selection prioritizes:

  1. Mask confidence score
  2. Area (medium-sized preferred)
  3. Distance from frame center
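
A sketch of the selection step; the scoring weights and the exact "medium-sized, centered" heuristics are illustrative, since the scoring is implementation-dependent:

import numpy as np

def select_top_k(masks, confidences, k=20, target_area_frac=0.05):
    """Score masks by confidence, area, and centrality; keep the best k."""
    scored = []
    for mask, conf in zip(masks, confidences):
        h, w = mask.shape
        area_frac = mask.sum() / (h * w)
        # Prefer medium-sized masks: penalize deviation from a target area fraction.
        area_score = 1.0 - min(abs(area_frac - target_area_frac) / target_area_frac, 1.0)
        # Prefer masks whose centroid lies near the frame center.
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean() / h, xs.mean() / w
        center_score = 1.0 - np.hypot(cy - 0.5, cx - 0.5) / np.hypot(0.5, 0.5)
        scored.append(0.5 * conf + 0.3 * area_score + 0.2 * center_score)  # illustrative weights
    order = np.argsort(scored)[::-1][:k]
    return [masks[i] for i in order]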

4. CLIP Encoding

Each mask is:

  1. Cropped from the RGB image (with padding)
  2. Resized to 224×224
  3. Encoded via CLIP ViT-B/32

Output: 512-dimensional embedding vector
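
A sketch of the encoding step, assuming the openai/CLIP Python package (clip.load / encode_image); the padding fraction and the bounding-box crop are illustrative:

import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # preprocess resizes/center-crops to 224x224

def encode_mask(rgb_image: np.ndarray, mask: np.ndarray, pad_frac: float = 0.1):
    """Crop the masked region (with padding) and return a 512-d CLIP embedding."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    pad_y = int(pad_frac * (ys.max() - ys.min()))
    pad_x = int(pad_frac * (xs.max() - xs.min()))
    y0, y1 = max(ys.min() - pad_y, 0), min(ys.max() + pad_y + 1, h)
    x0, x1 = max(xs.min() - pad_x, 0), min(xs.max() + pad_x + 1, w)
    crop = Image.fromarray(rgb_image[y0:y1, x0:x1])
    with torch.no_grad():
        embedding = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    return embedding / embedding.norm(dim=-1, keepdim=True)    # (1, 512), L2-normalized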

These embeddings enable:

  • Matching observations across frames
  • Semantic search via text queries
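
For example, a free-text query can be scored against stored object embeddings in the same way; in this sketch the prompt template and the (N × 512) embedding matrix layout are assumptions:

import clip
import torch

def search(query: str, object_embeddings: torch.Tensor, model, device: str, top_n: int = 5):
    """Rank stored (L2-normalized, N x 512) object embeddings against a text query."""
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([f"a photo of a {query}"]).to(device))
    q = q / q.norm(dim=-1, keepdim=True)
    # object_embeddings must live on the same device as the query embedding.
    scores = (object_embeddings @ q.T).squeeze(1)       # cosine similarity per object
    return scores.topk(min(top_n, scores.numel()))      # (values, indices) of best matches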

5. Vocabulary Classification

Object labels are assigned by comparing the CLIP embedding to pre-computed text embeddings:

import clip
import torch

vocab = ["mug", "backpack", "chair"]            # ... extend with domain-specific objects
prompts = [f"a photo of a {name}" for name in vocab]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    text_embeddings = model.encode_text(clip.tokenize(prompts).to(device))

# Cosine similarity = dot product of L2-normalized embeddings.
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)   # (1, 512) from the CLIP encoding stage
similarities = (image_embedding @ text_embeddings.T).squeeze(0)

label = vocab[similarities.argmax()]
confidence = similarities.max().item()

The vocabulary is configurable — add domain-specific objects for your use case.
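
Putting the stages together, a per-frame loop might look like the sketch below; it reuses the helper functions from the earlier sketches, and classify() stands in for the vocabulary-classification snippet above (a hypothetical wrapper):

def process_frame(rgb_image):
    """Run the full per-frame pipeline (sketch; reuses the helpers above)."""
    masks, confs = segment(rgb_image)                              # 1. FastSAM segmentation
    kept = [(m, c) for m, c in zip(masks, confs) if keep_mask(m)]  # 2. mask filtering
    masks = select_top_k([m for m, _ in kept],
                         [c for _, c in kept], k=20)               # 3. top-K selection
    detections = []
    for mask in masks:
        embedding = encode_mask(rgb_image, mask)                   # 4. CLIP encoding
        label, confidence = classify(embedding)                    # 5. vocab classification (hypothetical wrapper)
        detections.append({"mask": mask, "embedding": embedding,
                           "label": label, "confidence": confidence})
    return detections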


Performance

Stage                   Time (RTX 5090)
FastSAM                 ~12 ms
Mask filtering          <1 ms
CLIP encode (20 masks)  ~15 ms
Vocab classify          <1 ms
Total                   <30 ms

Next Steps