Mind Two

Mind Two is a real-time assistive spatial perception system for people with visual impairments. It uses LLM-guided queries, Grounding DINO, SAM 2, and Depth Anything to localize and track objects in 3D, preserving egocentric spatial reasoning even after objects leave the camera frame.

Watch the demo

Demo video: https://www.youtube.com/watch?v=kxXIVC9BiPU

Original repo: https://github.com/patrick-tssn/Streaming-Grounded-SAM-2

The current live pipeline combines:

  • OpenAI GPT-based query extraction
  • Grounding DINO for text-conditioned detection
  • SAM 2 for mask initialization and tracking
  • Depth Anything V2 Metric for distance estimation
  • Optional audio query input with OpenAI speech-to-text

The main entrypoint is run_live.py.

Architecture

Architecture overview (diagram in the repo): high-level architecture for the current live perception pipeline.

The runtime is split into a few concrete layers:

Model Pipeline

The live path in run_live.py works like this:

  1. Input enters as either:

    • a text query from --query, or
    • an audio command captured after the wake phrase hello
  2. The query is queued and sent to the LLM extraction step.

    • The extraction produces:
      • targets
      • anchors
      • support_surfaces
  3. Anchor selection is resolved.

    • Default: fixed anchors
    • Optional: LLM-derived anchors
  4. Grounding DINO runs on the current frame for the active target phrase.

    • This produces candidate boxes for target initialization or re-detection.
  5. SAM 2 loads the frame and initializes tracked objects from those boxes.

    • After initialization, SAM 2 handles intermediate tracking updates between re-detections.
  6. Depth Anything V2 Metric runs on the full frame in a background worker.

    • The runtime samples depth values only inside the tracked SAM masks.
    • Median object depth is used for the distance overlay and scene reasoning.
  7. Context detections can run in parallel.

    • Anchor detections
    • Support-surface detections
    • Optional hand detections
  8. Scene reasoning and memory update from the tracked target state.

    • Spatial relations are computed from target, anchors, supports, and hand context.
    • Stable target observations can be written into scene memory.
  9. The UI overlay renders the current state.

    • query summary
    • tracking labels
    • depth estimates
    • spatial relations
    • memory fallback text when tracking is lost

In short:

query/audio -> LLM extraction -> Grounding DINO boxes -> SAM 2 tracking -> Depth Anything masked distance -> scene reasoning + memory -> overlay
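The masked-distance step above (depth sampled only inside the tracked SAM masks, reduced to a median) can be sketched with NumPy. Function and variable names here are illustrative, not the actual run_live.py code:

```python
import numpy as np

def masked_median_depth(depth_map: np.ndarray, mask: np.ndarray):
    """Median metric depth over the pixels of one tracked object mask.

    depth_map: (H, W) float array of per-pixel metric depth (meters),
               e.g. from Depth Anything V2 Metric on the full frame.
    mask:      (H, W) boolean array from SAM 2 for one tracked object.
    Returns None when the mask is empty (e.g. tracking was lost).
    """
    values = depth_map[mask]          # sample depth only inside the mask
    if values.size == 0:
        return None
    return float(np.median(values))   # robust to depth outliers at mask edges
```

The median (rather than the mean) keeps the distance overlay stable when the mask bleeds slightly onto background pixels.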

Setup

Create an environment and install:

conda create -n sam2 python=3.10 -y
conda activate sam2
pip install -e .

If you use GPT models, set your API key in llm/.env:

API_KEY="..."
API_BASE=""

API_BASE is optional and only needed for Azure-style routing in the existing wrapper.
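A minimal sketch of how a wrapper might read llm/.env, assuming the simple KEY="value" format shown above (the repo's actual loading code is not shown here and may use a library such as python-dotenv instead):

```python
import os

def load_env_file(path: str = "llm/.env") -> dict[str, str]:
    """Parse KEY="value" lines and export them to the process environment."""
    values: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and anything that isn't KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            values[key.strip()] = raw.strip().strip('"')
    os.environ.update(values)
    return values
```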

Download the required checkpoints:

cd checkpoints
./download_ckpts.sh
cd ../gdino_checkpoints
hf download IDEA-Research/grounding-dino-tiny --local-dir grounding-dino-tiny
cd ../depth_anything_checkpoints
./download_metric_indoor_ckpts.sh

Main Run Commands

Webcam + text query

This is the simplest local run:

python run_live.py --model gpt-4o-2024-05-13

Pass an explicit text query:

python run_live.py --model gpt-4o-2024-05-13 --query "I am trying to find my phone"

Use a different camera:

python run_live.py --model gpt-4o-2024-05-13 --camera-index 1

Webcam + audio query input

Audio query mode listens for the wake phrase hello, then records the spoken command and transcribes it with gpt-4o-transcribe.

python run_live.py --model gpt-4o-2024-05-13 --query-input audio

If your microphone is not the default input device:

python run_live.py --model gpt-4o-2024-05-13 --query-input audio --audio-input-device-index 1

If you want to override the wake phrase:

python run_live.py --model gpt-4o-2024-05-13 --query-input audio --wake-phrase "hello"

Server frame source

The live runner can also read frames from the local FastAPI server endpoint instead of a directly attached webcam.

Start the server:

python rtc_client_server/server.py

Then run the live pipeline against the server stream:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server --stream-url http://127.0.0.1:5000/stream/latest-frame

If you also want audio query input while using server frames:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server --stream-url http://127.0.0.1:5000/stream/latest-frame --query-input audio
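Conceptually, the server frame source just polls the latest-frame endpoint over HTTP. A hedged sketch of that fetch (the runner's actual fetching and decoding code may differ; decoding the returned JPEG bytes into an image, e.g. with OpenCV, is omitted):

```python
import urllib.request

def fetch_latest_frame(url: str = "http://127.0.0.1:5000/stream/latest-frame",
                       timeout: float = 2.0) -> bytes:
    """Fetch the most recent encoded frame from the stream server.

    Returns the raw response body (e.g. JPEG bytes) for downstream decoding.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```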

Raspberry Pi client for the server

If you are using the included WebRTC client stream path, the client entrypoint is:

python rtc_client_server/client.py

The client's server_url is currently hard-coded inside rtc_client_server/client.py; update it there if needed.

Query Input Modes

Text input

Default mode:

python run_live.py --model gpt-4o-2024-05-13

Audio input

Enabled with:

python run_live.py --model gpt-4o-2024-05-13 --query-input audio

Useful audio flags:

--wake-phrase hello
--transcription-model gpt-4o-transcribe
--audio-input-device-index 1
--audio-silence-threshold 550
--min-silence-duration-s 1.0
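The two silence flags suggest energy-based endpointing: recording stops once the signal stays below the amplitude threshold for the minimum silence duration. A sketch under that assumption (the actual logic in run_live.py may differ):

```python
import numpy as np

def is_silent(chunk: np.ndarray, threshold: float = 550.0) -> bool:
    """True if a chunk of int16 PCM samples falls below the RMS threshold."""
    rms = float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))
    return rms < threshold

def utterance_finished(chunks, sample_rate: int, chunk_size: int,
                       threshold: float = 550.0,
                       min_silence_s: float = 1.0) -> bool:
    """True once the trailing chunks have been silent for min_silence_s."""
    needed = int(min_silence_s * sample_rate / chunk_size)
    if needed == 0 or len(chunks) < needed:
        return False
    return all(is_silent(c, threshold) for c in chunks[-needed:])
```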

Anchor Modes

Anchors can come from either:

  • a fixed configured list
  • the LLM extraction output

Fixed anchors

This is the default behavior.

Current fixed anchor list:

  • water bottle
  • rubber duck
  • marker
  • usb
  • towel
  • snack

Run with default fixed anchors:

python run_live.py --model gpt-4o-2024-05-13

Run with an explicit custom fixed anchor list:

python run_live.py --model gpt-4o-2024-05-13 --anchor-source fixed --fixed-anchors "water bottle,rubber duck,marker,usb,towel,snack"

LLM-derived anchors

Switch to anchors from the LLM extraction:

python run_live.py --model gpt-4o-2024-05-13 --anchor-source llm
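With --anchor-source llm, the anchors come from the same extraction call that produces targets and support_surfaces. For illustration only, the output might take a shape like this (the field names come from the pipeline description above; the values and exact schema are assumptions):

```python
# Hypothetical extraction result for "I am trying to find my phone";
# values are illustrative, not taken from the repo.
extraction = {
    "targets": ["phone"],
    "anchors": ["water bottle", "marker"],
    "support_surfaces": ["table", "desk"],
}
```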

Useful Runtime Flags

Disable depth:

python run_live.py --model gpt-4o-2024-05-13 --disable-depth

Use the server frame source:

python run_live.py --model gpt-4o-2024-05-13 --frame-source server

Lower target detection thresholds are now the defaults:

  • --target-box-threshold 0.30
  • --target-text-threshold 0.25
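These two thresholds follow the usual Grounding DINO post-processing convention: a candidate box survives only if its box confidence and its best text-token score both clear their thresholds. A simplified sketch, not the actual detector code:

```python
import numpy as np

def filter_detections(boxes: np.ndarray, box_scores: np.ndarray,
                      text_scores: np.ndarray,
                      box_threshold: float = 0.30,
                      text_threshold: float = 0.25) -> np.ndarray:
    """Keep candidate boxes that clear both confidence thresholds.

    boxes:       (N, 4) candidate boxes from the detector.
    box_scores:  (N,) overall box confidence.
    text_scores: (N,) score of the best-matching text token per box.
    """
    keep = (box_scores > box_threshold) & (text_scores > text_threshold)
    return boxes[keep]
```

Lowering either threshold admits more candidate boxes for SAM 2 initialization at the cost of more false re-detections.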

Notes

  • The audio wake detection in this version is transcription-based, not a local keyword spotter.
  • In audio mode, say hello, then briefly pause, then say the command.
  • The live pipeline entrypoint is run_live.py.
  • The stream server entrypoint is rtc_client_server/server.py.

Citations

If you use this repository, cite the upstream projects and papers it builds on.

  • Original Streaming Grounded SAM 2 repo
  • SAM 2
  • Grounding DINO
  • Depth Anything V2
  • Segment Anything 2 codebase
  • Grounded-SAM-2

License

Apache-2.0 (LICENSE) and BSD-3-Clause (LICENSE_cctorch).