Open-World Text-Based 3D Object Search

Open-vocabulary 3D object search using frozen point cloud encoders (Concerto / Utonia) and a CLIP-aligned MLP translation head. Validated on S3DIS Area 5 and in-the-wild Polycam scans.

Overview

This project builds and evaluates a pipeline for open-world text-based object search in 3D point cloud scenes. The core idea:

Frozen 3D encoder: We use Concerto Small (39M params, pretrained on large-scale 3D data) to extract per-point features from indoor scans. The backbone is never fine-tuned.
CLIP-aligned MLP translation head: A lightweight 2–3 layer MLP maps Concerto's 3D feature space into CLIP's text embedding space. It is trained with supervision on S3DIS points, each paired with the CLIP text embedding of its ground-truth label.
Open-vocabulary querying: At inference, a user provides a free-text query (e.g., "red chair", "lamp", "whiteboard"). The query is embedded by CLIP's text encoder and matched against the translated per-point features via cosine similarity, producing a heatmap over the point cloud.
In-the-wild generalization: We test the pipeline on at least one scene captured with a mobile LiDAR/photogrammetry app (Polycam), exported as .ply.
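
The querying step can be sketched in a few lines. This is a minimal illustration in plain Python, not the code in src/: `cosine_similarity` and `query_scores` are hypothetical names, and the 3-D toy vectors stand in for the real CLIP-space embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def query_scores(point_embeddings, text_embedding):
    """One score per point; a higher score means a closer match to the query."""
    return [cosine_similarity(p, text_embedding) for p in point_embeddings]

# Toy stand-ins: three translated point features and one text query embedding.
points = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 0.0]]
query = [3.0, 0.0, 0.0]
scores = query_scores(points, query)
print(scores)
```

Because cosine similarity ignores vector length, the point at `[0.0, 2.0, 0.0]` scores the same as any other vector orthogonal to the query, regardless of its magnitude.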

Optional extension: If Concerto generalizes poorly to out-of-domain scans, we compare it with Utonia, a newer encoder (March 2026) with cross-domain robustness, provided its weights are publicly available.

Architecture

┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Point Cloud  │────▶│  Concerto Small  │────▶│  Per-point 3D    │
│  (XYZ + RGB)  │     │  (frozen, 39M)   │     │  features (D=256)│
└──────────────┘     └─────────────────┘     └───────┬──────────┘
                                                      │
                                                      ▼
                                              ┌───────────────┐
                                              │ MLP Translation│
                                              │ Head (trainable│
                                              │ 2–3 layers)    │
                                              └───────┬───────┘
                                                      │
                                                      ▼
┌──────────────┐     ┌─────────────────┐     ┌──────────────────┐
│  Text query   │────▶│  CLIP Text       │────▶│  Cosine sim /    │
│  "red chair"  │     │  Encoder (frozen)│     │  heatmap on PC   │
└──────────────┘     └─────────────────┘     └──────────────────┘
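
The only trainable block in this diagram is the MLP head. As a rough size check, the sketch below counts its parameters using the layer sizes listed under the training hyperparameters (256 → 512 → 512 → 512). It assumes plain fully connected layers with bias terms and no normalization layers, which this README does not state; `linear_params` and `mlp_params` are illustrative names, not functions from this repository.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def mlp_params(layer_sizes):
    """Total parameters of an MLP given its layer widths, input first."""
    return sum(
        linear_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

head_sizes = [256, 512, 512, 512]   # from the hyperparameter list
encoder_params = 39_000_000         # Concerto Small, as stated in the overview

head_params = mlp_params(head_sizes)
print(head_params)                              # 656896
print(f"{head_params / encoder_params:.2%}")    # share relative to the encoder
```

Under these assumptions the head has 656,896 parameters, under 2% of the frozen encoder's 39M.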

Repository Structure

Deep_learning_project/
├── README.md                      # This file
├── LICENSE
├── .gitignore
├── pyproject.toml                 # Project dependencies (uv)
│
├── docs/                          # Documentation & papers
│   ├── Concerto.pdf
│   ├── Utonia.pdf
│   ├── repository_guide.md        # Repo conventions, branching, version control
│   ├── collaboration_plan.md      # GitHub + Drive workflow
│   ├── feasibility_evaluation.md  # Risk assessment
│   ├── work_plan_leonardo.md      # Per-person task sheets
│   ├── work_plan_ricardo.md
│   ├── work_plan_adrian.md
│   └── work_plan_matteo.md
│
├── configs/                       # Training & eval config files (YAML)
│   ├── train_mlp_s3dis.yaml
│   └── eval_s3dis.yaml
│
├── src/                           # Core Python source code
│   ├── __init__.py
│   ├── encoder.py                 # Concerto feature extraction wrapper
│   ├── translation_head.py        # MLP definition & forward pass
│   ├── clip_utils.py              # CLIP text embedding helpers
│   ├── dataset.py                 # S3DIS + Polycam dataset loaders
│   ├── train.py                   # Training loop for the MLP head
│   ├── evaluate.py                # Quantitative evaluation (mIoU, top-k)
│   └── visualize.py               # 3D heatmap visualization utilities
│
├── notebooks/                     # Colab notebooks (one per workflow)
│   ├── 01_setup_and_data.ipynb
│   ├── 02_feature_extraction.ipynb
│   ├── 03_train_mlp.ipynb
│   ├── 04_evaluate.ipynb
│   └── 05_demo.ipynb
│
├── scripts/                       # CLI utility scripts
│   ├── extract_features.py        # Batch feature extraction
│   ├── prepare_s3dis.py           # S3DIS preprocessing
│   └── export_polycam.py          # Polycam .ply → pipeline format
│
├── tests/                         # Unit & smoke tests
│   └── test_translation_head.py
│
└── presentation/                  # Final slides & demo materials
    └── .gitkeep

Dependencies

Package	Version	Purpose
`torch`	≥ 2.1	Core framework
`pointcept`	latest	Concerto encoder & data utilities
`open_clip_torch`	≥ 2.24	CLIP text encoder
`spconv-cu118` / `spconv-cu12x`	≥ 2.3	Sparse convolution backend; exact wheel depends on the active Colab CUDA runtime
`open3d`	≥ 0.18	Point cloud I/O & visualization
`numpy`, `scipy`	latest	Numerical utilities
`pyyaml`	latest	Config parsing
`wandb` (optional)	latest	Experiment tracking
`plotly` (optional)	latest	Interactive 3D visualization

A pyproject.toml is provided. On Colab, install with:
!pip install uv
!uv pip install --system -e .

notebooks/03_train_mlp.ipynb skips Concerto/spconv entirely when pre-extracted features/s3dis_area5/*.npz files already exist. Otherwise, it tries a few spconv wheels automatically before failing.
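
The cache check described above can be approximated as follows. This is a sketch under assumptions, not the notebook's actual code: `cached_feature_files` and `needs_extraction` are hypothetical helpers, and the file name `room_1.npz` is made up for the demonstration.

```python
import tempfile
from pathlib import Path

def cached_feature_files(feature_dir):
    """Sorted .npz files in feature_dir; empty list if the directory is missing."""
    directory = Path(feature_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.npz"))

def needs_extraction(feature_dir):
    """True when no pre-extracted features are available."""
    return len(cached_feature_files(feature_dir)) == 0

# Demonstration in a temporary directory instead of features/s3dis_area5.
with tempfile.TemporaryDirectory() as tmp:
    missing_dir = needs_extraction(Path(tmp) / "missing")
    empty_before = needs_extraction(tmp)
    (Path(tmp) / "room_1.npz").touch()   # hypothetical cached feature file
    empty_after = needs_extraction(tmp)

print(missing_dir, empty_before, empty_after)
```

A check like this only tests for the presence of files. It does not verify that the cache is complete or that the files load correctly.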

Setup & Installation (Colab)

# 1. Clone the repo
!git clone https://github.com/Gandata/Deep_learning_project.git
%cd Deep_learning_project

# 2. Install dependencies
!pip install uv
!uv pip install --system -e .

# 3. Mount Google Drive (for data & checkpoints)
from google.colab import drive
drive.mount('/content/drive')

# 4. Symlink data
!ln -s /content/drive/MyDrive/DL_Project/data ./data
!ln -s /content/drive/MyDrive/DL_Project/checkpoints ./checkpoints

Data Preparation

S3DIS Area 5

Download S3DIS from the Stanford website (access requires submitting a request form).
Place the raw data in Drive > DL_Project > data > s3dis_raw/.

Run preprocessing:

python scripts/prepare_s3dis.py --input data/s3dis_raw --output data/s3dis_processed

Polycam In-the-Wild Scan

Capture a scene using Polycam on iOS/Android.
Export as .ply (point cloud mode, with RGB).
Place in Drive > DL_Project > data > polycam/.
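
Before running the pipeline on an export, it can help to confirm that the file actually carries per-vertex color. The sketch below reads only the PLY header, which is ASCII text even in binary files. It assumes the color properties are named `red`, `green`, and `blue`, which is the common convention but is not confirmed here for Polycam exports; `read_ply_header` and `has_rgb` are hypothetical helpers, not part of scripts/export_polycam.py.

```python
import os
import tempfile

def read_ply_header(path):
    """Return (vertex_count, vertex_property_names) from a PLY header."""
    vertex_count = None
    properties = []
    in_vertex = False
    with open(path, "rb") as f:
        if f.readline().strip() != b"ply":
            raise ValueError("not a PLY file")
        for raw in f:
            tokens = raw.decode("ascii", errors="replace").split()
            if not tokens:
                continue
            if tokens[0] == "end_header":
                break
            if tokens[0] == "element" and len(tokens) >= 3:
                in_vertex = tokens[1] == "vertex"
                if in_vertex:
                    vertex_count = int(tokens[2])
            elif tokens[0] == "property" and in_vertex:
                properties.append(tokens[-1])  # property name is the last token
    return vertex_count, properties

def has_rgb(properties):
    """True when the vertex element declares red, green, and blue."""
    return {"red", "green", "blue"} <= set(properties)

# Tiny ASCII PLY written to a temporary file for demonstration.
ply_text = (
    "ply\nformat ascii 1.0\nelement vertex 2\n"
    "property float x\nproperty float y\nproperty float z\n"
    "property uchar red\nproperty uchar green\nproperty uchar blue\n"
    "end_header\n0 0 0 255 0 0\n1 0 0 0 255 0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scan.ply")
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(ply_text)
    vertex_count, properties = read_ply_header(path)

print(vertex_count, properties, has_rgb(properties))
```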

Training the Translation Head

python src/train.py --config configs/train_mlp_s3dis.yaml

Key hyperparameters (see config):

MLP layers: 3 (256 → 512 → 512 → 512, with ReLU + dropout)
Loss: MSE or cosine embedding loss between predicted embeddings and CLIP text embeddings of ground-truth labels
Optimizer: AdamW, lr=1e-3, weight decay=1e-4
Epochs: 50–100 (early stopping on val loss)
Batch size: Adjusted to fit T4 VRAM (~15GB)
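
The two loss options behave differently with respect to embedding magnitude. The sketch below computes both for a single predicted/target pair in plain Python; it is illustrative only and does not reproduce src/train.py. The function names are hypothetical. For reference, PyTorch's `nn.CosineEmbeddingLoss` with target 1 computes the same 1 − cos quantity, and `nn.MSELoss` averages squared errors by default.

```python
import math

def cosine_loss(pred, target):
    """1 - cos(pred, target): 0 when aligned, 2 when pointing opposite ways."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        return 1.0  # convention for this sketch: treat a zero vector as orthogonal
    return 1.0 - dot / norm

def mse_loss(pred, target):
    """Mean squared error over the embedding dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Same direction, different length: only MSE penalizes the mismatch.
scaled = ([2.0, 0.0], [1.0, 0.0])
# Orthogonal unit vectors: both losses penalize the mismatch.
orthogonal = ([1.0, 0.0], [0.0, 1.0])

for pred, target in (scaled, orthogonal):
    print(cosine_loss(pred, target), mse_loss(pred, target))
```

Since queries are matched by cosine similarity at inference, the cosine loss optimizes the same quantity that is used for ranking, while MSE additionally constrains the norm of the predicted embedding.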

Evaluation

python src/evaluate.py --config configs/eval_s3dis.yaml --split area5

Metrics:

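mIoU (semantic segmentation via nearest-label matching: each point takes the class whose CLIP text embedding is most similar to its translated feature)
Top-k retrieval accuracy (given a text query, the percentage of the k highest-scoring points that belong to the queried class)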
Qualitative heatmaps (per-query 3D visualizations)
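
The two quantitative metrics can be written down compactly. This is a plain-Python sketch on toy labels, not the implementation in src/evaluate.py; `mean_iou` and `topk_precision` are hypothetical names, and skipping classes that appear in neither prediction nor ground truth is an assumption of this sketch.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def topk_precision(scores, gt, target_class, k):
    """Fraction of the k highest-scoring points whose label is target_class."""
    k = min(k, len(scores))
    if k == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if gt[i] == target_class)
    return hits / k

# Toy example: four points, two classes.
pred_labels = [0, 0, 1, 1]
gt_labels = [0, 1, 1, 1]
miou = mean_iou(pred_labels, gt_labels, num_classes=2)

# Similarity scores for a query whose correct class is 1.
query_scores = [0.9, 0.1, 0.8, 0.3]
query_gt = [1, 0, 0, 1]
precision_at_2 = topk_precision(query_scores, query_gt, target_class=1, k=2)

print(miou, precision_at_2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Of the two top-scoring points, one belongs to the queried class, giving 0.5.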

In-the-Wild Demo

The notebooks/05_demo.ipynb notebook provides an interactive demo:

Load a Polycam .ply scan
Extract Concerto features (frozen)
Apply the trained MLP translation head
Enter a free-text query → visualize the heatmap on the 3D scene
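
The last step, turning similarity scores into colors, can be sketched as below. The blue-to-red ramp and the function names are assumptions made for illustration; the demo notebook may use a different colormap. The output is RGB in [0, 1], which is the range Open3D expects for point colors.

```python
def normalize(scores):
    """Min-max scale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def heat_color(t):
    """Linear ramp from blue (t=0) to red (t=1), as an (r, g, b) tuple."""
    return (t, 0.0, 1.0 - t)

# Toy cosine similarities for three points.
similarities = [-1.0, 0.0, 1.0]
colors = [heat_color(t) for t in normalize(similarities)]
print(colors)
```

Min-max scaling is applied per query, so the best-matching point is always fully red even when no point matches well. Comparing heatmaps across queries would need a fixed scale instead.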

Team & Acknowledgements

Member	Role
Ricardo	Lead engineer — feature extraction, evaluation, Utonia comparison
Leonardo	Encoder integration, MLP architecture, training pipeline
Adrian	Data preparation, Polycam pipeline, demo notebook
Matteo	Evaluation scripts, visualization, presentation & slides

Course: Deep Learning — Master's program
Compute: Google Colab free tier (NVIDIA T4 GPU)

References

Concerto: Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations (GitHub | HuggingFace)
Utonia: Utonia: Universal 3D Tokenization via Neural Codec (GitHub | HuggingFace)
CLIP: Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021
S3DIS: Armeni et al., 3D Semantic Parsing of Large-Scale Indoor Spaces, CVPR 2016
Pointcept: github.com/Pointcept/Pointcept
