M3 is a framework for rendering static 3D scenes with RGB and foundation model embeddings, enabling rich spatial and semantic understanding.

Precise Feature Reconstruction: M3 uses Gaussian Memory Attention to reconstruct spatial memory directly in the foundation model's embedding space; by avoiding distillation, the source model's original embeddings are preserved.

Efficient Feature Representation: M3 compresses the embedding attached to each Gaussian primitive from 64 dimensions down to 16–32, matching or exceeding performance with at least 50% fewer dimensions.
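The two ideas compose naturally: each Gaussian carries only a compact latent, and attention over a bank of principal scene components (full-dimension foundation embeddings) lifts it back into the source model's space. Below is a minimal PyTorch sketch; the class name, dimensions, and randomly initialized memory bank are illustrative assumptions, not the released M3 code.

```python
# Minimal sketch of Gaussian Memory Attention (shapes and names are
# illustrative assumptions, not the official M3 implementation).
import torch
import torch.nn as nn

class GaussianMemoryAttention(nn.Module):
    """Reconstructs foundation-model features from low-dim Gaussian latents.

    Each Gaussian primitive stores only a compact latent (e.g. 16-32 dims);
    the principal scene components (PSCs) act as a memory bank of
    full-dimension foundation embeddings, so the output stays in the
    source model's embedding space rather than a distilled one.
    """
    def __init__(self, latent_dim=16, psc_dim=1024, num_psc=64):
        super().__init__()
        # Project the compact per-Gaussian latent into the PSC key space.
        self.to_query = nn.Linear(latent_dim, psc_dim, bias=False)
        # Memory bank of principal scene components. Random here for the
        # sketch; in practice these would be extracted from the foundation
        # model's own features for the scene.
        self.register_buffer("psc", torch.randn(num_psc, psc_dim))

    def forward(self, latents):
        # latents: (N, latent_dim) compact embeddings, one per Gaussian
        q = self.to_query(latents)                              # (N, psc_dim)
        attn = torch.softmax(
            q @ self.psc.T / self.psc.shape[-1] ** 0.5, dim=-1  # (N, num_psc)
        )
        return attn @ self.psc   # (N, psc_dim): features in the source space
```

Because the output is a convex combination of the stored principal scene components, every reconstructed feature lies in the foundation model's original embedding space by construction.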

The best part? Fewer parameters, original embedding space!

Technical Summary Video

Interactive Demo

We propose a new visualization tool that supports streaming 3D scene reconstructions, both RGB and foundation model embeddings, with a GPU backend.

Real Robot Deployment

We deploy M3 on tabletop manipulation tasks and show demo videos; note that M3 is currently used only for localization and mapping.

Memory to Rendering

In the visualization below, we show the raw feature manifold (blue points) and the memory extracted by M3 (red points). With the proposed M3 method, we apply Gaussian memory attention over the principal scene components to produce the rendered high-resolution feature map (third row).
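As a rough sketch of that last step, reusing the GaussianMemoryAttention module from the sketch above: the rendered per-pixel latents are flattened, lifted through the memory attention, and reshaped into the high-resolution feature map. Resolution and feature dimensions are assumptions for illustration.

```python
# Rendered low-dim feature map -> foundation-space feature map (third row).
# Assumes the GaussianMemoryAttention sketch above; shapes are illustrative.
import torch

gma = GaussianMemoryAttention(latent_dim=16, psc_dim=1024)

rendered = torch.randn(480, 640, 16)        # per-pixel latents from splatting
flat = rendered.reshape(-1, 16)             # (H*W, latent_dim)
features = gma(flat).reshape(480, 640, -1)  # (H, W, 1024) in the source space
```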