LERF optimizes a dense, multi-scale 3D language field by volume rendering CLIP embeddings along training rays, supervised with multi-scale CLIP features extracted from the training images. Once optimized, LERF can render 3D relevancy maps for language queries interactively in real time.
It enables pixel-aligned language queries on the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, and it supports long-tail, open-vocabulary queries hierarchically across the volume.
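To make the querying step concrete, here is a minimal sketch of a LERF-style relevancy score: each rendered embedding is compared against the query embedding and a set of canonical negative phrases via pairwise softmaxes, taking the minimum over negatives. The arrays below are random stand-ins; a real implementation would use CLIP text and rendered image embeddings.

```python
import numpy as np

def relevancy(rendered, query, negatives):
    """LERF-style relevancy: for each rendered embedding, the minimum over
    canonical negatives of softmax(query similarity vs. negative similarity).
    All inputs are assumed L2-normalized:
      rendered (N, D), query (D,), negatives (K, D)."""
    q_sim = rendered @ query         # (N,) cosine similarity to the query
    n_sim = rendered @ negatives.T   # (N, K) similarity to each negative phrase
    exp_q = np.exp(q_sim)[:, None]   # (N, 1)
    exp_n = np.exp(n_sim)            # (N, K)
    scores = exp_q / (exp_q + exp_n)  # pairwise two-way softmax, in (0, 1)
    return scores.min(axis=1)         # (N,) worst case over negatives
```

An embedding close to the query scores near 1, while one closer to any negative phrase falls below 0.5, which is what makes the map usable as a soft relevancy heatmap.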
Why 3D CLIP embeddings?
When supervised from multiple viewpoints, 3D CLIP embeddings demonstrate greater resilience to occlusions and changes in viewpoint compared to their 2D counterparts. Additionally, they align more accurately with the structure of the 3D scene, resulting in a sharper and clearer appearance.
To supervise the language embeddings, an image pyramid of CLIP features is precomputed for each training view. During optimization, each sampled ray is supervised with a CLIP embedding interpolated from this pyramid.
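The scale interpolation described above can be sketched as follows. The pyramid here holds random placeholder features rather than real CLIP crop embeddings, and the linear interpolation over the scale axis (with renormalization) is an illustrative simplification of how a scale-conditioned supervision target could be drawn from the pyramid.

```python
import numpy as np

# Hypothetical precomputed pyramid: one embedding per pixel per crop scale.
# Real LERF encodes multi-scale image crops with CLIP; these are stand-ins.
H, W, D = 4, 4, 8
scales = np.array([0.1, 0.25, 0.5, 1.0])           # crop size relative to image
rng = np.random.default_rng(0)
pyramid = rng.normal(size=(len(scales), H, W, D))  # (S, H, W, D)
pyramid /= np.linalg.norm(pyramid, axis=-1, keepdims=True)

def target_embedding(pyramid, scales, y, x, s):
    """Interpolate the supervision embedding across the scale axis for
    pixel (y, x) at a sampled scale s, then renormalize to unit length."""
    i = np.clip(np.searchsorted(scales, s) - 1, 0, len(scales) - 2)
    t = np.clip((s - scales[i]) / (scales[i + 1] - scales[i]), 0.0, 1.0)
    emb = (1 - t) * pyramid[i, y, x] + t * pyramid[i + 1, y, x]
    return emb / np.linalg.norm(emb)
```

Sampling a scale per ray and interpolating the target this way is what lets a single field answer queries at multiple levels of the scene hierarchy.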
Natural language interaction also opens the door for large language models (LLMs) to engage with the 3D world. In one example, ChatGPT is asked which objects are needed to clean up a coffee spill, and the resulting language queries are grounded directly in the scene, showing how LLMs can be connected to practical, real-world settings through natural language.