Small Vision Language Models as Document Embedding Models

Recent advances in vision language models have opened new possibilities for document processing that dramatically outperform traditional OCR-based approaches. This report examines the feasibility, best practices, and current state of the art for using small VLMs (under 3B parameters) as embedding models for semantic search and document understanding. The research reveals that small VLMs can achieve 10-20% better accuracy than traditional methods while processing documents directly as images, eliminating complex preprocessing pipelines. However, this comes with trade-offs in computational requirements and processing speed that organizations must carefully consider.

Two revolutionary approaches dominate the landscape

The field has converged on two primary architectures for document embeddings. VLM2Vec converts any state-of-the-art VLM into an embedding model through contrastive learning, creating single dense vectors o...
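To make the single-vector idea concrete, here is a minimal sketch of how a VLM's token-level outputs could be pooled into one dense embedding per document and trained with an in-batch contrastive (InfoNCE-style) loss, in the spirit of the VLM2Vec approach described above. The pooling choice, temperature, and dummy tensors are illustrative assumptions, not the published recipe.

```python
# Hedged sketch: pooling VLM hidden states into single dense vectors and
# training them contrastively. Shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def pool_to_embedding(last_hidden_state: torch.Tensor,
                      attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token states into one vector per input, then L2-normalize."""
    mask = attention_mask.unsqueeze(-1).float()        # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-6)           # (B, 1)
    return F.normalize(summed / counts, dim=-1)        # (B, H)

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: the i-th query should match the i-th document."""
    logits = query_emb @ doc_emb.T / temperature       # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Usage with dummy tensors standing in for the VLM's outputs:
B, T, H = 4, 128, 768
queries = pool_to_embedding(torch.randn(B, T, H), torch.ones(B, T, dtype=torch.long))
docs = pool_to_embedding(torch.randn(B, T, H), torch.ones(B, T, dtype=torch.long))
loss = info_nce_loss(queries, docs)
```

At retrieval time the same pooled, normalized vectors can be compared with a simple dot product (equivalent to cosine similarity here), so the document side can be pre-computed and indexed once.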