
Dense Vectors

Learn how dense vector embeddings enable semantic search.


Dense vectors (embeddings) are numerical representations of text that capture semantic meaning.

How It Works

Text is transformed into a high-dimensional vector:

"The quick brown fox" → [0.12, -0.45, 0.78, ..., 0.23]  // 1024 dimensions

Similar texts produce similar vectors, enabling semantic search.
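A minimal sketch of what that enables: embed a query and a few documents, then rank the documents by similarity to the query. The LakehouseClient import path here is an assumption; client.embed is the call used throughout this page, and cosine similarity is defined under Similarity Metrics below.

python
import numpy as np
from lakehouse import LakehouseClient  # import path is an assumption

client = LakehouseClient(api_key="...")

docs = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps over a sleepy hound",
    "Quarterly revenue grew 12% year over year",
]

query = np.array(client.embed("speedy fox"))
vectors = [np.array(client.embed(d)) for d in docs]

# Rank documents by cosine similarity to the query (formula below).
scores = [float(query @ v / (np.linalg.norm(query) * np.linalg.norm(v))) for v in vectors]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")

The two fox sentences share almost no words with the query, yet score far above the revenue sentence: that gap is what "semantic" search means.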

BGE-M3: Our Default Model

LH42 uses BGE-M3, short for BAAI General Embedding, whose three Ms are:

  • Multi-lingual (100+ languages)
  • Multi-functional (retrieval, classification, clustering)
  • Multi-granularity (sentence to document)
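Because BGE-M3 embeds all languages into one shared vector space, translations land close together. A hedged sketch, reusing the client and numpy import from the example above:

python
# The same sentence in English and French lands nearby in the shared space.
en = np.array(client.embed("Where is the train station?"))
fr = np.array(client.embed("Où est la gare ?"))

similarity = float(en @ fr / (np.linalg.norm(en) * np.linalg.norm(fr)))
print(similarity)  # expected to be high for faithful translations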

Vector Dimensions

python
# BGE-M3 produces 1024-dimensional vectors
embedding = client.embed("Hello world")
print(len(embedding))  # 1024

Similarity Metrics

We use cosine similarity to compare vectors:

similarity = (A · B) / (||A|| × ||B||)

Range: -1 (opposite direction) to 1 (same direction)
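The formula maps directly to numpy: the numerator is a dot product, the denominator the product of the two vector norms. A minimal, self-contained sketch:

python
import numpy as np

def cosine_similarity(a, b):
    # (A · B) / (||A|| × ||B||)
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [0, 1]))   # 0.0 - orthogonal
print(cosine_similarity([1, 2], [2, 4]))   # 1.0 - same direction
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 - opposite direction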

Batch Embedding

For efficiency, embed multiple texts at once:

python
texts = ["Document 1", "Document 2", "Document 3"]
embeddings = client.embed_batch(texts)  # one 1024-dimensional vector per input text

Custom Models

Bring your own embedding model:

python
client = LakehouseClient(
    api_key="...",
    embedding_model="your-custom-model"
)

Best Practices

  1. Chunk appropriately - 256-512 tokens per chunk
  2. Include context - Add titles and metadata to chunks
  3. Normalize vectors - Ensures consistent similarity scores (see the sketch after this list)
  4. Cache embeddings - Avoid re-computing for the same text
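For practice 3, L2 normalization scales every vector to unit length, so the dot product of two normalized vectors equals their cosine similarity, which many vector indexes exploit. A sketch, assuming numpy and the client from the examples above:

python
import numpy as np

def l2_normalize(vec):
    """Scale to unit length so a plain dot product equals cosine similarity."""
    v = np.asarray(vec, dtype=np.float64)
    return v / np.linalg.norm(v)

a = l2_normalize(client.embed("Hello world"))
b = l2_normalize(client.embed("Hi there"))
print(float(a @ b))  # same value as cosine similarity on the raw vectors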