fix(rag): use query_points instead of deprecated search method

qdrant-client 1.17.0 removed the search() method in favor of query_points(). Update the wrapper to use the new API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(rag): use Ollama for embeddings instead of embedding-service
2026-02-27 07:51:12 +01:00 · 2026-02-27 07:46:57 +01:00 · 2026-02-26 23:29:23 +01:00 · 2026-02-26 23:24:47 +01:00
3 changed files with 71 additions and 27 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -385,8 +385,12 @@ services:
      MINIO_BUCKET: ${MINIO_BUCKET:-breakpilot-rag}
      MINIO_SECURE: "false"
      EMBEDDING_SERVICE_URL: http://embedding-service:8087
+      OLLAMA_URL: ${OLLAMA_URL:-http://host.docker.internal:11434}
+      OLLAMA_EMBED_MODEL: ${OLLAMA_EMBED_MODEL:-bge-m3}
      JWT_SECRET: ${JWT_SECRET:-your-super-secret-jwt-key-change-in-production}
      ENVIRONMENT: ${ENVIRONMENT:-development}
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
    depends_on:
      qdrant:
        condition: service_healthy
@@ -414,7 +418,7 @@ services:
      - embedding_models:/root/.cache/huggingface
    environment:
      EMBEDDING_BACKEND: ${EMBEDDING_BACKEND:-local}
-      LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-sentence-transformers/all-MiniLM-L6-v2}
+      LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-BAAI/bge-m3}
      LOCAL_RERANKER_MODEL: ${LOCAL_RERANKER_MODEL:-cross-encoder/ms-marco-MiniLM-L-6-v2}
      PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-pymupdf}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
@@ -423,7 +427,7 @@ services:
    deploy:
      resources:
        limits:
-          memory: 4G
+          memory: 8G
    healthcheck:
      test: ["CMD", "python", "-c", "import httpx; r=httpx.get('http://127.0.0.1:8087/health'); r.raise_for_status()"]
      interval: 30s
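The compose file and `embedding_client.py` each carry their own default for the Ollama URL. Compose falls back to `http://host.docker.internal:11434`, which the `extra_hosts: host-gateway` entry makes resolvable from inside the container, while the module falls back to `http://ollama:11434` when the variable is unset. A minimal sketch of resolving these settings in one place, assuming the same variable names and defaults as the diff; `OllamaConfig` and `load_ollama_config` are hypothetical names, not part of the service.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OllamaConfig:
    """Hypothetical container for the Ollama settings the RAG service reads."""

    url: str
    model: str
    batch_size: int


def load_ollama_config(env: Mapping[str, str]) -> OllamaConfig:
    """Resolve Ollama settings using the defaults from embedding_client.py.

    Taking a mapping instead of reading os.environ directly keeps the
    lookup testable; pass os.environ in production.
    """
    url = env.get("OLLAMA_URL", "http://ollama:11434").rstrip("/")
    model = env.get("OLLAMA_EMBED_MODEL", "bge-m3")
    batch_size = int(env.get("EMBED_BATCH_SIZE", "32"))
    if batch_size < 1:
        # A zero step makes range() raise and a negative one yields no
        # batches at all, so reject both up front.
        raise ValueError(f"EMBED_BATCH_SIZE must be >= 1, got {batch_size}")
    return OllamaConfig(url=url, model=model, batch_size=batch_size)
```

In the service this would be called once as `load_ollama_config(os.environ)`; validating `EMBED_BATCH_SIZE` at startup turns a misconfiguration into an immediate error instead of a failed upload.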
--- a/rag-service/embedding_client.py
+++ b/rag-service/embedding_client.py
@@ -1,4 +1,5 @@
 import logging
+import os
 from typing import Optional
 import httpx
@@ -8,44 +9,82 @@ from config import settings
 logger = logging.getLogger("rag-service.embedding")
 _TIMEOUT = httpx.Timeout(timeout=120.0, connect=10.0)
+_EMBED_TIMEOUT = httpx.Timeout(timeout=300.0, connect=10.0)
+# Ollama config for embeddings (bge-m3, 1024-dim)
+_OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
+_OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "bge-m3")
+# Batch size for Ollama embedding requests
+_EMBED_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", "32"))
 class EmbeddingClient:
-    """HTTP client for the embedding-service (port 8087)."""
+    """
+    Hybrid client:
+    - Embeddings via Ollama (bge-m3, 1024-dim) for Qdrant compatibility
+    - Chunking + PDF extraction via embedding-service (port 8087)
+    """
    def __init__(self) -> None:
-        self._base_url: str = settings.EMBEDDING_SERVICE_URL.rstrip("/")
+        self._embed_svc_url: str = settings.EMBEDDING_SERVICE_URL.rstrip("/")
+        self._ollama_url: str = _OLLAMA_URL.rstrip("/")
+        self._embed_model: str = _OLLAMA_EMBED_MODEL
-    def _url(self, path: str) -> str:
-        return f"{self._base_url}{path}"
+    def _svc_url(self, path: str) -> str:
+        return f"{self._embed_svc_url}{path}"
    # ------------------------------------------------------------------
-    # Embeddings
+    # Embeddings (via Ollama)
    # ------------------------------------------------------------------
    async def generate_embeddings(self, texts: list[str]) -> list[list[float]]:
        """
-        Send a batch of texts to the embedding service and return a list of
-        embedding vectors.
+        Generate embeddings via Ollama's bge-m3 model.
+        Processes in batches to avoid timeout on large uploads.
        """
-        async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
-            response = await client.post(
-                self._url("/api/v1/embeddings"),
-                json={"texts": texts},
-            )
-            response.raise_for_status()
-            data = response.json()
-            return data.get("embeddings", [])
+        all_embeddings: list[list[float]] = []
+
+        for i in range(0, len(texts), _EMBED_BATCH_SIZE):
+            batch = texts[i : i + _EMBED_BATCH_SIZE]
+            batch_embeddings = []
+
+            async with httpx.AsyncClient(timeout=_EMBED_TIMEOUT) as client:
+                for text in batch:
+                    response = await client.post(
+                        f"{self._ollama_url}/api/embeddings",
+                        json={
+                            "model": self._embed_model,
+                            "prompt": text,
+                        },
+                    )
+                    response.raise_for_status()
+                    data = response.json()
+                    embedding = data.get("embedding", [])
+                    if not embedding:
+                        raise ValueError(
+                            f"Ollama returned empty embedding for model {self._embed_model}"
+                        )
+                    batch_embeddings.append(embedding)
+            all_embeddings.extend(batch_embeddings)
+            if i + _EMBED_BATCH_SIZE < len(texts):
+                logger.info(
+                    "Embedding progress: %d/%d", len(all_embeddings), len(texts)
+                )
+        return all_embeddings
    async def generate_single_embedding(self, text: str) -> list[float]:
        """Convenience wrapper for a single text."""
        results = await self.generate_embeddings([text])
        if not results:
-            raise ValueError("Embedding service returned empty result")
+            raise ValueError("Ollama returned empty result")
        return results[0]
    # ------------------------------------------------------------------
-    # Reranking
+    # Reranking (via embedding-service)
    # ------------------------------------------------------------------
    async def rerank_documents(
@@ -60,7 +99,7 @@ class EmbeddingClient:
        """
        async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
            response = await client.post(
-                self._url("/api/v1/rerank"),
+                self._svc_url("/rerank"),
                json={
                    "query": query,
                    "documents": documents,
@@ -72,7 +111,7 @@ class EmbeddingClient:
            return data.get("results", [])
    # ------------------------------------------------------------------
-    # Chunking
+    # Chunking (via embedding-service)
    # ------------------------------------------------------------------
    async def chunk_text(
@@ -88,7 +127,7 @@ class EmbeddingClient:
        """
        async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
            response = await client.post(
-                self._url("/api/v1/chunk"),
+                self._svc_url("/chunk"),
                json={
                    "text": text,
                    "strategy": strategy,
@@ -101,7 +140,7 @@ class EmbeddingClient:
            return data.get("chunks", [])
    # ------------------------------------------------------------------
-    # PDF extraction
+    # PDF extraction (via embedding-service)
    # ------------------------------------------------------------------
    async def extract_pdf(self, pdf_bytes: bytes) -> str:
@@ -111,7 +150,7 @@ class EmbeddingClient:
        """
        async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
            response = await client.post(
-                self._url("/api/v1/extract-pdf"),
+                self._svc_url("/extract-pdf"),
                files={"file": ("document.pdf", pdf_bytes, "application/pdf")},
            )
            response.raise_for_status()
--- a/rag-service/qdrant_client_wrapper.py
+++ b/rag-service/qdrant_client_wrapper.py
@@ -167,12 +167,13 @@ class QdrantClientWrapper:
                    )
            qdrant_filter = qmodels.Filter(must=must_conditions)
-        results = self.client.search(
+        results = self.client.query_points(
            collection_name=collection,
-            query_vector=query_vector,
+            query=query_vector,
            limit=limit,
            query_filter=qdrant_filter,
            score_threshold=score_threshold,
            with_payload=True,
        )
        return [
@@ -181,7 +182,7 @@ class QdrantClientWrapper:
                "score": hit.score,
                "payload": hit.payload or {},
            }
-            for hit in results
+            for hit in results.points
        ]
    # ------------------------------------------------------------------
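The two changed lines in `qdrant_client_wrapper.py` reflect different return types: `search()` returned a list of scored points, whereas `query_points()` returns a response object whose hits live in `.points`. If the wrapper has to run against more than one qdrant-client version during a rollout, one option is to normalise both shapes in a single helper. A minimal sketch; `hits_to_dicts` is a hypothetical name, the `id` key is an assumption (the diff elides the first entry of the result dict), and `SimpleNamespace` objects stand in for real Qdrant points.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def hits_to_dicts(results: Any) -> list[dict]:
    """Convert Qdrant search output into the wrapper's list-of-dicts shape.

    Accepts either a query_points() response (hits under `.points`) or the
    plain list that the older search() method returned.
    """
    hits: Iterable[Any] = getattr(results, "points", results)
    return [
        {
            "id": hit.id,  # assumed key; the diff does not show this entry
            "score": hit.score,
            "payload": hit.payload or {},
        }
        for hit in hits
    ]


if __name__ == "__main__":
    point = SimpleNamespace(id=1, score=0.87, payload=None)
    new_style = SimpleNamespace(points=[point])  # shape of query_points() output
    old_style = [point]  # shape of search() output
    print(hits_to_dicts(new_style) == hits_to_dicts(old_style))
```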
Author	SHA1	Message	Date
Benjamin Admin	5c8307f58a	fix(rag): use query_points instead of deprecated search method qdrant-client 1.17.0 removed the search() method in favor of query_points(). Update the wrapper to use the new API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (CI: all checks were successful; go-lint, python-lint and nodejs-lint skipped; test-go-consent 38s, test-python-voice 36s, test-bqas 28s)	2026-02-27 07:51:12 +01:00
Benjamin Admin	92ca5b7ba5	feat(rag): use Ollama for embeddings instead of embedding-service Switch to Ollama's bge-m3 model (1024-dim) for generating embeddings, solving the dimension mismatch with Qdrant collections. Embedding-service still used for chunking, reranking, and PDF extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 07:46:57 +01:00
Benjamin Admin	d7cc6bfbc7	Switch embedding model to bge-m3 (1024-dim) The Qdrant collections use 1024-dim vectors (bge-m3) but the embedding-service was configured with all-MiniLM-L6-v2 (384-dim). Also increase memory limit to 8G for the larger model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 23:29:23 +01:00
Benjamin Admin	13ba1457b0	Fix embedding client endpoint paths The embedding-service exposes endpoints at root level (/chunk, /embed, /extract-pdf, /rerank) not under /api/v1/. Fix the RAG service's embedding client to use the correct paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 23:24:47 +01:00
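Commits d7cc6bfbc7 and 92ca5b7ba5 both trace back to a dimension mismatch: the Qdrant collections hold 1024-dim bge-m3 vectors, while all-MiniLM-L6-v2 produces 384-dim ones. A startup check could surface that before any upsert or query fails. A minimal sketch, assuming the collection's vector size is known up front (for a collection with a single unnamed vector, qdrant-client exposes it as `get_collection(name).config.params.vectors.size`); `KNOWN_DIMS` and `check_embedding_dim` are hypothetical, and the table lists only the models named in these commits.

```python
from typing import Mapping, Optional, Sequence

# Output sizes of the models named in the commit messages above.
KNOWN_DIMS: Mapping[str, int] = {
    "bge-m3": 1024,
    "BAAI/bge-m3": 1024,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}


def check_embedding_dim(
    model: str,
    collection_dim: int,
    sample: Optional[Sequence[float]] = None,
) -> int:
    """Fail fast if `model` cannot fill a collection of size `collection_dim`.

    Prefers the length of a sample embedding when one is given, otherwise
    looks the model up in KNOWN_DIMS. Returns the verified dimension.
    """
    if sample is not None:
        dim = len(sample)  # a real embedding is the most reliable evidence
    elif model in KNOWN_DIMS:
        dim = KNOWN_DIMS[model]
    else:
        raise ValueError(f"unknown model {model!r}; pass a sample embedding")
    if dim != collection_dim:
        raise ValueError(
            f"model {model!r} produces {dim}-dim vectors, "
            f"but the collection expects {collection_dim}"
        )
    return dim
```

Calling this once at startup with a probe vector, for example the result of `generate_single_embedding("probe")`, would have flagged the 384-vs-1024 mismatch before any documents were ingested.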