feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs

Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-05-10 15:08:15 +02:00
parent 81db904b3e
commit 8510af46eb
19 changed files with 3173 additions and 6 deletions
+68
View File
@@ -235,6 +235,74 @@ class QdrantClientWrapper:
logger.info("Deleted points from '%s' with filter %s", collection, filter_conditions)
return True
# ------------------------------------------------------------------
# Tenant document helpers
# ------------------------------------------------------------------
async def get_unique_documents(self, collection: str) -> list[dict]:
"""Get unique documents from a collection by scrolling and grouping."""
try:
self.client.get_collection(collection)
except Exception:
return []
docs: dict[str, dict] = {}
offset = None
while True:
result = self.client.scroll(
collection_name=collection,
scroll_filter=None,
limit=100,
offset=offset,
with_payload=True,
with_vectors=False,
)
points, next_offset = result
for pt in points:
payload = pt.payload or {}
doc_id = payload.get("document_id", "")
if doc_id and doc_id not in docs:
docs[doc_id] = {
"id": doc_id,
"filename": payload.get("filename", ""),
"file_size": payload.get("file_size", 0),
"status": "indexed",
"chunk_count": 0,
"collection": collection,
}
if doc_id:
docs[doc_id]["chunk_count"] += 1
if next_offset is None:
break
offset = next_offset
return list(docs.values())
async def count_by_filter(
self, collection: str, filter_conditions: dict[str, Any]
) -> int:
"""Count points matching filter."""
try:
self.client.get_collection(collection)
except Exception:
return 0
must_conditions = []
for key, value in filter_conditions.items():
must_conditions.append(
qmodels.FieldCondition(
key=key, match=qmodels.MatchValue(value=value)
)
)
result = self.client.count(
collection_name=collection,
count_filter=qmodels.Filter(must=must_conditions),
exact=True,
)
return result.count
# ------------------------------------------------------------------
# Info
# ------------------------------------------------------------------