feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs

Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -122,6 +122,16 @@ class MinioClientWrapper:
             logger.error("Failed to delete '%s': %s", object_name, exc)
             raise
 
+    async def delete_by_prefix(self, prefix: str) -> int:
+        """Remove all objects under a prefix."""
+        objects = self.client.list_objects(settings.MINIO_BUCKET, prefix=prefix, recursive=True)
+        count = 0
+        for obj in objects:
+            self.client.remove_object(settings.MINIO_BUCKET, obj.object_name)
+            count += 1
+        logger.info("Deleted %d objects with prefix '%s'", count, prefix)
+        return count
+
     # ------------------------------------------------------------------
     # Presigned URL
     # ------------------------------------------------------------------
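
Review note on delete_by_prefix: the method is declared async, but the list_objects and remove_object calls in the hunk look like the synchronous MinIO client, so they would block the event loop while a large prefix is being cleared. The sketch below shows one possible way to move that work onto a worker thread with asyncio.to_thread. It is an illustration under stated assumptions, not the code from this commit: FakeMinio, the free-function form, the bucket name "docs" and the tenant-style prefixes are hypothetical stand-ins so the example runs without a MinIO server.

```python
import asyncio
from types import SimpleNamespace


class FakeMinio:
    """Hypothetical in-memory stand-in for the two client calls used in the diff."""

    def __init__(self, names):
        self.names = set(names)

    def list_objects(self, bucket, prefix="", recursive=False):
        # Build the full listing first so removals during iteration are safe.
        matches = [n for n in sorted(self.names) if n.startswith(prefix)]
        return iter([SimpleNamespace(object_name=n) for n in matches])

    def remove_object(self, bucket, object_name):
        self.names.discard(object_name)


def _delete_by_prefix_sync(client, bucket, prefix):
    """Blocking loop, same logic as the method added in the hunk."""
    count = 0
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        client.remove_object(bucket, obj.object_name)
        count += 1
    return count


async def delete_by_prefix(client, bucket, prefix):
    """Run the blocking loop in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_delete_by_prefix_sync, client, bucket, prefix)


if __name__ == "__main__":
    demo = FakeMinio(["tenant-a/x.pdf", "tenant-a/y.pdf", "tenant-ab/z.pdf"])
    deleted = asyncio.run(delete_by_prefix(demo, "docs", "tenant-a/"))
    print(deleted, sorted(demo.names))
```

Prefix matching is a plain string match, so a caller passing "tenant-a" without the trailing slash would also match "tenant-ab/...". If the prefixes are derived from tenant identifiers, which this hunk does not show, ending them with "/" avoids that overlap.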