feat(consent-tester): Phase E — self-improving CMP library

cmp_discovery_log.py:
- sqlite log at /data/cmp_discoveries.db: every LLM-discovered CMP
  pattern recorded with domain, strategy, value, sample text
- Auto-promote (user-chosen "fully automatic" mode): when the LLM returns
  strategy=url AND the extracted text is >= 800 words, write a new module
  /data/auto_cmp/auto_<slug>.py with a derived regex matcher + reconstruct
- record_discovery() called from dsi_discovery._try_llm_cascade on success
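
A minimal sketch of the log-then-promote decision described above, not the
committed implementation. The table layout, the `conn` parameter, the
`slugify` and `is_promotable` helpers, and the 500-character sample cut-off
are assumptions for illustration; only the keyword names and the
strategy=url / 800-word rule come from this commit.

```python
import re
import sqlite3

MIN_WORDS_FOR_PROMOTION = 800  # threshold stated in the commit message


def slugify(domain: str) -> str:
    """Turn a domain into a filesystem-safe slug (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "_", domain.lower()).strip("_")


def is_promotable(strategy: str, extracted_text: str) -> bool:
    """Eligible only when strategy is 'url' and the text has >= 800 words."""
    return strategy == "url" and len(extracted_text.split()) >= MIN_WORDS_FOR_PROMOTION


def record_discovery(conn, domain, llm_used, strategy, value, extracted_text):
    """Log one discovery; return the auto-module file name if eligible, else None."""
    # Assumed schema: the real table in cmp_discoveries.db may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS discoveries ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, domain TEXT, llm_used TEXT, "
        "strategy TEXT, value TEXT, sample TEXT)"
    )
    conn.execute(
        "INSERT INTO discoveries (domain, llm_used, strategy, value, sample) "
        "VALUES (?, ?, ?, ?, ?)",
        (domain, llm_used, strategy, value, extracted_text[:500]),
    )
    conn.commit()
    if is_promotable(strategy, extracted_text):
        return f"auto_{slugify(domain)}.py"
    return None
```

Every discovery is logged regardless of eligibility, so the log stays useful
for review even when nothing is promoted.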

cmp_library/_registry.py:
- Loads both hand-written modules from services/cmp_library/ AND
  auto-promoted modules from /data/auto_cmp/ (CMP_AUTO_DIR env)
- Auto modules use importlib.util.spec_from_file_location, no package
  install needed; restart consent-tester to pick up new ones
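
The file-based loading could look roughly like the sketch below. The function
name `load_auto_modules`, the returned dict shape, and the choice to skip
modules that fail to import are assumptions; the CMP_AUTO_DIR variable, the
/data/auto_cmp default, and the use of spec_from_file_location are taken from
this commit.

```python
import importlib.util
import os
from pathlib import Path


def load_auto_modules(auto_dir=None):
    """Import every auto_*.py in auto_dir and return {module_name: module}."""
    directory = Path(auto_dir or os.environ.get("CMP_AUTO_DIR", "/data/auto_cmp"))
    modules = {}
    if not directory.is_dir():
        return modules
    for path in sorted(directory.glob("auto_*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception:
            # Assumption: a broken generated module is skipped rather than
            # allowed to take the whole registry down.
            continue
        modules[path.stem] = module
    return modules
```

Because the modules are loaded once at import time in this sketch, new files
only become visible after a restart, which matches the note above.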

dsi_discovery.py:
- _try_llm_cascade now calls record_discovery() on every successful
  LLM analysis (cached AND fresh)

main.py:
- GET /cmp-discoveries — admin endpoint listing all logged discoveries
- DELETE /cmp-discoveries/{id} — rollback (unlinks auto_*.py)
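
A framework-agnostic sketch of what the rollback handler might do. The
`module_file` column, the function name `rollback_discovery`, and the decision
to delete the log row along with the file are assumptions; the commit only
states that rollback unlinks the auto_*.py module.

```python
import sqlite3
from pathlib import Path


def rollback_discovery(conn, discovery_id, auto_dir):
    """Delete one logged discovery and unlink its auto module, if any.

    Returns True when a row was deleted, False when the id is unknown.
    """
    row = conn.execute(
        "SELECT module_file FROM discoveries WHERE id = ?", (discovery_id,)
    ).fetchone()
    if row is None:
        return False
    module_file = row[0]
    if module_file:
        # Use only the base name so a stored value cannot escape auto_dir.
        target = Path(auto_dir) / Path(module_file).name
        target.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    conn.execute("DELETE FROM discoveries WHERE id = ?", (discovery_id,))
    conn.commit()
    return True
```

An HTTP handler would typically map the False return to a 404 response.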

This closes the self-improving loop: first encounter with a new CMP fires
the LLM (cost) → discovery is auto-promoted → all future runs against the
same vendor pattern hit Phase B (Named CMP) at <50ms with no LLM call.
Benjamin Admin
2026-05-16 23:09:23 +02:00
parent 2400aa6a9e
commit 5f2da1de88
4 changed files with 298 additions and 12 deletions
@@ -836,6 +836,18 @@ async def _try_llm_cascade(
    if wc >= 300:
        await cache_set(netloc, hint)
        logger.info("LLM cached for %s (%s): %d words", netloc, hint.get("_tier"), wc)
    # Phase E: log discovery + (if eligible) auto-promote to named CMP
    try:
        from services.cmp_discovery_log import record_discovery
        record_discovery(
            domain=netloc,
            llm_used=hint.get("_tier", "unknown"),
            strategy=hint.get("strategy", ""),
            value=hint.get("value", ""),
            extracted_text=text,
        )
    except Exception as e:
        logger.debug("CMP discovery log failed: %s", e)
    return text, wc