6 Commits

Author SHA1 Message Date
Benjamin Admin
148c7ba3af feat(qa): recital detection, review split, duplicate comparison
Some checks failed
CI/CD / go-lint (push) Has been skipped
CI/CD / python-lint (push) Has been skipped
CI/CD / nodejs-lint (push) Has been skipped
CI/CD / test-go-ai-compliance (push) Failing after 42s
CI/CD / test-python-backend-compliance (push) Successful in 34s
CI/CD / test-python-document-crawler (push) Successful in 21s
CI/CD / test-python-dsms-gateway (push) Successful in 20s
CI/CD / validate-canonical-controls (push) Successful in 12s
CI/CD / Deploy (push) Has been skipped
Add _detect_recital() to QA pipeline — flags controls where
source_original_text contains Erwägungsgrund markers instead of
article text (28% of controls with source text affected).

- Recital detection via regex + phrase matching in QA validation
- 10 new tests (TestRecitalDetection), 81 total
- ReviewCompare component for side-by-side duplicate comparison
- Review mode split: Duplikat-Verdacht vs Rule-3-ohne-Anchor tabs
- MkDocs: recital detection documentation
- Detection script for bulk analysis (scripts/find_recital_controls.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:20:02 +01:00
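The recital check described in this commit can be sketched as a small standalone function. This is a simplified illustration, not the committed `_detect_recital` (whose full version appears in the diff below): the phrase list is truncated and the threshold logic is condensed.

```python
import re
from typing import Optional

# Simplified sketch of the recital heuristic: standalone numbers like
# "(126)\n" and typical Erwägungsgrund phrasing mark recital text.
RECITAL_RE = re.compile(r'\((\d{1,3})\)\s*\n')
RECITAL_PHRASES = ["erwägungsgrund", "daher sollte", "es sollte daher"]

def detect_recital(text: str) -> Optional[dict]:
    """Return detection details if text looks like a recital, else None."""
    if not text:
        return None
    numbers = RECITAL_RE.findall(text)
    hits = [p for p in RECITAL_PHRASES if p in text.lower()]
    # A suspect needs either a recital number or at least two typical phrases.
    if not numbers and len(hits) < 2:
        return None
    return {"recital_suspect": True, "recital_numbers": numbers, "recital_phrases": hits}
```

Controls flagged this way are routed to `needs_review` rather than dropped, so a human can confirm the source text really is a recital.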
Benjamin Admin
a9e0869205 feat(pipeline): pipeline_version v2, migration 062, docs + 71 tests
- Add PIPELINE_VERSION=2 constant and pipeline_version column to
  canonical_controls and canonical_processed_chunks (migration 062)
- Anthropic API decides chunk relevance via null-returns (skip_prefilter)
- Annex/appendix chunks explicitly protected in prompts
- Fix 6 failing tests (CRYP domain, _process_batch tuple return)
- Add TestPipelineVersion + TestRegulationFilter test classes (10 new tests)
- Add MkDocs page: control-generator-pipeline.md (541 lines)
- Update canonical-control-library.md with v2 pipeline diagram
- Update testing.md with 71-test breakdown table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 17:31:11 +01:00
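The versioning scheme is simple: every persisted control and processed-chunk row is stamped with the pipeline version that produced it, so v1 and v2 output can be distinguished later. A minimal sketch (the helper is hypothetical; the constant and column name follow the commit):

```python
# Constant incremented whenever generation rules change materially.
PIPELINE_VERSION = 2

def stamp_row(row: dict) -> dict:
    """Attach the current pipeline version to a row before insert (hypothetical helper)."""
    return {**row, "pipeline_version": PIPELINE_VERSION}
```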
Benjamin Admin
653aad57e3 Let Anthropic API decide chunk relevance instead of local prefilter
Updated both structure_batch and reformulate_batch prompts to return null
for chunks without actionable requirements (definitions, TOCs, scope-only).
Added an explicit instruction to always process annexes/appendices, as they
often contain concrete technical requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:44:01 +01:00
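With null-returns, the batch response is a positional array where `null` means "no actionable requirement in this chunk". Mapping results back to chunk slots can be sketched like this (simplified; the real code maps into `GeneratedControl` objects, and `map_results` is an illustrative name):

```python
from typing import Optional

def map_results(results: list[Optional[dict]], n_chunks: int) -> list[Optional[dict]]:
    """Map batch results back to chunk positions; None entries mean the
    API judged the chunk to have no actionable requirement."""
    controls: list[Optional[dict]] = [None] * n_chunks
    for pos, data in enumerate(results):
        if data is None:  # chunk without actionable requirement: skip
            continue
        idx = data.get("chunk_index")
        # Prefer the 1-based chunk_index if present, else fall back to position
        slot = int(idx) - 1 if idx is not None else pos
        if 0 <= slot < n_chunks:
            controls[slot] = data
    return controls
```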
Benjamin Admin
a7f7e57dd7 Add skip_prefilter option to control generator
Local LLM prefilter (llama3.2 3B) was incorrectly skipping annex chunks
that contain concrete requirements. Added skip_prefilter flag to bypass
the local pre-filter and send all chunks directly to Anthropic API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:30:57 +01:00
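The flag's effect is a straightforward bypass: when set, the local pre-filter stage is skipped entirely and every chunk goes to the API. A sketch with hypothetical names (the actual pipeline stage is more involved):

```python
from typing import Callable

def select_chunks(chunks: list[str], skip_prefilter: bool,
                  prefilter: Callable[[str], bool]) -> list[str]:
    """With skip_prefilter set, bypass the local LLM pre-filter entirely
    and send every chunk to the API (illustrative signatures)."""
    if skip_prefilter:
        return chunks
    return [c for c in chunks if prefilter(c)]
```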
Benjamin Admin
567e82ddf5 Fix stale DB session after long embedding pre-load
The embedding pre-load for 4998 existing controls takes ~16 minutes,
causing the SQLAlchemy session to become invalid. Added rollback after
pre-load completes to reset the session before subsequent DB operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:34:44 +01:00
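The fix follows a common SQLAlchemy pattern: after a long-running operation that may outlive the connection's usable lifetime, roll the session back to discard the stale transaction before the next query. A sketch with a stand-in session object (not the real SQLAlchemy class):

```python
class FakeSession:
    """Stand-in for a DB session whose transaction went stale."""
    def __init__(self):
        self.stale = True

    def rollback(self):
        self.stale = False  # discards the invalid transaction state

def reset_after_long_op(session) -> None:
    # Defensive reset: rollback is cheap and safe even if the session is fine.
    try:
        session.rollback()
    except Exception:
        pass

s = FakeSession()
reset_after_long_op(s)
```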
Benjamin Admin
36ef34169a Fix regulation_filter bypass for chunks without regulation_code
Chunks without a regulation_code were silently passing through the filter
in _scan_rag(), causing unrelated documents (e.g. Data Act, legal templates)
to be included in filtered generation jobs. Now chunks without reg_code are
skipped when regulation_filter is active.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:38:25 +01:00
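The corrected filter semantics can be sketched as a pure predicate (illustrative function name; the real check lives inline in `_scan_rag`):

```python
from typing import Optional

def passes_filter(reg_code: Optional[str],
                  regulation_filter: Optional[list[str]]) -> bool:
    """Chunks must match a configured prefix. Chunks WITHOUT a
    regulation_code are now skipped when a filter is active
    (previously they slipped through)."""
    if not regulation_filter:
        return True   # no filter configured: everything passes
    if not reg_code:
        return False  # the fixed behaviour: no code means no match
    code = reg_code.lower()
    return any(code.startswith(f.lower()) for f in regulation_filter)
```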
11 changed files with 1560 additions and 85 deletions

View File

@@ -0,0 +1,264 @@
'use client'
import { useState, useEffect } from 'react'
import {
ArrowLeft, CheckCircle2, Trash2, Pencil, SkipForward,
ChevronLeft, Scale, BookOpen, ExternalLink, AlertTriangle,
FileText, Clock,
} from 'lucide-react'
import {
CanonicalControl, BACKEND_URL,
SeverityBadge, StateBadge, LicenseRuleBadge, CategoryBadge, TargetAudienceBadge,
} from './helpers'
// =============================================================================
// Compact Control Panel (used on both sides of the comparison)
// =============================================================================
function ControlPanel({ ctrl, label, highlight }: { ctrl: CanonicalControl; label: string; highlight?: boolean }) {
return (
<div className={`flex flex-col h-full overflow-y-auto ${highlight ? 'bg-yellow-50' : 'bg-white'}`}>
{/* Panel Header */}
<div className={`sticky top-0 z-10 px-4 py-3 border-b ${highlight ? 'bg-yellow-100 border-yellow-200' : 'bg-gray-50 border-gray-200'}`}>
<div className="text-xs font-semibold uppercase tracking-wide text-gray-500 mb-1">{label}</div>
<div className="flex items-center gap-2 flex-wrap">
<span className="text-sm font-mono text-purple-600 bg-purple-50 px-2 py-0.5 rounded">{ctrl.control_id}</span>
<SeverityBadge severity={ctrl.severity} />
<StateBadge state={ctrl.release_state} />
<LicenseRuleBadge rule={ctrl.license_rule} />
<CategoryBadge category={ctrl.category} />
<TargetAudienceBadge audience={ctrl.target_audience} />
</div>
<h3 className="text-sm font-semibold text-gray-900 mt-1 leading-snug">{ctrl.title}</h3>
</div>
{/* Panel Content */}
<div className="p-4 space-y-4 text-sm">
{/* Objective */}
<section>
<h4 className="text-xs font-semibold text-gray-500 uppercase tracking-wide mb-1">Ziel</h4>
<p className="text-gray-700 leading-relaxed">{ctrl.objective}</p>
</section>
{/* Rationale */}
{ctrl.rationale && (
<section>
<h4 className="text-xs font-semibold text-gray-500 uppercase tracking-wide mb-1">Begruendung</h4>
<p className="text-gray-700 leading-relaxed">{ctrl.rationale}</p>
</section>
)}
{/* Source Citation (Rule 1+2) */}
{ctrl.source_citation && (
<section className="bg-blue-50 border border-blue-200 rounded-lg p-3">
<div className="flex items-center gap-1.5 mb-1">
<Scale className="w-3.5 h-3.5 text-blue-600" />
<span className="text-xs font-semibold text-blue-900">Gesetzliche Grundlage</span>
</div>
{ctrl.source_citation.source && (
<p className="text-xs text-blue-800">
{ctrl.source_citation.source}
{ctrl.source_citation.article && ` ${ctrl.source_citation.article}`}
{ctrl.source_citation.paragraph && ` ${ctrl.source_citation.paragraph}`}
</p>
)}
</section>
)}
{/* Requirements */}
{ctrl.requirements.length > 0 && (
<section>
<h4 className="text-xs font-semibold text-gray-500 uppercase tracking-wide mb-1">Anforderungen</h4>
<ol className="list-decimal list-inside space-y-1">
{ctrl.requirements.map((r, i) => (
<li key={i} className="text-gray-700 text-xs leading-relaxed">{r}</li>
))}
</ol>
</section>
)}
{/* Test Procedure */}
{ctrl.test_procedure.length > 0 && (
<section>
<h4 className="text-xs font-semibold text-gray-500 uppercase tracking-wide mb-1">Pruefverfahren</h4>
<ol className="list-decimal list-inside space-y-1">
{ctrl.test_procedure.map((s, i) => (
<li key={i} className="text-gray-700 text-xs leading-relaxed">{s}</li>
))}
</ol>
</section>
)}
{/* Open Anchors */}
{ctrl.open_anchors.length > 0 && (
<section className="bg-green-50 border border-green-200 rounded-lg p-3">
<div className="flex items-center gap-1.5 mb-2">
<BookOpen className="w-3.5 h-3.5 text-green-700" />
<span className="text-xs font-semibold text-green-900">Referenzen ({ctrl.open_anchors.length})</span>
</div>
<div className="space-y-1">
{ctrl.open_anchors.map((a, i) => (
<div key={i} className="flex items-center gap-1.5 text-xs">
<ExternalLink className="w-3 h-3 text-green-600 flex-shrink-0" />
<span className="font-medium text-green-800">{a.framework}</span>
<span className="text-green-700">{a.ref}</span>
</div>
))}
</div>
</section>
)}
{/* Tags */}
{ctrl.tags.length > 0 && (
<div className="flex items-center gap-1 flex-wrap">
{ctrl.tags.map(t => (
<span key={t} className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded text-xs">{t}</span>
))}
</div>
)}
</div>
</div>
)
}
// =============================================================================
// ReviewCompare — Side-by-Side Duplicate Comparison
// =============================================================================
interface ReviewCompareProps {
ctrl: CanonicalControl
onBack: () => void
onReview: (controlId: string, action: string) => void
onEdit: () => void
reviewIndex: number
reviewTotal: number
onReviewPrev: () => void
onReviewNext: () => void
}
export function ReviewCompare({
ctrl,
onBack,
onReview,
onEdit,
reviewIndex,
reviewTotal,
onReviewPrev,
onReviewNext,
}: ReviewCompareProps) {
const [suspectedDuplicate, setSuspectedDuplicate] = useState<CanonicalControl | null>(null)
const [loading, setLoading] = useState(false)
const [similarity, setSimilarity] = useState<number | null>(null)
// Load the suspected duplicate from generation_metadata.similar_controls
useEffect(() => {
const loadDuplicate = async () => {
const similarControls = ctrl.generation_metadata?.similar_controls as Array<{ control_id: string; title: string; similarity: number }> | undefined
if (!similarControls || similarControls.length === 0) {
setSuspectedDuplicate(null)
setSimilarity(null)
return
}
const suspect = similarControls[0]
setSimilarity(suspect.similarity)
setLoading(true)
try {
const res = await fetch(`${BACKEND_URL}?endpoint=control&id=${encodeURIComponent(suspect.control_id)}`)
if (res.ok) {
const data = await res.json()
setSuspectedDuplicate(data)
} else {
setSuspectedDuplicate(null)
}
} catch {
setSuspectedDuplicate(null)
} finally {
setLoading(false)
}
}
loadDuplicate()
}, [ctrl.control_id, ctrl.generation_metadata])
return (
<div className="flex flex-col h-full">
{/* Header */}
<div className="border-b border-gray-200 bg-white px-6 py-3 flex items-center justify-between">
<div className="flex items-center gap-3">
<button onClick={onBack} className="text-gray-400 hover:text-gray-600">
<ArrowLeft className="w-5 h-5" />
</button>
<div>
<div className="flex items-center gap-2">
<AlertTriangle className="w-4 h-4 text-amber-500" />
<span className="text-sm font-semibold text-gray-900">Duplikat-Vergleich</span>
{similarity !== null && (
<span className="text-xs font-medium text-amber-600 bg-amber-50 px-2 py-0.5 rounded-full">
{(similarity * 100).toFixed(1)}% Aehnlichkeit
</span>
)}
</div>
</div>
</div>
<div className="flex items-center gap-2">
{/* Navigation */}
<div className="flex items-center gap-1 mr-3">
<button onClick={onReviewPrev} disabled={reviewIndex === 0} className="p-1 text-gray-400 hover:text-gray-600 disabled:opacity-30">
<ChevronLeft className="w-4 h-4" />
</button>
<span className="text-xs text-gray-500 font-medium">{reviewIndex + 1} / {reviewTotal}</span>
<button onClick={onReviewNext} disabled={reviewIndex >= reviewTotal - 1} className="p-1 text-gray-400 hover:text-gray-600 disabled:opacity-30">
<SkipForward className="w-4 h-4" />
</button>
</div>
{/* Actions */}
<button
onClick={() => onReview(ctrl.control_id, 'approve')}
className="px-3 py-1.5 text-sm text-white bg-green-600 rounded-lg hover:bg-green-700"
>
<CheckCircle2 className="w-3.5 h-3.5 inline mr-1" />Behalten
</button>
<button
onClick={() => onReview(ctrl.control_id, 'reject')}
className="px-3 py-1.5 text-sm text-white bg-red-600 rounded-lg hover:bg-red-700"
>
<Trash2 className="w-3.5 h-3.5 inline mr-1" />Duplikat
</button>
<button
onClick={onEdit}
className="px-3 py-1.5 text-sm text-gray-600 border border-gray-300 rounded-lg hover:bg-gray-50"
>
<Pencil className="w-3.5 h-3.5 inline mr-1" />Bearbeiten
</button>
</div>
</div>
{/* Side-by-Side Panels */}
<div className="flex-1 flex overflow-hidden">
{/* Left: Control to review */}
<div className="w-1/2 border-r border-gray-200 overflow-y-auto">
<ControlPanel ctrl={ctrl} label="Zu pruefen" highlight />
</div>
{/* Right: Suspected duplicate */}
<div className="w-1/2 overflow-y-auto">
{loading ? (
<div className="flex items-center justify-center h-full">
<div className="animate-spin rounded-full h-6 w-6 border-2 border-purple-600 border-t-transparent" />
</div>
) : suspectedDuplicate ? (
<ControlPanel ctrl={suspectedDuplicate} label="Bestehendes Control (Verdacht)" />
) : (
<div className="flex items-center justify-center h-full text-gray-400 text-sm">
Kein Duplikat-Kandidat gefunden
</div>
)}
</div>
</div>
</div>
)
}

View File

@@ -14,6 +14,7 @@ import {
} from './components/helpers'
import { ControlForm } from './components/ControlForm'
import { ControlDetail } from './components/ControlDetail'
import { ReviewCompare } from './components/ReviewCompare'
import { GeneratorModal } from './components/GeneratorModal'
// =============================================================================
@@ -71,6 +72,9 @@ export default function ControlLibraryPage() {
const [reviewIndex, setReviewIndex] = useState(0)
const [reviewItems, setReviewItems] = useState<CanonicalControl[]>([])
const [reviewCount, setReviewCount] = useState(0)
const [reviewTab, setReviewTab] = useState<'duplicates' | 'rule3'>('duplicates')
const [reviewDuplicates, setReviewDuplicates] = useState<CanonicalControl[]>([])
const [reviewRule3, setReviewRule3] = useState<CanonicalControl[]>([])
// Debounce search
const searchTimer = useRef<ReturnType<typeof setTimeout> | null>(null)
@@ -303,20 +307,47 @@ export default function ControlLibraryPage() {
const enterReviewMode = async () => {
// Load review items from backend
try {
const res = await fetch(`${BACKEND_URL}?endpoint=controls&release_state=needs_review&limit=1000`)
if (res.ok) {
const items: CanonicalControl[] = await res.json()
if (items.length > 0) {
// Split into duplicate suspects vs rule 3 without anchor
const dupes = items.filter(c =>
c.generation_metadata?.similar_controls &&
Array.isArray(c.generation_metadata.similar_controls) &&
(c.generation_metadata.similar_controls as unknown[]).length > 0
)
const rule3 = items.filter(c =>
!c.generation_metadata?.similar_controls ||
!Array.isArray(c.generation_metadata.similar_controls) ||
(c.generation_metadata.similar_controls as unknown[]).length === 0
)
setReviewDuplicates(dupes)
setReviewRule3(rule3)
// Start with duplicates tab if any, otherwise rule3
const startTab = dupes.length > 0 ? 'duplicates' : 'rule3'
const startItems = startTab === 'duplicates' ? dupes : rule3
setReviewTab(startTab)
setReviewItems(startItems)
setReviewMode(true)
setReviewIndex(0)
setSelectedControl(startItems[0])
setMode('detail')
}
}
} catch { /* ignore */ }
}
const switchReviewTab = (tab: 'duplicates' | 'rule3') => {
const items = tab === 'duplicates' ? reviewDuplicates : reviewRule3
setReviewTab(tab)
setReviewItems(items)
setReviewIndex(0)
if (items.length > 0) {
setSelectedControl(items[0])
}
}
// Loading
if (loading && controls.length === 0) {
return (
@@ -363,7 +394,66 @@ export default function ControlLibraryPage() {
// DETAIL MODE
if (mode === 'detail' && selectedControl) {
const isDuplicateReview = reviewMode && reviewTab === 'duplicates'
// Review tab bar (shown above the detail/compare view in review mode)
const reviewTabBar = reviewMode ? (
<div className="border-b border-gray-200 bg-white px-6 py-2 flex items-center gap-4">
<button
onClick={() => switchReviewTab('duplicates')}
className={`px-3 py-1.5 text-sm rounded-lg font-medium ${
reviewTab === 'duplicates'
? 'bg-amber-100 text-amber-800 border border-amber-300'
: 'text-gray-500 hover:text-gray-700 hover:bg-gray-100'
}`}
>
Duplikat-Verdacht ({reviewDuplicates.length})
</button>
<button
onClick={() => switchReviewTab('rule3')}
className={`px-3 py-1.5 text-sm rounded-lg font-medium ${
reviewTab === 'rule3'
? 'bg-purple-100 text-purple-800 border border-purple-300'
: 'text-gray-500 hover:text-gray-700 hover:bg-gray-100'
}`}
>
Rule 3 ohne Anchor ({reviewRule3.length})
</button>
</div>
) : null
if (isDuplicateReview) {
return (
<div className="flex flex-col h-full">
{reviewTabBar}
<div className="flex-1 overflow-hidden">
<ReviewCompare
ctrl={selectedControl}
onBack={() => { setMode('list'); setSelectedControl(null); setReviewMode(false) }}
onReview={handleReview}
onEdit={() => setMode('edit')}
reviewIndex={reviewIndex}
reviewTotal={reviewItems.length}
onReviewPrev={() => {
const idx = Math.max(0, reviewIndex - 1)
setReviewIndex(idx)
setSelectedControl(reviewItems[idx])
}}
onReviewNext={() => {
const idx = Math.min(reviewItems.length - 1, reviewIndex + 1)
setReviewIndex(idx)
setSelectedControl(reviewItems[idx])
}}
/>
</div>
</div>
)
}
return (
<div className="flex flex-col h-full">
{reviewTabBar}
<div className="flex-1 overflow-hidden">
<ControlDetail
ctrl={selectedControl}
onBack={() => { setMode('list'); setSelectedControl(null); setReviewMode(false) }}
@@ -385,6 +475,8 @@ export default function ControlLibraryPage() {
setSelectedControl(reviewItems[idx])
}}
/>
</div>
</div>
)
}

View File

@@ -54,6 +54,7 @@ class GenerateRequest(BaseModel):
skip_web_search: bool = False
dry_run: bool = False
regulation_filter: Optional[List[str]] = None # Only process these regulation_code prefixes
skip_prefilter: bool = False # Skip local LLM pre-filter, send all chunks to API
class GenerateResponse(BaseModel):
@@ -146,6 +147,7 @@ async def start_generation(req: GenerateRequest):
skip_web_search=req.skip_web_search,
dry_run=req.dry_run,
regulation_filter=req.regulation_filter,
skip_prefilter=req.skip_prefilter,
)
if req.dry_run:

View File

@@ -53,6 +53,11 @@ LLM_TIMEOUT = float(os.getenv("CONTROL_GEN_LLM_TIMEOUT", "180"))
HARMONIZATION_THRESHOLD = 0.85 # Cosine similarity above this = duplicate
# Pipeline version — increment when generation rules change materially.
# v1: Original (local LLM prefilter, old prompt)
# v2: Anthropic decides relevance, null for non-requirement chunks, annexes protected
PIPELINE_VERSION = 2
ALL_COLLECTIONS = [
"bp_compliance_ce",
"bp_compliance_gesetze",
@@ -316,6 +321,62 @@ VALID_CATEGORIES = set(CATEGORY_KEYWORDS.keys())
VALID_DOMAINS = {"AUTH", "CRYP", "NET", "DATA", "LOG", "ACC", "SEC", "INC",
"AI", "COMP", "GOV", "LAB", "FIN", "TRD", "ENV", "HLT"}
# ---------------------------------------------------------------------------
# Recital (Erwägungsgrund) detection in source text
# ---------------------------------------------------------------------------
# Pattern: standalone recital number like (125)\n or (126) at line start
_RECITAL_RE = re.compile(r'\((\d{1,3})\)\s*\n')
# Recital-typical phrasing (German EU law Erwägungsgründe)
_RECITAL_PHRASES = [
"in erwägung nachstehender gründe",
"erwägungsgrund",
"in anbetracht",
"daher sollte",
"aus diesem grund",
"es ist daher",
"folglich sollte",
"es sollte daher",
"in diesem zusammenhang",
]
def _detect_recital(text: str) -> Optional[dict]:
"""Detect if source text is a recital (Erwägungsgrund) rather than an article.
Returns a dict with detection details if recital markers are found,
or None if the text appears to be genuine article text.
Detection criteria:
1. Standalone recital numbers like (126)\\n in the text
2. Recital-typical phrasing ("daher sollte", "erwägungsgrund", etc.)
"""
if not text:
return None
# Check 1: Recital number markers
recital_matches = _RECITAL_RE.findall(text)
# Check 2: Recital phrasing
text_lower = text.lower()
phrase_hits = [p for p in _RECITAL_PHRASES if p in text_lower]
if not recital_matches and not phrase_hits:
return None
# Require at least recital numbers OR >=2 phrase hits to be a suspect
if not recital_matches and len(phrase_hits) < 2:
return None
return {
"recital_suspect": True,
"recital_numbers": recital_matches[:10],
"recital_phrases": phrase_hits[:5],
"detection_method": "regex+phrases" if recital_matches and phrase_hits
else "regex" if recital_matches else "phrases",
}
CATEGORY_LIST_STR = ", ".join(sorted(VALID_CATEGORIES))
VERIFICATION_KEYWORDS = {
@@ -385,6 +446,7 @@ class GeneratorConfig(BaseModel):
dry_run: bool = False
existing_job_id: Optional[str] = None # If set, reuse this job instead of creating a new one
regulation_filter: Optional[List[str]] = None # Only process chunks matching these regulation_code prefixes
skip_prefilter: bool = False # If True, skip local LLM pre-filter (send all chunks to API)
@dataclass
@@ -806,7 +868,9 @@ class ControlGeneratorPipeline:
or payload.get("source_code", ""))
# Filter by regulation_code if configured
if config.regulation_filter:
if not reg_code:
continue # Skip chunks without regulation code
code_lower = reg_code.lower()
if not any(code_lower.startswith(f.lower()) for f in config.regulation_filter):
continue
@@ -852,6 +916,12 @@ class ControlGeneratorPipeline:
collection, collection_total, collection_new,
)
if config.regulation_filter:
logger.info(
"RAG scroll complete: %d total unique seen, %d passed regulation_filter %s",
len(seen_hashes), len(all_results), config.regulation_filter,
)
else:
logger.info(
"RAG scroll complete: %d total unique seen, %d new unprocessed to process",
len(seen_hashes), len(all_results),
@@ -1093,12 +1163,15 @@ Gib JSON zurück mit diesen Feldern:
Du DARFST den Originaltext verwenden (Quellen sind jeweils angegeben).
{doc_context}
WICHTIG:
- Pruefe JEDEN Chunk: Enthaelt er eine konkrete Pflicht, Anforderung oder Massnahme?
- Wenn JA: Erstelle ein vollstaendiges, eigenstaendiges Control mit praxisorientierter Formulierung.
- Wenn NEIN (reines Inhaltsverzeichnis, Begriffsbestimmung ohne Pflicht, Geltungsbereich ohne Anforderung, reine Verweiskette): Gib null fuer diesen Chunk zurueck.
- BEACHTE: Anhaenge/Annexe enthalten oft KONKRETE technische Anforderungen — diese MUESSEN als Control erfasst werden!
- Jedes Control muss eigenstaendig und vollstaendig sein — nicht auf andere Controls verweisen.
- Qualitaet ist wichtiger als Geschwindigkeit.
- Antworte IMMER auf Deutsch.
Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne Anforderung gib null zurueck. Fuer Chunks mit Anforderung ein Objekt mit diesen Feldern:
- chunk_index: 1-basierter Index des Chunks (1, 2, 3, ...)
- title: Kurzer praegnanter Titel auf Deutsch (max 100 Zeichen)
- objective: Was soll erreicht werden? (1-3 Saetze, Deutsch)
@@ -1122,7 +1195,12 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Objekten. Jedes Objekt hat di
# Map results back to chunks by chunk_index (or by position if no index)
controls: list[Optional[GeneratedControl]] = [None] * len(chunks)
skipped_by_api = 0
for pos, data in enumerate(results):
# API returns null for chunks without actionable requirements
if data is None:
skipped_by_api += 1
continue
# Try chunk_index first, fall back to position
idx = data.get("chunk_index")
if idx is not None:
@@ -1186,15 +1264,19 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Objekten. Jedes Objekt hat di
f"Text (nur zur Analyse, NICHT kopieren, NICHT referenzieren):\n{chunk.text[:1500]}" f"Text (nur zur Analyse, NICHT kopieren, NICHT referenzieren):\n{chunk.text[:1500]}"
) )
joined = "\n\n".join(chunk_entries) joined = "\n\n".join(chunk_entries)
prompt = f"""Analysiere die folgenden {len(chunks)} Pruefaspekte und formuliere fuer JEDEN ein EIGENSTAENDIGES Security Control. prompt = f"""Analysiere die folgenden {len(chunks)} Pruefaspekte und formuliere fuer JEDEN mit konkreter Anforderung ein EIGENSTAENDIGES Security Control.
KOPIERE KEINE Saetze. Verwende eigene Begriffe und Struktur. KOPIERE KEINE Saetze. Verwende eigene Begriffe und Struktur.
NENNE NICHT die Quellen. Keine proprietaeren Bezeichner (kein O.Auth_*, TR-03161, BSI-TR etc.). NENNE NICHT die Quellen. Keine proprietaeren Bezeichner (kein O.Auth_*, TR-03161, BSI-TR etc.).
WICHTIG: WICHTIG:
- Pruefe JEDEN Aspekt: Enthaelt er eine konkrete Pflicht, Anforderung oder Massnahme?
- Wenn JA: Erstelle ein vollstaendiges, eigenstaendiges Control.
- Wenn NEIN (reines Inhaltsverzeichnis, Begriffsbestimmung ohne Pflicht, Geltungsbereich ohne Anforderung): Gib null fuer diesen Aspekt zurueck.
- BEACHTE: Anhaenge/Annexe enthalten oft KONKRETE technische Anforderungen — diese MUESSEN erfasst werden!
- Jedes Control muss eigenstaendig und vollstaendig sein — nicht auf andere Controls verweisen.
- Qualitaet ist wichtiger als Geschwindigkeit.
Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Aspekte ohne Anforderung gib null zurueck. Fuer Aspekte mit Anforderung ein Objekt mit diesen Feldern:
- chunk_index: 1-basierter Index des Aspekts (1, 2, 3, ...)
- title: Kurzer eigenstaendiger Titel (max 100 Zeichen)
- objective: Eigenstaendige Formulierung des Ziels (1-3 Saetze)
@@ -1216,6 +1298,8 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Objekten. Jedes Objekt hat di
controls: list[Optional[GeneratedControl]] = [None] * len(chunks)
for pos, data in enumerate(results):
if data is None:
continue
idx = data.get("chunk_index") idx = data.get("chunk_index")
if idx is not None: if idx is not None:
idx = int(idx) - 1 idx = int(idx) - 1
@@ -1383,6 +1467,12 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Objekten. Jedes Objekt hat di
loaded = sum(1 for emb in embeddings if emb)
logger.info("Pre-loaded %d/%d embeddings", loaded, len(texts))
# Reset DB session after long-running embedding operation to avoid stale connections
try:
self.db.rollback()
except Exception:
pass
def _load_existing_controls(self) -> list[dict]:
"""Load existing controls from DB (cached per pipeline run)."""
if self._existing_controls is not None:
@@ -1486,9 +1576,23 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Objekten. Jedes Objekt hat di
) -> tuple[GeneratedControl, bool]:
"""Cross-validate category/domain using keyword detection + local LLM.
Also checks for recital (Erwägungsgrund) contamination in source text.
Returns (control, was_fixed). Only triggers Ollama QA when the LLM
classification disagrees with keyword detection — keeps it fast.
"""
# ── Recital detection ──────────────────────────────────────────
source_text = control.source_original_text or ""
recital_info = _detect_recital(source_text)
if recital_info:
control.generation_metadata["recital_suspect"] = True
control.generation_metadata["recital_detection"] = recital_info
control.release_state = "needs_review"
logger.warning(
"Recital suspect: '%s' — recitals %s detected in source text",
control.title[:40],
recital_info.get("recital_numbers", []),
)
kw_category = _detect_category(chunk_text) or _detect_category(control.objective)
kw_domain = _detect_domain(chunk_text)
llm_domain = control.generation_metadata.get("_effective_domain", "")
@@ -1634,7 +1738,7 @@ Kategorien: {CATEGORY_LIST_STR}"""
license_rule, source_original_text, source_citation,
customer_visible, generation_metadata,
verification_method, category, generation_strategy,
target_audience, pipeline_version
) VALUES (
:framework_id, :control_id, :title, :objective, :rationale,
:scope, :requirements, :test_procedure, :evidence,
@@ -1643,7 +1747,7 @@ Kategorien: {CATEGORY_LIST_STR}"""
:license_rule, :source_original_text, :source_citation,
:customer_visible, :generation_metadata,
:verification_method, :category, :generation_strategy,
:target_audience, :pipeline_version
)
ON CONFLICT (framework_id, control_id) DO NOTHING
RETURNING id
@@ -1673,6 +1777,7 @@ Kategorien: {CATEGORY_LIST_STR}"""
"category": control.category, "category": control.category,
"generation_strategy": control.generation_strategy, "generation_strategy": control.generation_strategy,
"target_audience": json.dumps(control.target_audience) if control.target_audience else None, "target_audience": json.dumps(control.target_audience) if control.target_audience else None,
"pipeline_version": PIPELINE_VERSION,
},
)
self.db.commit()
@@ -1699,11 +1804,13 @@ Kategorien: {CATEGORY_LIST_STR}"""
INSERT INTO canonical_processed_chunks (
chunk_hash, collection, regulation_code,
document_version, source_license, license_rule,
processing_path, generated_control_ids, job_id,
pipeline_version
) VALUES (
:hash, :collection, :regulation_code,
:doc_version, :license, :rule,
:path, :control_ids, CAST(:job_id AS uuid),
:pipeline_version
)
ON CONFLICT (chunk_hash, collection, document_version) DO NOTHING
"""),
@@ -1717,6 +1824,7 @@ Kategorien: {CATEGORY_LIST_STR}"""
"path": processing_path, "path": processing_path,
"control_ids": json.dumps(control_ids), "control_ids": json.dumps(control_ids),
"job_id": job_id, "job_id": job_id,
"pipeline_version": PIPELINE_VERSION,
},
)
self.db.commit()
@@ -1872,7 +1980,7 @@ Kategorien: {CATEGORY_LIST_STR}"""
self._update_job(job_id, result)
# Stage 1.5: Local LLM pre-filter — skip chunks without requirements
if not config.dry_run and not config.skip_prefilter:
is_relevant, prefilter_reason = await _prefilter_chunk(chunk.text)
if not is_relevant:
chunks_skipped_prefilter += 1

View File

@@ -0,0 +1,22 @@
-- Migration 062: Add pipeline_version to track which generation rules produced each control/chunk
--
-- v1 = Original pipeline (local LLM prefilter, old prompt without null-skip)
-- v2 = Improved pipeline (skip_prefilter, Anthropic decides relevance, annexes protected)
--
-- This allows identifying controls that may need reprocessing when pipeline rules change.
ALTER TABLE canonical_controls
ADD COLUMN IF NOT EXISTS pipeline_version smallint NOT NULL DEFAULT 1;
ALTER TABLE canonical_processed_chunks
ADD COLUMN IF NOT EXISTS pipeline_version smallint NOT NULL DEFAULT 1;
-- Index for efficient querying by version
CREATE INDEX IF NOT EXISTS idx_canonical_controls_pipeline_version
ON canonical_controls (pipeline_version);
CREATE INDEX IF NOT EXISTS idx_canonical_processed_chunks_pipeline_version
ON canonical_processed_chunks (pipeline_version);
COMMENT ON COLUMN canonical_controls.pipeline_version IS 'Generation pipeline version: 1=original (local prefilter), 2=improved (Anthropic decides relevance, annexes protected)';
COMMENT ON COLUMN canonical_processed_chunks.pipeline_version IS 'Pipeline version used when this chunk was processed';

View File

@@ -7,11 +7,14 @@ from unittest.mock import AsyncMock, MagicMock, patch
from compliance.services.control_generator import (
_classify_regulation,
_detect_domain,
_detect_recital,
_parse_llm_json,
_parse_llm_json_array,
GeneratorConfig,
GeneratedControl,
ControlGeneratorPipeline,
REGULATION_LICENSE_MAP,
PIPELINE_VERSION,
)
from compliance.services.anchor_finder import AnchorFinder, OpenAnchor
from compliance.services.rag_client import RAGSearchResult
@@ -91,7 +94,7 @@ class TestDomainDetection:
assert _detect_domain("Multi-factor authentication and password policy") == "AUTH"
def test_crypto_domain(self):
assert _detect_domain("TLS 1.3 encryption and certificate management") == "CRYP"
def test_network_domain(self):
assert _detect_domain("Firewall rules and network segmentation") == "NET"
@@ -807,7 +810,7 @@ class TestBatchProcessingLoop:
patch.object(pipeline, "_check_harmonization", new_callable=AsyncMock, return_value=[]), \
patch("compliance.services.anchor_finder.AnchorFinder", mock_finder_cls):
config = GeneratorConfig()
result, qa_count = await pipeline._process_batch(batch_items, config, "job-1")
mock_struct.assert_called_once()
mock_reform.assert_not_called()
@@ -839,7 +842,7 @@ class TestBatchProcessingLoop:
patch("compliance.services.control_generator.check_similarity", new_callable=AsyncMock) as mock_sim:
mock_sim.return_value = MagicMock(status="PASS", token_overlap=0.1, ngram_jaccard=0.1, lcs_ratio=0.1)
config = GeneratorConfig()
result, qa_count = await pipeline._process_batch(batch_items, config, "job-2")
mock_struct.assert_not_called()
mock_reform.assert_called_once()
@@ -885,7 +888,7 @@ class TestBatchProcessingLoop:
patch("compliance.services.control_generator.check_similarity", new_callable=AsyncMock) as mock_sim:
mock_sim.return_value = MagicMock(status="PASS", token_overlap=0.05, ngram_jaccard=0.05, lcs_ratio=0.05)
config = GeneratorConfig()
result, qa_count = await pipeline._process_batch(batch_items, config, "job-mixed")
# Both methods called
mock_struct.assert_called_once()
@@ -905,8 +908,9 @@ class TestBatchProcessingLoop:
pipeline._existing_controls = []
config = GeneratorConfig()
result, qa_count = await pipeline._process_batch([], config, "job-empty")
assert result == []
assert qa_count == 0
@pytest.mark.asyncio
async def test_reformulate_batch_too_close_flagged(self):
@@ -942,7 +946,7 @@ class TestBatchProcessingLoop:
patch("compliance.services.anchor_finder.AnchorFinder", mock_finder_cls), \
patch("compliance.services.control_generator.check_similarity", new_callable=AsyncMock, return_value=fail_report):
config = GeneratorConfig()
result, qa_count = await pipeline._process_batch(batch_items, config, "job-tooclose")
assert len(result) == 1
assert result[0].release_state == "too_close"
@@ -1022,6 +1026,54 @@ class TestRegulationFilter:
assert "nist_sp_800_218" in codes
assert "amlr" not in codes
@pytest.mark.asyncio
async def test_scan_rag_filters_out_empty_regulation_code(self):
"""Chunks without regulation_code must be skipped when filter is active."""
mock_db = MagicMock()
mock_db.execute.return_value = MagicMock()
mock_db.execute.return_value.__iter__ = MagicMock(return_value=iter([]))
qdrant_points = {
"result": {
"points": [
{"id": "1", "payload": {
"chunk_text": "OWASP ASVS requirement for input validation " * 5,
"regulation_code": "owasp_asvs",
}},
{"id": "2", "payload": {
"chunk_text": "Some template without regulation code at all " * 5,
# No regulation_id, regulation_code, source_id, or source_code
}},
{"id": "3", "payload": {
"chunk_text": "Another chunk with empty regulation code value " * 5,
"regulation_code": "",
}},
],
"next_page_offset": None,
}
}
with patch("compliance.services.control_generator.httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client_cls.return_value.__aenter__ = AsyncMock(return_value=mock_client)
mock_client_cls.return_value.__aexit__ = AsyncMock(return_value=False)
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = qdrant_points
mock_client.post.return_value = mock_resp
pipeline = ControlGeneratorPipeline(db=mock_db, rag_client=MagicMock())
config = GeneratorConfig(
collections=["bp_compliance_ce"],
regulation_filter=["owasp_"],
)
results = await pipeline._scan_rag(config)
# Only the owasp chunk should pass — empty reg_code chunks are filtered out
assert len(results) == 1
assert results[0].regulation_code == "owasp_asvs"
@pytest.mark.asyncio
async def test_scan_rag_no_filter_returns_all(self):
"""Verify _scan_rag returns all chunks when no regulation_filter."""
@@ -1064,3 +1116,283 @@ class TestRegulationFilter:
results = await pipeline._scan_rag(config)
assert len(results) == 2
# =============================================================================
# Pipeline Version Tests
# =============================================================================
class TestPipelineVersion:
"""Tests for pipeline_version propagation in DB writes and null handling."""
def test_pipeline_version_constant_is_2(self):
assert PIPELINE_VERSION == 2
def test_store_control_includes_pipeline_version(self):
"""_store_control must pass pipeline_version=PIPELINE_VERSION to the INSERT."""
mock_db = MagicMock()
# Framework lookup returns a UUID
fw_row = MagicMock()
fw_row.__getitem__ = lambda self, idx: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
mock_db.execute.return_value.fetchone.return_value = fw_row
pipeline = ControlGeneratorPipeline(db=mock_db, rag_client=MagicMock())
control = GeneratedControl(
control_id="SEC-TEST-001",
title="Test Control",
objective="Test objective",
)
pipeline._store_control(control, job_id="00000000-0000-0000-0000-000000000001")
# The second call to db.execute is the INSERT
calls = mock_db.execute.call_args_list
assert len(calls) >= 2, f"Expected at least 2 db.execute calls, got {len(calls)}"
insert_call = calls[1]
params = insert_call[0][1] # positional arg 1 = params dict
assert "pipeline_version" in params
assert params["pipeline_version"] == PIPELINE_VERSION
def test_mark_chunk_processed_includes_pipeline_version(self):
"""_mark_chunk_processed must pass pipeline_version=PIPELINE_VERSION to the INSERT."""
mock_db = MagicMock()
pipeline = ControlGeneratorPipeline(db=mock_db, rag_client=MagicMock())
chunk = MagicMock()
chunk.text = "Some chunk text for hashing"
chunk.collection = "bp_compliance_ce"
chunk.regulation_code = "eu_2016_679"
license_info = {"license": "CC0-1.0", "rule": 1}
pipeline._mark_chunk_processed(
chunk=chunk,
license_info=license_info,
processing_path="structured_batch",
control_ids=["SEC-TEST-001"],
job_id="00000000-0000-0000-0000-000000000001",
)
calls = mock_db.execute.call_args_list
assert len(calls) >= 1
insert_call = calls[0]
params = insert_call[0][1]
assert "pipeline_version" in params
assert params["pipeline_version"] == PIPELINE_VERSION
@pytest.mark.asyncio
async def test_structure_batch_handles_null_results(self):
"""When _parse_llm_json_array returns [dict, None, dict], the null entries produce None."""
mock_db = MagicMock()
pipeline = ControlGeneratorPipeline(db=mock_db, rag_client=MagicMock())
# Three chunks
chunks = []
license_infos = []
for i in range(3):
c = MagicMock()
c.text = f"Chunk text number {i} with enough content for processing"
c.regulation_name = "DSGVO"
c.regulation_code = "eu_2016_679"
c.article = f"Art. {i + 1}"
c.paragraph = ""
c.source_url = ""
c.collection = "bp_compliance_ce"
chunks.append(c)
license_infos.append({"rule": 1, "name": "DSGVO", "license": "CC0-1.0"})
# LLM returns a JSON array: valid, null, valid
llm_response = json.dumps([
{
"chunk_index": 1,
"title": "Datenschutz-Kontrolle 1",
"objective": "Schutz personenbezogener Daten",
"rationale": "DSGVO-Konformitaet",
"requirements": ["Req 1"],
"test_procedure": ["Test 1"],
"evidence": ["Nachweis 1"],
"severity": "high",
"tags": ["dsgvo"],
"domain": "DATA",
"category": "datenschutz",
"target_audience": ["unternehmen"],
"source_article": "Art. 1",
"source_paragraph": "",
},
None,
{
"chunk_index": 3,
"title": "Datenschutz-Kontrolle 3",
"objective": "Transparenzpflicht",
"rationale": "Information der Betroffenen",
"requirements": ["Req 3"],
"test_procedure": ["Test 3"],
"evidence": ["Nachweis 3"],
"severity": "medium",
"tags": ["transparenz"],
"domain": "DATA",
"category": "datenschutz",
"target_audience": ["unternehmen"],
"source_article": "Art. 3",
"source_paragraph": "",
},
])
with patch("compliance.services.control_generator._llm_chat", new_callable=AsyncMock) as mock_llm:
mock_llm.return_value = llm_response
controls = await pipeline._structure_batch(chunks, license_infos)
assert len(controls) == 3
assert controls[0] is not None
assert controls[1] is None # Null entry from LLM
assert controls[2] is not None
@pytest.mark.asyncio
async def test_reformulate_batch_handles_null_results(self):
"""When _parse_llm_json_array returns [dict, None, dict], the null entries produce None."""
mock_db = MagicMock()
pipeline = ControlGeneratorPipeline(db=mock_db, rag_client=MagicMock())
chunks = []
for i in range(3):
c = MagicMock()
c.text = f"Restricted chunk text number {i} with BSI content"
c.regulation_name = "BSI TR-03161"
c.regulation_code = "bsi_tr03161"
c.article = f"Section {i + 1}"
c.paragraph = ""
c.source_url = ""
c.collection = "bp_compliance_ce"
chunks.append(c)
config = GeneratorConfig(domain="SEC")
llm_response = json.dumps([
{
"chunk_index": 1,
"title": "Sicherheitskontrolle 1",
"objective": "Authentifizierung absichern",
"rationale": "Best Practice",
"requirements": ["Req 1"],
"test_procedure": ["Test 1"],
"evidence": ["Nachweis 1"],
"severity": "high",
"tags": ["sicherheit"],
"domain": "SEC",
"category": "it-sicherheit",
"target_audience": ["it-abteilung"],
},
None,
{
"chunk_index": 3,
"title": "Sicherheitskontrolle 3",
"objective": "Netzwerk segmentieren",
"rationale": "Angriffsoberflaeche reduzieren",
"requirements": ["Req 3"],
"test_procedure": ["Test 3"],
"evidence": ["Nachweis 3"],
"severity": "medium",
"tags": ["netzwerk"],
"domain": "NET",
"category": "netzwerksicherheit",
"target_audience": ["it-abteilung"],
},
])
with patch("compliance.services.control_generator._llm_chat", new_callable=AsyncMock) as mock_llm:
mock_llm.return_value = llm_response
controls = await pipeline._reformulate_batch(chunks, config)
assert len(controls) == 3
assert controls[0] is not None
assert controls[1] is None # Null entry from LLM
assert controls[2] is not None
# =============================================================================
# Recital (Erwägungsgrund) Detection Tests
# =============================================================================
class TestRecitalDetection:
"""Tests for _detect_recital — identifying Erwägungsgrund text in source."""
def test_recital_number_detected(self):
"""Text with (126)\\n pattern is flagged as recital suspect."""
text = "Daher ist es wichtig...\n(126)\nDie Konformitätsbewertung sollte..."
result = _detect_recital(text)
assert result is not None
assert result["recital_suspect"] is True
assert "126" in result["recital_numbers"]
def test_multiple_recital_numbers(self):
"""Multiple recital markers are all captured."""
text = "(124)\nErster Punkt.\n(125)\nZweiter Punkt.\n(126)\nDritter Punkt."
result = _detect_recital(text)
assert result is not None
assert "124" in result["recital_numbers"]
assert "125" in result["recital_numbers"]
assert "126" in result["recital_numbers"]
def test_article_text_not_flagged(self):
"""Normal article text without recital markers returns None."""
text = ("Der Anbieter eines Hochrisiko-KI-Systems muss sicherstellen, "
"dass die technische Dokumentation erstellt wird.")
result = _detect_recital(text)
assert result is None
def test_empty_text_returns_none(self):
result = _detect_recital("")
assert result is None
def test_none_text_returns_none(self):
result = _detect_recital(None)
assert result is None
def test_recital_phrases_detected(self):
"""Text with multiple recital-typical phrases is flagged."""
text = ("In Erwägung nachstehender Gründe wurde beschlossen, "
"daher sollte der Anbieter folgende Maßnahmen ergreifen. "
"Es ist daher notwendig, die Konformität sicherzustellen.")
result = _detect_recital(text)
assert result is not None
assert result["detection_method"] == "phrases"
def test_single_phrase_not_enough(self):
"""A single recital phrase alone is not sufficient for detection."""
text = "Daher sollte das System regelmäßig geprüft werden."
result = _detect_recital(text)
assert result is None
def test_combined_regex_and_phrases(self):
"""Both recital numbers and phrases → detection_method is regex+phrases."""
text = "(42)\nIn Erwägung nachstehender Gründe wurde entschieden..."
result = _detect_recital(text)
assert result is not None
assert result["detection_method"] == "regex+phrases"
assert "42" in result["recital_numbers"]
def test_parenthesized_number_without_newline_ignored(self):
"""Numbers in parentheses without trailing newline are not recital markers.
e.g. 'gemäß Absatz (3) des Artikels' should not be flagged."""
text = "Gemäß Absatz (3) des Artikels 52 muss der Anbieter sicherstellen..."
result = _detect_recital(text)
assert result is None
def test_real_world_recital_text(self):
"""Real-world example: AI Act Erwägungsgrund (126) about conformity assessment."""
text = (
"(126)\n"
"Um den Verwaltungsaufwand zu verringern und die Konformitätsbewertung "
"zu vereinfachen, sollten bestimmte Hochrisiko-KI-Systeme, die von "
"Anbietern zertifiziert oder für die eine Konformitätserklärung "
"ausgestellt wurde, automatisch als konform mit den Anforderungen "
"dieser Verordnung gelten, sofern sie den harmonisierten Normen oder "
"gemeinsamen Spezifikationen entsprechen.\n"
"(127)\n"
"Es ist daher angezeigt, dass der Anbieter das entsprechende "
"Konformitätsbewertungsverfahren anwendet."
)
result = _detect_recital(text)
assert result is not None
assert "126" in result["recital_numbers"]
assert "127" in result["recital_numbers"]

View File

@@ -214,13 +214,13 @@ Wenn du z.B. eine neue `GetUserStats()` Funktion im Go Service hinzufuegst:
## Modul-spezifische Tests

### Canonical Control Generator (81+ Tests)

Die Control Library hat eine umfangreiche Test-Suite ueber 6 Dateien.
Siehe [Canonical Control Library — Tests](../services/sdk-modules/canonical-control-library.md#tests) und [Control Generator Pipeline](../services/sdk-modules/control-generator-pipeline.md) fuer Details.

```bash
# Alle Generator-Tests (81 Tests in 12 Klassen)
cd backend-compliance && pytest -v tests/test_control_generator.py

# Similarity Detector Tests
@@ -237,10 +237,20 @@ cd backend-compliance && pytest -v tests/test_validate_controls.py
```
**Wichtig:** Die Generator-Tests nutzen Mocks fuer Anthropic-API und Qdrant — sie laufen ohne externe Abhaengigkeiten.

**Testklassen in `test_control_generator.py`:**

| Klasse | Tests | Prueft |
|--------|-------|--------|
| `TestLicenseMapping` | 12 | Lizenz-Klassifikation (Rule 1/2/3), Case-Insensitivitaet |
| `TestDomainDetection` | 5 | Keyword-basierte Domain-Erkennung (AUTH, CRYP, NET, DATA) |
| `TestJsonParsing` | 4 | JSON-Parser fuer LLM-Responses (Markdown-Fencing, Preamble) |
| `TestGeneratedControlRules` | 3 | Rule-spezifische Felder (original_text, citation, source_info) |
| `TestAnchorFinder` | 2 | RAG-Suche + Web-Framework-Erkennung |
| `TestPipelineMocked` | 5 | End-to-End Pipeline mit Mocks (Lizenz, Hash-Dedup, Config) |
| `TestParseJsonArray` | 15 | JSON-Array-Parser (Wrapper-Objekte, Bracket-Extraction, Fallbacks) |
| `TestBatchSizeConfig` | 5 | Batch-Groesse-Konfiguration + Defaults |
| `TestBatchProcessingLoop` | 10 | Batch-Verarbeitung (Rule-Split, Mixed-Rules, Too-Close, Null-Handling) |
| `TestRegulationFilter` | 5 | regulation_filter Prefix-Matching, leere regulation_codes |
| `TestPipelineVersion` | 5 | pipeline_version=2 in DB-Writes, null-Handling in Structure/Reform |
| `TestRecitalDetection` | 10 | Erwaegungsgrund-Erkennung in Quelltexten (Regex, Phrasen, Kombiniert) |

View File

@@ -244,44 +244,36 @@ Der Validator (`scripts/validate-controls.py`) prueft bei jedem Commit:
Automatische Generierung von Controls aus dem gesamten RAG-Korpus (~105.000 Chunks aus Gesetzen, Verordnungen und Standards).
Aktueller Stand: **~4.738 Controls** generiert.

!!! tip "Ausfuehrliche Dokumentation"
    Siehe **[Control Generator Pipeline](control-generator-pipeline.md)** fuer die vollstaendige Referenz inkl. API-Endpoints, Konfiguration, Kosten und Pipeline-Versionen.

### 7-Stufen-Pipeline (v2)

```mermaid
flowchart TD
    A[1. RAG Scan] -->|Alle Chunks laden| B[2. License Classify]
    B -->|Rule 1/2| C[3a. Structure Batch]
    B -->|Rule 3| D[3b. Reform Batch]
    C --> E[4. Harmonize]
    D --> E
    E -->|Duplikat| F[Als Duplikat markieren]
    E -->|Neu| G[5. Anchor Search]
    G --> H[6. Store Control]
    H --> I[7. Mark Processed]
```

!!! info "Pipeline-Version v2 (seit 2026-03-17)"
    - **Kein lokaler Vorfilter mehr** — Anthropic API entscheidet selbst ueber Chunk-Relevanz via null-Returns
    - **Annexe geschuetzt** — Technische Anforderungen in Anhaengen werden nicht mehr uebersprungen
    - **`pipeline_version`** Spalte in DB unterscheidet v1- von v2-Controls

### Stufe 1: RAG Scan

Scrollt durch **ALLE** Chunks in den konfigurierten RAG-Collections mittels Qdrant Scroll-API.
Optionaler `regulation_filter` beschraenkt auf bestimmte Regulierungen per Prefix-Matching.
Bereits verarbeitete Chunks werden per SHA-256-Hash uebersprungen (`canonical_processed_chunks`).

### Stufe 3: Lizenz-Klassifikation (3-Regel-System)

| Regel | Lizenz | Original erlaubt? | Beispiel |

View File

@@ -0,0 +1,573 @@
# Control Generator Pipeline
Automatische Generierung von Canonical Controls aus dem gesamten RAG-Korpus (~105.000 Chunks aus Gesetzen, Verordnungen und Standards).
**Backend:** `backend-compliance/compliance/services/control_generator.py`
**Routes:** `backend-compliance/compliance/api/control_generator_routes.py`
**API-Prefix:** `/api/compliance/v1/canonical/generate`
---
## Pipeline-Uebersicht
Die Pipeline durchlaeuft 7 Stufen, um aus RAG-Chunks eigenstaendige Security/Compliance Controls zu erzeugen:
```mermaid
flowchart TD
A[1. RAG Scan] -->|Alle Chunks laden| B[2. License Classify]
B -->|Rule 1/2| C[3a. Structure Batch]
B -->|Rule 3| D[3b. Reform Batch]
C --> E[4. Harmonize]
D --> E
E -->|Duplikat| F[Als Duplikat markieren]
E -->|Neu| G[5. Anchor Search]
G --> H[6. Store Control]
H --> I[7. Mark Processed]
```
| Stufe | Name | Beschreibung |
|-------|------|-------------|
| 1 | **RAG Scan** | Laedt unverarbeitete Chunks aus Qdrant (Scroll-API), filtert per SHA-256-Hash |
| 2 | **License Classify** | Bestimmt die Lizenzregel (Rule 1/2/3) anhand `regulation_code` |
| 3a | **Structure (Batch)** | Rule 1+2: Strukturiert Originaltext als Control (Anthropic API) |
| 3b | **Reform (Batch)** | Rule 3: Vollstaendige Reformulierung ohne Originaltext (Anthropic API) |
| 4 | **Harmonize** | Embedding-basierte Duplikaterkennung (bge-m3, Cosine > 0.85) |
| 5 | **Anchor Search** | Findet Open-Source-Referenzen (OWASP, NIST, ENISA) |
| 6 | **Store** | Persistiert Control in `canonical_controls` mit Metadaten |
| 7 | **Mark Processed** | Markiert jeden Chunk als verarbeitet (auch bei Skip/Error/Duplikat) |
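Stages 1 and 7 share the same dedup key. A minimal sketch, assuming the hash is taken over the raw chunk text (the real pipeline may normalize whitespace first):

```python
import hashlib

def chunk_hash(chunk_text: str) -> str:
    # Assumed dedup key: SHA-256 over the raw chunk text. Stage 1 skips a
    # chunk if this hash already exists in canonical_processed_chunks.
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def is_already_processed(chunk_text: str, processed_hashes: set) -> bool:
    return chunk_hash(chunk_text) in processed_hashes
```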
---
## Pipeline-Versionen
Die Pipeline hat zwei Versionen. Die Version wird als `pipeline_version` auf `canonical_controls` und `canonical_processed_chunks` gespeichert.
### v1 (Original)
| Eigenschaft | Wert |
|-------------|------|
| **Vorfilter** | Lokales LLM (llama3.2 3B) entscheidet ob Chunk relevant |
| **Anthropic-Prompt** | Alter Prompt ohne null-Skip |
| **Annexe/Anhaenge** | Kein Schutz — wurden haeufig faelschlich als irrelevant uebersprungen |
| **`pipeline_version`** | `1` |
### v2 (Aktuell)
| Eigenschaft | Wert |
|-------------|------|
| **Vorfilter** | Optional (`skip_prefilter`). Wenn aktiviert, entscheidet Anthropic API selbst |
| **Anthropic-Prompt** | Neuer Prompt mit **null-Skip**: API gibt `null` fuer Chunks ohne Anforderung zurueck |
| **Annexe/Anhaenge** | Explizit geschuetzt — Prompt-Anweisung: "Anhaenge/Annexe enthalten oft KONKRETE technische Anforderungen — diese MUESSEN als Control erfasst werden!" |
| **`pipeline_version`** | `2` |
#### Wesentliche Aenderungen v1 → v2
1. **Relevanz-Entscheidung an Anthropic delegiert** — Das lokale LLM (Vorfilter) ist optional. Die Anthropic API entscheidet selbst, welche Chunks Controls enthalten, indem sie `null` fuer irrelevante Chunks zurueckgibt.
2. **null-Skip im JSON-Array** — Das Ergebnis-Array enthaelt `null`-Eintraege fuer Chunks ohne umsetzbare Anforderung. Kein separater Vorfilter-Schritt noetig.
3. **Annexe/Anhaenge geschuetzt** — Explizite Prompt-Anweisung verhindert, dass technische Anforderungen in Anhaengen uebersprungen werden.
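The null-skip contract from point 2 can be sketched as follows (a simplified assumption about the response shape; the real parser `_parse_llm_json_array` is more tolerant of wrapper objects and markdown fencing):

```python
import json
from typing import List, Optional

def parse_batch_response(raw: str, batch_size: int) -> List[Optional[dict]]:
    # Assumed contract: the LLM answers with a JSON array aligned with the
    # input batch; null entries mean "no actionable requirement in this chunk".
    items = json.loads(raw)
    if not isinstance(items, list) or len(items) != batch_size:
        raise ValueError("response array must match the batch size")
    return [item if isinstance(item, dict) else None for item in items]
```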
#### Datenbank-Feld
```sql
-- Migration 062
ALTER TABLE canonical_controls
ADD COLUMN pipeline_version smallint NOT NULL DEFAULT 1;
ALTER TABLE canonical_processed_chunks
ADD COLUMN pipeline_version smallint NOT NULL DEFAULT 1;
```
Neue Controls erhalten automatisch `pipeline_version = 2`. Bestehende (v1) behalten `1`, damit sie spaeter identifiziert und ggf. reprocessiert werden koennen.
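Identifying v1 controls for reprocessing is then a simple filter on the new column — a sketch against a simplified schema (the production table is Postgres; column names per migration 062):

```python
import sqlite3

def find_v1_control_ids(conn: sqlite3.Connection) -> list:
    # pipeline_version = 1 marks controls produced under the old rules that
    # may be worth reprocessing with the v2 pipeline.
    rows = conn.execute(
        "SELECT control_id FROM canonical_controls WHERE pipeline_version = 1"
    ).fetchall()
    return [row[0] for row in rows]
```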
---
## Konfiguration
### Request-Parameter (`GenerateRequest`)
| Parameter | Typ | Default | Beschreibung |
|-----------|-----|---------|-------------|
| `collections` | `List[str]` | Alle 5 Collections | Qdrant-Collections zum Durchsuchen |
| `domain` | `str` | — | Filter auf eine Domain (z.B. `AUTH`, `NET`) |
| `regulation_filter` | `List[str]` | — | Prefix-Matching auf `regulation_code` (z.B. `["eu_2023_1230", "owasp_"]`) |
| `skip_prefilter` | `bool` | `false` | Ueberspringt lokalen LLM-Vorfilter, sendet alle Chunks an die Anthropic API |
| `batch_size` | `int` | `5` | Chunks pro Anthropic-API-Call |
| `max_controls` | `int` | `50` | Maximale Anzahl Controls pro Job (0 = unbegrenzt) |
| `max_chunks` | `int` | `1000` | Maximale Chunks pro Job (0 = unbegrenzt, respektiert Dokumentgrenzen) |
| `skip_web_search` | `bool` | `false` | Ueberspringt Web-Suche in der Anchor-Findung (Stufe 5) |
| `dry_run` | `bool` | `false` | Trockenlauf ohne DB-Schreibzugriffe (synchron, mit Controls im Response) |
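The defaults in the table correspond roughly to this model (a hypothetical mirror for illustration, not the real request model):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerateRequest:
    # None → all five collections (default per the table above)
    collections: Optional[List[str]] = None
    domain: Optional[str] = None
    regulation_filter: Optional[List[str]] = None
    skip_prefilter: bool = False
    batch_size: int = 5
    max_controls: int = 50
    max_chunks: int = 1000
    skip_web_search: bool = False
    dry_run: bool = False
```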
!!! info "`regulation_filter` — Prefix-Matching"
Der Filter vergleicht den `regulation_code` jedes Chunks per Prefix.
Beispiel: `["eu_2023_1230"]` erfasst nur Chunks aus der Maschinenverordnung.
`["owasp_"]` erfasst alle OWASP-Dokumente (OWASP ASVS, OWASP SAMM, etc.).
Gross-/Kleinschreibung wird ignoriert.
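The documented matching behaves like this hypothetical helper (the real check lives inside `_scan_rag`):

```python
from typing import List

def matches_regulation_filter(regulation_code: str, prefixes: List[str]) -> bool:
    # Chunks without a regulation_code never match an active filter;
    # comparison is case-insensitive prefix matching.
    if not regulation_code:
        return False
    code = regulation_code.lower()
    return any(code.startswith(prefix.lower()) for prefix in prefixes)
```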
### Umgebungsvariablen
| Variable | Default | Beschreibung |
|----------|---------|-------------|
| `ANTHROPIC_API_KEY` | — | API-Key fuer Anthropic Claude (Pflicht) |
| `CONTROL_GEN_ANTHROPIC_MODEL` | `claude-sonnet-4-6` | Anthropic-Modell fuer Strukturierung/Reformulierung |
| `OLLAMA_URL` | `http://host.docker.internal:11434` | Lokaler Ollama-Server (Vorfilter + QA) |
| `CONTROL_GEN_OLLAMA_MODEL` | `qwen3.5:35b-a3b` | Lokales LLM-Modell fuer Vorfilter und QA-Arbitrierung |
| `CONTROL_GEN_LLM_TIMEOUT` | `180` | Timeout in Sekunden pro Anthropic-API-Call |
### Pipeline-interne Konstanten
| Konstante | Wert | Beschreibung |
|-----------|------|-------------|
| `PIPELINE_VERSION` | `2` | Aktuelle Pipeline-Version |
| `HARMONIZATION_THRESHOLD` | `0.85` | Cosine-Similarity-Schwelle fuer Duplikaterkennung |
| `max_tokens` | `8192` | Maximale Token-Laenge der LLM-Antwort |
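The duplicate check behind `HARMONIZATION_THRESHOLD` can be sketched as plain cosine similarity over the bge-m3 embeddings:

```python
import math
from typing import Sequence

HARMONIZATION_THRESHOLD = 0.85

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_duplicate(embedding_a: Sequence[float], embedding_b: Sequence[float]) -> bool:
    # Cosine > 0.85 → treat the new control as a duplicate of an existing one.
    return cosine_similarity(embedding_a, embedding_b) > HARMONIZATION_THRESHOLD
```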
---
## API Endpoints
All endpoints live under `/api/compliance/v1/canonical/`.
### Overview
| Method | Path | Description |
|---------|------|-------------|
| `POST` | `/generate` | Start a generation job (runs in the background) |
| `GET` | `/generate/status/{job_id}` | Query the status of a running job |
| `GET` | `/generate/jobs` | List all jobs (paginated) |
| `GET` | `/generate/processed-stats` | Processing statistics per collection |
| `GET` | `/generate/review-queue` | Controls awaiting manual review |
| `POST` | `/generate/review/{control_id}` | Finish the review of a single control |
| `POST` | `/generate/bulk-review` | Bulk review by `release_state` |
| `POST` | `/generate/qa-reclassify` | QA reclassification of existing controls |
| `GET` | `/blocked-sources` | List blocked sources (Rule 3) |
| `POST` | `/blocked-sources/cleanup` | Start the cleanup workflow for blocked sources |
---
### POST `/v1/canonical/generate` — start a job
Starts a generation job in the background and immediately returns a `job_id`.
**Request:**
```json
{
"collections": ["bp_compliance_gesetze"],
"regulation_filter": ["eu_2023_1230"],
"skip_prefilter": false,
"batch_size": 5,
"max_chunks": 500,
"max_controls": 0,
"skip_web_search": false,
"dry_run": false
}
```
**Response (200):**
```json
{
"job_id": "a1b2c3d4-...",
"status": "running",
"message": "Generation started in background. Poll /generate/status/{job_id} for progress."
}
```
**Example:**
```bash
# Process all chunks of the Machinery Regulation
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{
"collections": ["bp_compliance_ce"],
"regulation_filter": ["eu_2023_1230"],
"max_chunks": 200,
"batch_size": 5
}'
```
```bash
# Dry run: no DB changes; controls are returned in the response
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{
"collections": ["bp_compliance_gesetze"],
"max_chunks": 10,
"dry_run": true
}'
```
```bash
# Without pre-filter: all chunks go directly to the Anthropic API
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{
"collections": ["bp_compliance_gesetze"],
"regulation_filter": ["bdsg"],
"skip_prefilter": true,
"max_chunks": 100
}'
```
!!! warning "Mind the cost"
    Without a `regulation_filter` and with `max_chunks: 0`, **all** ~105,000 chunks are processed.
    That incurs substantial Anthropic API cost (~$700).
---
### GET `/v1/canonical/generate/status/{job_id}` — job status
Returns the full status of a job, including metrics and errors.
**Example:**
```bash
curl https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/status/a1b2c3d4-... \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000'
```
**Response:**
```json
{
"id": "a1b2c3d4-...",
"status": "completed",
"total_chunks_scanned": 500,
"controls_generated": 48,
"controls_verified": 45,
"controls_needs_review": 3,
"controls_too_close": 0,
"controls_duplicates_found": 12,
"controls_qa_fixed": 5,
"config": { "..." },
"started_at": "2026-03-17T10:00:00+00:00",
"completed_at": "2026-03-17T10:15:32+00:00"
}
```
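A client typically polls this endpoint until the job leaves the `running` state. A minimal sketch (`poll_job` and `is_terminal` are hypothetical helpers, not part of the API):

```python
import json
import time
import urllib.request

BASE = "https://api-dev.breakpilot.ai/api/compliance/v1/canonical"
HEADERS = {"X-Tenant-ID": "550e8400-e29b-41d4-a716-446655440000"}

def is_terminal(status: str) -> bool:
    """Anything other than 'running' means the job has finished one way or another."""
    return status != "running"

def poll_job(job_id: str, interval: float = 10.0) -> dict:
    """Poll /generate/status/{job_id} until the job is no longer running."""
    while True:
        req = urllib.request.Request(f"{BASE}/generate/status/{job_id}", headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        if is_terminal(payload["status"]):
            return payload
        time.sleep(interval)
```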
---
### GET `/v1/canonical/generate/jobs` — list jobs
Paginated list of all generation jobs.
**Query parameters:**
| Parameter | Default | Description |
|-----------|---------|-------------|
| `limit` | `20` | Number of jobs (1-100) |
| `offset` | `0` | Pagination offset |
**Example:**
```bash
curl "https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/jobs?limit=5" \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000'
```
---
### GET `/v1/canonical/generate/review-queue` — review queue
Lists controls that require manual review.
**Query parameters:**
| Parameter | Default | Description |
|-----------|---------|-------------|
| `release_state` | `needs_review` | Filter: `needs_review`, `too_close`, `duplicate` |
| `limit` | `50` | Count (1-200) |
**Example:**
```bash
curl "https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/review-queue?release_state=needs_review&limit=10" \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000'
```
---
### POST `/v1/canonical/generate/review/{control_id}` — finish a review
Completes the manual review of a control.
**Request:**
```json
{
"action": "approve",
"release_state": "draft",
"notes": "Content is correct, severity fits."
}
```
**Possible `action` values:**
| Action | New state | Description |
|--------|-------------|-------------|
| `approve` | `draft` (or override via `release_state`) | Release the control |
| `reject` | `deprecated` | Discard the control |
| `needs_rework` | `needs_review` | Back into the queue |
**Example:**
```bash
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/review/AUTH-042 \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{"action": "approve", "release_state": "draft"}'
```
---
### POST `/v1/canonical/generate/bulk-review` — bulk review
Changes the `release_state` of all controls that are currently in a given state.
**Request:**
```json
{
"release_state": "needs_review",
"action": "approve",
"new_state": "draft"
}
```
**Example:**
```bash
# Set all needs_review controls to draft
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/bulk-review \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{"release_state": "needs_review", "action": "approve", "new_state": "draft"}'
```
---
### GET `/v1/canonical/generate/processed-stats` — processing statistics
Returns statistics per RAG collection.
**Example:**
```bash
curl https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/processed-stats \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000'
```
**Response:**
```json
{
"stats": [
{
"collection": "bp_compliance_gesetze",
"processed_chunks": 45200,
"direct_adopted": 1850,
"llm_reformed": 120,
"skipped": 43230,
"total_chunks_estimated": 0,
"pending_chunks": 0
}
]
}
```
---
## Cost and performance
### Cost estimate
| Metric | Value |
|--------|------|
| **Cost per chunk** | ~$0.0067 (Anthropic API, batch mode) |
| **Yield (controls/chunks)** | ~4.5-10% (only chunks with concrete requirements yield controls) |
| **Pre-filter savings** | ~55% of API cost when enabled (irrelevant chunks are discarded locally) |
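The figures are mutually consistent, as a quick sanity check shows (numbers taken from the tables on this page; results are approximate):

```python
chunks = 105_000                 # total RAG chunks
cost_per_chunk = 0.0067          # USD, Anthropic API, batch mode

full_run = chunks * cost_per_chunk          # unfiltered full run, roughly $700
with_prefilter = full_run * (1 - 0.55)      # ~55% saved by the local pre-filter

print(f"full run: ${full_run:,.0f}, with pre-filter: ${with_prefilter:,.0f}")
```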
### Performance figures
| Metric | Value |
|--------|------|
| **Batch size** | 5 chunks per API call (default) |
| **API call reduction** | ~80% fewer calls through batching |
| **LLM timeout** | 180 seconds per call |
| **QA overhead** | ~2s per control (only on disagreement, ~10-15% of controls) |
### RAG collections
| Collection | Contents | Expected rule |
|-----------|---------|----------------|
| `bp_compliance_gesetze` | German laws (BDSG, TTDSG, TKG, etc.) | Rule 1 |
| `bp_compliance_datenschutz` | Data-protection guidelines + EU regulations | Rule 1/2 |
| `bp_compliance_ce` | CE/safety standards | Rule 1/2/3 |
| `bp_dsfa_corpus` | DPIA corpus | Rule 1/2 |
| `bp_legal_templates` | Legal templates | Rule 1 |
### Current scale
| Metric | Value |
|--------|------|
| Total RAG chunks | ~105,000 (after dedup 2026-03-16) |
| Processed chunks | ~105,000 |
| Generated controls | **~4,738** |
| Conversion rate | ~4.5% |
---
## License classification (3-rule system)
Each chunk is assigned a license rule based on its `regulation_code`:
| Rule | Type | Original allowed? | Examples |
|-------|-----|-------------------|----------|
| **Rule 1** (free_use) | EU laws, NIST, German laws, public domain | Yes | GDPR, BDSG, NIS2, AI Act |
| **Rule 2** (citation_required) | CC-BY, CC-BY-SA | Yes, with citation | OWASP ASVS, OWASP SAMM |
| **Rule 3** (restricted) | Proprietary | No, full reformulation | BSI TR-03161, ISO 27001 |
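A sketch of how a chunk's `regulation_code` could map to a license rule (the prefix lists are illustrative examples, not the pipeline's complete mapping):

```python
FREE_USE_PREFIXES = ("eu_", "bdsg", "nist_")   # Rule 1 — example prefixes only
CITATION_PREFIXES = ("owasp_",)                 # Rule 2 — example prefixes only
# Everything else is treated as restricted (Rule 3): BSI TRs, ISO standards, ...

def license_rule(regulation_code: str) -> int:
    """Classify a chunk into rule 1, 2, or 3 by regulation_code prefix."""
    code = regulation_code.lower()
    if code.startswith(FREE_USE_PREFIXES):
        return 1
    if code.startswith(CITATION_PREFIXES):
        return 2
    return 3
```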
### Processing by rule
- **Rule 1+2 → `_structure_batch()`**: Anthropic structures the original text into a control. One API call for the whole batch.
- **Rule 3 → `_reformulate_batch()`**: Anthropic reformulates completely — no original text, no source names. One API call for the whole batch.
### Batch processing
The pipeline collects chunks into batches (default: 5 chunks) and sends each batch in a single Anthropic API call.
1. Relevant chunks are collected in `pending_batch` together with their license info
2. When `batch_size` is reached → `_flush_batch()`
3. The batch is split by license rule: Rule 1+2 → `_structure_batch()`, Rule 3 → `_reformulate_batch()`
4. Result: a JSON array with exactly N elements (`null` for irrelevant chunks)
**Fallback:** On a batch failure (timeout, parsing error), the pipeline automatically falls back to per-chunk processing.
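The collect-and-flush steps above can be sketched as follows (simplified; `flushed` stands in for the actual `_structure_batch()`/`_reformulate_batch()` API calls):

```python
BATCH_SIZE = 5
pending_batch: list[dict] = []
flushed: list[list[dict]] = []   # each entry represents one API call

def flush_batch() -> None:
    """Split the pending batch by license rule and 'send' each sub-batch."""
    if not pending_batch:
        return
    rule_1_2 = [c for c in pending_batch if c["license_rule"] in (1, 2)]
    rule_3 = [c for c in pending_batch if c["license_rule"] == 3]
    for sub in (rule_1_2, rule_3):
        if sub:
            flushed.append(sub)
    pending_batch.clear()

def add_chunk(chunk: dict) -> None:
    pending_batch.append(chunk)
    if len(pending_batch) >= BATCH_SIZE:
        flush_batch()
```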
---
## Chunk tracking (processed chunks)
### Table `canonical_processed_chunks`
| Column | Type | Description |
|--------|-----|-------------|
| `chunk_hash` | VARCHAR(64) | SHA-256 hash of the chunk text |
| `collection` | VARCHAR(100) | Qdrant collection |
| `regulation_code` | VARCHAR(100) | Source regulation (e.g. `bdsg`, `eu_2016_679`) |
| `license_rule` | INTEGER | 1, 2, or 3 |
| `processing_path` | VARCHAR(20) | How the chunk was processed |
| `generated_control_ids` | JSONB | UUIDs of the generated controls |
| `pipeline_version` | SMALLINT | Pipeline version (1 or 2) |
| `job_id` | UUID | Reference to the generation job |
**UNIQUE constraint:** `(chunk_hash, collection, document_version)` — prevents double processing.
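Assuming the hash is a plain SHA-256 over the UTF-8 chunk text (as the column description states; any extra normalization the pipeline applies is not documented here), the dedupe key can be reproduced like this:

```python
import hashlib

def chunk_hash(chunk_text: str) -> str:
    """SHA-256 hex digest of the chunk text — 64 chars, matching VARCHAR(64)."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

seen: set[tuple[str, str, str]] = set()

def already_processed(text: str, collection: str, doc_version: str) -> bool:
    """In-memory analogue of the (chunk_hash, collection, document_version) constraint."""
    key = (chunk_hash(text), collection, doc_version)
    if key in seen:
        return True
    seen.add(key)
    return False
```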
### Processing paths
| Value | Stage | Meaning |
|------|-------|-----------|
| `prefilter_skip` | 2 | Local LLM pre-filter: chunk not relevant |
| `structured` | 3a | Single chunk structured (Rule 1/2) |
| `structured_batch` | 3a | Batch structuring (Rule 1/2) |
| `llm_reform` | 3b | Single chunk reformulated (Rule 3) |
| `llm_reform_batch` | 3b | Batch reformulation (Rule 3) |
| `no_control` | 3 | LLM could not derive a control (null in the array) |
| `store_failed` | 6 | DB write failed |
| `error` | — | Unexpected error |
---
## QA validation (automatic quality check)
The QA stage validates the classification of every generated control:
1. **LLM category:** Anthropic returns `category` and `domain` in the JSON response
2. **Keyword detection:** `_detect_category(chunk.text)` provides a second opinion
3. **Do both agree?** → Fast path (no QA needed)
4. **On disagreement:** the local LLM (Ollama) arbitrates
5. **Auto-fix:** category/domain are corrected automatically
The QA metrics are stored in `generation_metadata`:
```json
{
"qa_category_fix": {"from": "authentication", "to": "finance", "reason": "IFRS topic"},
"qa_domain_fix": {"from": "AUTH", "to": "FIN", "reason": "financial regulation"}
}
```
### Recital detection (Erwägungsgrund detection)
The QA stage additionally checks whether a control's `source_original_text` actually comes from a legal article — or from a recital (Erwägungsgrund). Recitals contain no normative obligations, so they lead to incorrectly attributed controls.
**Detection methods:**
| Method | Pattern | Example |
|---------|---------|----------|
| **Regex** | `\((\d{1,3})\)\s*\n` — recital numbers | `(126)\nUm den Verwaltungsaufwand...` |
| **Phrases** | Typical recital wording (≥2 hits) | "daher sollte", "in Erwägung nachstehender Gründe" |
**On suspicion:**
- `release_state` is set to `needs_review`
- `generation_metadata.recital_suspect = true`
- `generation_metadata.recital_detection` contains the details:
```json
{
"recital_suspect": true,
"recital_detection": {
"recital_suspect": true,
"recital_numbers": ["126", "127"],
"recital_phrases": ["daher sollte"],
"detection_method": "regex+phrases"
}
}
```
**Function:** `_detect_recital(text)` in `control_generator.py`
**Background:** An analysis of ~5,500 controls with source text flagged 1,555 (28%) as recital suspects. The document crawler did not distinguish between article text and recitals, which led to wrong `article`/`paragraph` attributions.
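A minimal re-implementation of both detection methods (the regex is the documented pattern; the phrase list is a small illustrative subset of what `_detect_recital()` checks):

```python
import re

RECITAL_RE = re.compile(r"\((\d{1,3})\)\s*\n")
# Illustrative subset of typical recital phrases (the real list is longer)
RECITAL_PHRASES = ["daher sollte", "in erwägung nachstehender gründe"]

def detect_recital(text: str) -> dict:
    """Flag text that looks like a recital rather than article text."""
    numbers = RECITAL_RE.findall(text)
    phrases = [p for p in RECITAL_PHRASES if p in text.lower()]
    suspect = bool(numbers) or len(phrases) >= 2
    return {
        "recital_suspect": suspect,
        "recital_numbers": numbers,
        "recital_phrases": phrases,
    }
```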
### QA reclassification of existing controls
```bash
# Dry run: which AUTH controls are misclassified?
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/qa-reclassify \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{"limit": 50, "dry_run": true, "filter_domain_prefix": "AUTH"}'
# Apply the fixes:
curl -X POST https://api-dev.breakpilot.ai/api/compliance/v1/canonical/generate/qa-reclassify \
-H 'Content-Type: application/json' \
-H 'X-Tenant-ID: 550e8400-e29b-41d4-a716-446655440000' \
-d '{"limit": 50, "dry_run": false, "filter_domain_prefix": "AUTH"}'
```
---
## Source files
| File | Description |
|-------|-------------|
| `backend-compliance/compliance/services/control_generator.py` | 7-stage pipeline with batch processing |
| `backend-compliance/compliance/api/control_generator_routes.py` | REST API endpoints |
| `backend-compliance/compliance/services/license_gate.py` | License gate logic |
| `backend-compliance/compliance/services/similarity_detector.py` | Too-close detector (5 metrics) |
| `backend-compliance/compliance/services/rag_client.py` | RAG client (Qdrant search + scroll) |
| `backend-compliance/migrations/046_control_generator.sql` | Job-tracking and chunk-tracking tables |
| `backend-compliance/migrations/048_processing_path_expand.sql` | Extended processing-path values |
| `backend-compliance/migrations/062_pipeline_version.sql` | `pipeline_version` column |
| `backend-compliance/tests/test_control_generator.py` | 81+ tests (license, domain, batch, pipeline, recital) |
---
## Related documentation
- [Canonical Control Library (CP-CLIB)](canonical-control-library.md) — domains, data model, too-close detector, CI/CD validation
- [Multi-Layer Control Architecture](canonical-control-library.md#multi-layer-control-architecture) — 10-stage pipeline extension with obligations, patterns, crosswalk


`mkdocs.yml` — nav entry for the new page:

```diff
@@ -103,6 +103,7 @@ nav:
   - Dokumentengenerierung: services/sdk-modules/dokumentengenerierung.md
   - Policy-Bibliothek (29 Richtlinien): services/sdk-modules/policy-bibliothek.md
   - Canonical Control Library (CP-CLIB): services/sdk-modules/canonical-control-library.md
+  - Control Generator Pipeline: services/sdk-modules/control-generator-pipeline.md
   - Strategie:
   - Wettbewerbsanalyse & Roadmap: strategy/wettbewerbsanalyse.md
   - Entwicklung:
```

`scripts/find_recital_controls.py` — detection script for bulk analysis (new file):
"""Find controls where source_original_text contains Erwägungsgrund (recital) markers
instead of actual article text — indicates wrong article tagging in RAG chunks."""
import sqlalchemy
import os
import json
import re
url = os.environ.get("DATABASE_URL", "")
if not url:
print("DATABASE_URL not set")
exit(1)
engine = sqlalchemy.create_engine(url)
with engine.connect() as conn:
conn.execute(sqlalchemy.text("SET search_path TO compliance,public"))
r = conn.execute(sqlalchemy.text("""
SELECT control_id, title,
source_citation::text,
source_original_text,
pipeline_version, release_state,
generation_metadata::text
FROM canonical_controls
WHERE source_original_text IS NOT NULL
AND source_original_text != ''
AND source_citation IS NOT NULL
ORDER BY control_id
""")).fetchall()
# Pattern: standalone recital number like (125)\n or (126) at line start
recital_re = re.compile(r'\((\d{1,3})\)\s*\n')
# Pattern: article reference like "Artikel 43" in the text
artikel_re = re.compile(r'Artikel\s+(\d+)', re.IGNORECASE)
suspects_recital = []
suspects_mismatch = []
for row in r:
cid, title, citation_json, orig, pv, state, meta_json = row
if not orig:
continue
citation = json.loads(citation_json) if citation_json else {}
claimed_article = citation.get("article", "")
# Check 1: Recital markers in source text
recital_matches = recital_re.findall(orig)
has_recital = len(recital_matches) > 0
# Check 2: Text mentions a different article than claimed
artikel_matches = artikel_re.findall(orig)
claimed_num = re.search(r'\d+', claimed_article).group() if re.search(r'\d+', claimed_article) else ""
different_articles = [a for a in artikel_matches if a != claimed_num] if claimed_num else []
if has_recital:
suspects_recital.append({
"control_id": cid,
"title": title[:80],
"claimed_article": claimed_article,
"claimed_paragraph": citation.get("paragraph", ""),
"recitals_found": recital_matches[:5],
"v": pv,
"state": state,
})
print(f"=== Ergebnis ===")
print(f"Geprueft: {len(r)} Controls mit source_original_text")
print(f"Erwaegungsgrund-Verdacht: {len(suspects_recital)}")
print()
if suspects_recital:
print(f"{'Control':<12} {'Behauptet':<18} {'Recitals':<20} {'v':>2} {'State':<15} Titel")
print("-" * 120)
for s in suspects_recital:
recitals = ",".join(s["recitals_found"])
print(f"{s['control_id']:<12} {s['claimed_article']:<10} {s['claimed_paragraph']:<7} ({recitals}){'':<{max(0,15-len(recitals))}} v{s['v']} {s['state']:<15} {s['title']}")