feat(pipeline): pipeline_version v2, migration 062, docs + 71 tests

- Add PIPELINE_VERSION=2 constant and pipeline_version column to
  canonical_controls and canonical_processed_chunks (migration 062)
- Anthropic API decides chunk relevance via null-returns (skip_prefilter)
- Annex/appendix chunks explicitly protected in prompts
- Fix 6 failing tests (CRYP domain, _process_batch tuple return)
- Add TestPipelineVersion + TestRegulationFilter test classes (10 new tests)
- Add MkDocs page: control-generator-pipeline.md (541 lines)
- Update canonical-control-library.md with v2 pipeline diagram
- Update testing.md with 71-test breakdown table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-03-17 17:31:11 +01:00
parent 653aad57e3
commit a9e0869205
7 changed files with 815 additions and 48 deletions

View File

@@ -214,13 +214,13 @@ Wenn du z.B. eine neue `GetUserStats()` Funktion im Go Service hinzufuegst:
## Modul-spezifische Tests
### Canonical Control Generator (82 Tests)
### Canonical Control Generator (71+ Tests)
Die Control Library hat eine umfangreiche Test-Suite ueber 6 Dateien.
Siehe [Canonical Control Library — Tests](../services/sdk-modules/canonical-control-library.md#tests) fuer Details.
Siehe [Canonical Control Library — Tests](../services/sdk-modules/canonical-control-library.md#tests) und [Control Generator Pipeline](../services/sdk-modules/control-generator-pipeline.md) fuer Details.
```bash
# Alle Generator-Tests
# Alle Generator-Tests (71 Tests in 10 Klassen)
cd backend-compliance && pytest -v tests/test_control_generator.py
# Similarity Detector Tests
@@ -237,10 +237,19 @@ cd backend-compliance && pytest -v tests/test_validate_controls.py
```
**Wichtig:** Die Generator-Tests nutzen Mocks fuer Anthropic-API und Qdrant — sie laufen ohne externe Abhaengigkeiten.
Die `TestPipelineMocked`-Klasse prueft insbesondere:
- Korrekte Lizenz-Klassifikation (Rule 1/2/3 Verhalten)
- Rule 3 exponiert **keine** Quellennamen in `generation_metadata`
- SHA-256 Hash-Deduplizierung fuer Chunks
- Config-Defaults (`batch_size: 5`, `skip_processed: true`)
- Rule 1 Citation wird korrekt mit Gesetzesreferenz generiert
**Testklassen in `test_control_generator.py`:**
| Klasse | Tests | Prueft |
|--------|-------|--------|
| `TestLicenseMapping` | 12 | Lizenz-Klassifikation (Rule 1/2/3), Case-Insensitivitaet |
| `TestDomainDetection` | 5 | Keyword-basierte Domain-Erkennung (AUTH, CRYP, NET, DATA) |
| `TestJsonParsing` | 4 | JSON-Parser fuer LLM-Responses (Markdown-Fencing, Preamble) |
| `TestGeneratedControlRules` | 3 | Rule-spezifische Felder (original_text, citation, source_info) |
| `TestAnchorFinder` | 2 | RAG-Suche + Web-Framework-Erkennung |
| `TestPipelineMocked` | 5 | End-to-End Pipeline mit Mocks (Lizenz, Hash-Dedup, Config) |
| `TestParseJsonArray` | 15 | JSON-Array-Parser (Wrapper-Objekte, Bracket-Extraction, Fallbacks) |
| `TestBatchSizeConfig` | 5 | Batch-Groesse-Konfiguration + Defaults |
| `TestBatchProcessingLoop` | 10 | Batch-Verarbeitung (Rule-Split, Mixed-Rules, Too-Close, Null-Handling) |
| `TestRegulationFilter` | 5 | regulation_filter Prefix-Matching, leere regulation_codes |
| `TestPipelineVersion` | 5 | pipeline_version=2 in DB-Writes, null-Handling in Structure/Reform |