feat: Document-centric scan results + DSI deduplication

DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
  the parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents
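
The heading rule above can be sketched in plain Python. This is a minimal illustration, not the consent-tester code: it assumes the page has already been flattened into (tag, text) pairs, and the names `Document`, `group_documents` and the shortened `DSI_KEYWORDS` tuple are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical subset of keywords; the real list in the diff is longer.
DSI_KEYWORDS = ("datenschutz", "privacy")
TOP_LEVEL_TAGS = {"h1", "h2"}


@dataclass
class Document:
    title: str
    parts: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.parts)


def group_documents(elements: list) -> list:
    """Group a flat sequence of (tag, text) pairs into top-level documents."""
    docs: list = []
    current: Optional[Document] = None
    for tag, text in elements:
        if tag.lower() in TOP_LEVEL_TAGS:
            # Every H1/H2 closes the previous document, whether or not
            # the new heading matches a DSI keyword.
            is_dsi = any(kw in text.lower() for kw in DSI_KEYWORDS)
            current = Document(title=text) if is_dsi else None
            if current is not None:
                docs.append(current)
        elif current is not None:
            # H3+ headings and body text stay inside the parent document.
            current.parts.append(text)
    return docs
```

With this grouping, an H3 such as "Cookies" below an H2 "Datenschutzerklärung" adds to that document's text instead of creating a second document.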

Backend (agent_scan_routes):
- ScanFinding gets a doc_title field linking each finding to its document
- doc_title is set when DSI findings are created, for document attribution
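
A rough sketch of how such a field might look. Only the names ScanFinding and doc_title come from this commit; the dataclass form, the other fields and the `make_dsi_finding` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanFinding:
    # `code` and `message` are illustrative placeholders, not the real schema.
    code: str
    message: str
    # None means the finding is not tied to a specific document.
    doc_title: Optional[str] = None


def make_dsi_finding(code: str, message: str, doc_title: str) -> ScanFinding:
    """Hypothetical helper: create a finding attributed to one document."""
    return ScanFinding(code=code, message=message, doc_title=doc_title)
```

Defaulting doc_title to None keeps existing call sites valid and gives the frontend a simple test for general findings.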

Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to the "Allgemeine Findings" (general findings) section

Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- Replaces the previous flat, mixed list
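
The grouping described above could look roughly like this. It is a sketch under assumptions, not the code in agent_scan_helpers: findings are reduced to (doc_title, message) pairs, and `group_findings` and `render_summary` are hypothetical names.

```python
from typing import Optional

GENERAL_SECTION = "Allgemeine Findings"


def group_findings(findings: list) -> dict:
    """Group (doc_title, message) pairs by document, general findings last."""
    grouped: dict = {}
    general: list = []
    for doc_title, message in findings:
        if doc_title:
            grouped.setdefault(doc_title, []).append(message)
        else:
            general.append(message)
    if general:
        # Inserted last so the general section follows all documents.
        grouped[GENERAL_SECTION] = general
    return grouped


def render_summary(grouped: dict) -> str:
    """Render one heading line per section, followed by its findings."""
    lines: list = []
    for section, messages in grouped.items():
        lines.append(section)
        lines.extend(f"  - {message}" for message in messages)
    return "\n".join(lines)
```

Documents keep the order in which their first finding appears, since Python dicts preserve insertion order.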

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-05 09:56:29 +02:00
parent d816cf8d3a
commit 7c7513525e
4 changed files with 210 additions and 91 deletions
+12 -6
@@ -444,13 +444,18 @@ async def _expand_all_interactive(page: Page) -> None:
 async def _find_inline_dsi_sections(page: Page) -> list[dict]:
-    """Find DSI content already visible on the page (e.g. expanded accordions)."""
+    """Find DSI content already visible on the page (e.g. expanded accordions).
+    Only counts top-level documents (H1/H2 with DSI keywords).
+    Sub-sections (H3/H4 like 'Cookies', 'Betroffenenrechte') are NOT counted
+    as separate documents — their text is part of the parent document.
+    """
     try:
         sections = await page.evaluate("""
             () => {
                 const results = [];
-                // Find headings that match DSI keywords
-                const headings = document.querySelectorAll('h1, h2, h3, h4, h5');
+                // Only H1 and H2 count as document-level headings
+                const headings = document.querySelectorAll('h1, h2');
                 const dsiKeywords = [
                     'datenschutz', 'privacy', 'données', 'privacidad', 'protezione',
                     'gegevensbescherming', 'ochrona danych', 'tietosuoja', 'integritet',
@@ -461,12 +466,13 @@ async def _find_inline_dsi_sections(page: Page) -> list[dict]:
                     const textLower = text.toLowerCase();
                     if (!dsiKeywords.some(kw => textLower.includes(kw))) continue;
-                    // Get the section content following this heading
+                    // Get ALL content until the next H1/H2 (include sub-sections H3-H5)
                     let content = '';
                     let el = h.nextElementSibling;
                     let count = 0;
-                    while (el && count < 50) {
-                        if (el.tagName.match(/^H[1-5]$/)) break;
+                    while (el && count < 200) {
+                        // Stop at next H1 or H2 (next top-level document)
+                        if (el.tagName === 'H1' || el.tagName === 'H2') break;
                         content += (el.textContent || '').trim() + '\\n';
                         el = el.nextElementSibling;
                         count++;