feat: Document-centric scan results + DSI deduplication

DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
  parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents

Backend (agent_scan_routes):
- ScanFinding gets doc_title field linking each finding to its document
- doc_title set when creating DSI findings for document attribution

Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to "Allgemeine Findings" section

Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- No more flat mixed list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
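The grouping rule described for the frontend and the email summary (findings with a doc_title go under that document, the rest go to a general section) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: only the names ScanFinding and doc_title come from the message, while the message field, the group_findings helper, and the sample findings are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanFinding:
    """Hypothetical stand-in for the backend model; only doc_title is named in the commit."""
    message: str  # assumed field, for illustration only
    doc_title: Optional[str] = None  # None means the finding belongs to no document


def group_findings(findings):
    """Split findings into per-document groups and a general bucket (hypothetical helper)."""
    by_document = defaultdict(list)
    general = []
    for finding in findings:
        if finding.doc_title:
            by_document[finding.doc_title].append(finding)
        else:
            general.append(finding)
    return dict(by_document), general


# Made-up sample data: two findings attributed to one document, one unattributed.
findings = [
    ScanFinding("Retention period missing", doc_title="Datenschutzerklärung"),
    ScanFinding("No consent banner detected"),
    ScanFinding("Legal basis not stated", doc_title="Datenschutzerklärung"),
]
by_document, general = group_findings(findings)
```

Keeping the general bucket separate from the per-document mapping mirrors the "Allgemeine Findings" section, so a renderer can iterate over documents first and append unattributed findings at the end.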
@@ -444,13 +444,18 @@ async def _expand_all_interactive(page: Page) -> None:
 
 
 async def _find_inline_dsi_sections(page: Page) -> list[dict]:
-    """Find DSI content already visible on the page (e.g. expanded accordions)."""
+    """Find DSI content already visible on the page (e.g. expanded accordions).
+
+    Only counts top-level documents (H1/H2 with DSI keywords).
+    Sub-sections (H3/H4 like 'Cookies', 'Betroffenenrechte') are NOT counted
+    as separate documents — their text is part of the parent document.
+    """
     try:
         sections = await page.evaluate("""
             () => {
                 const results = [];
-                // Find headings that match DSI keywords
-                const headings = document.querySelectorAll('h1, h2, h3, h4, h5');
+                // Only H1 and H2 count as document-level headings
+                const headings = document.querySelectorAll('h1, h2');
                 const dsiKeywords = [
                     'datenschutz', 'privacy', 'données', 'privacidad', 'protezione',
                     'gegevensbescherming', 'ochrona danych', 'tietosuoja', 'integritet',
@@ -461,12 +466,13 @@ async def _find_inline_dsi_sections(page: Page) -> list[dict]:
                     const textLower = text.toLowerCase();
                     if (!dsiKeywords.some(kw => textLower.includes(kw))) continue;
 
-                    // Get the section content following this heading
+                    // Get ALL content until the next H1/H2 (include sub-sections H3-H5)
                     let content = '';
                     let el = h.nextElementSibling;
                     let count = 0;
-                    while (el && count < 50) {
-                        if (el.tagName.match(/^H[1-5]$/)) break;
+                    while (el && count < 200) {
+                        // Stop at next H1 or H2 (next top-level document)
+                        if (el.tagName === 'H1' || el.tagName === 'H2') break;
                         content += (el.textContent || '').trim() + '\\n';
                         el = el.nextElementSibling;
                         count++;
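To make the effect of the new stop condition easier to see outside the browser, here is a rough Python model of the same traversal. It assumes the page is represented as a flat list of (tag, text) siblings, which is a simplification of the real DOM; collect_documents, the shortened keyword tuple, and the shape of the returned dicts are all hypothetical and not taken from the diff.

```python
DSI_KEYWORDS = ("datenschutz", "privacy")  # shortened; the list in the diff is longer
MAX_SIBLINGS = 200  # same cap as the new loop in the diff


def collect_documents(nodes):
    """Model of the H1/H2-only rule over a flat list of (tag, text) siblings."""
    documents = []
    for index, (tag, text) in enumerate(nodes):
        # Only H1 and H2 headings can start a document.
        if tag not in ("H1", "H2"):
            continue
        if not any(kw in text.lower() for kw in DSI_KEYWORDS):
            continue
        parts = []
        # Walk following siblings, keeping H3-H5 sub-sections as part of the text.
        for sibling_tag, sibling_text in nodes[index + 1 : index + 1 + MAX_SIBLINGS]:
            if sibling_tag in ("H1", "H2"):
                break  # next top-level document starts here
            parts.append(sibling_text.strip())
        documents.append({"title": text.strip(), "text": "\n".join(parts)})
    return documents


# Made-up page: one privacy document with a 'Cookies' sub-section, then an imprint.
nodes = [
    ("H2", "Datenschutzerklärung"),
    ("P", "Intro"),
    ("H3", "Cookies"),
    ("P", "We use cookies."),
    ("H2", "Impressum"),
    ("P", "Address"),
]
docs = collect_documents(nodes)
```

With the previous stop condition (any H1 to H5), the collected text for this sample would have ended before the 'Cookies' heading; under the H1/H2 rule the sub-section heading and its paragraph stay inside the parent document.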