breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	e9f368d3ec	feat(ocr-pipeline): add abbreviation allowlist to noise filter Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations (sth, sb, etc, eg, ie, usw, bzw, vgl, adj, adv, prep, sg, pl, ...). Tokens matching known abbreviations are never stripped as noise. Also handle dotted abbreviations (e.g., z.B., i.e.) that have no 2+ consecutive alpha chars by checking the abbreviation set before the _RE_REAL_WORD filter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 10:46:54 +01:00
Benjamin Admin	3028f421b4	feat(ocr-pipeline): add cell text noise filter for OCR artifacts Add _clean_cell_text() with three sub-filters to remove OCR noise: - _is_garbage_text(): vowel/consonant ratio check for phantom row garbage - _is_noise_tail_token(): dictionary-based trailing noise detection - _RE_REAL_WORD check for cells with no real words (just fragments) Handles balanced parentheses "(auf)" and trailing hyphens "under-" as legitimate tokens while stripping noise like "Es)", "3", "ee", "B". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 10:19:31 +01:00
Benjamin Admin	2b1c499d54	fix(ocr-pipeline): filter OCR noise from image areas and artifacts Two generic noise filters added to _ocr_single_cell(): 1. Word confidence filter (conf < 30): removes low-confidence words before text assembly. Catches trailing artifacts like "Es)" after real text, and standalone noise from image edges. 2. Cell noise filter: clears cells whose entire text has no real alphabetic word (>= 2 letters). Catches fragments like "E:", "3", "u", "D", "2.77", "and )" from image areas, while keeping real short words like "Ei", "go", "an". Both filters apply to word-lookup AND cell-OCR fallback results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 09:56:54 +01:00
Benjamin Admin	72cc77dcf4	fix(ocr-pipeline): cells = result, no post-processing content shuffling The cell grid IS the result. Each cell stays at its detected position. Removed _split_comma_entries and _attach_example_sentences from the pipeline — they were shuffling content between rows/columns, causing "Mäuse" to appear in a separate row, "stand..." to move to Example, and "Ei" to disappear. Now: cells → _cells_to_vocab_entries (1:1 row mapping) → _fix_character_confusion → _fix_phonetic_brackets → done. Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR fallback so small text like "Ei" is not filtered out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 09:41:30 +01:00
Benjamin Admin	e3f939a628	refactor(ocr-pipeline): make post-processing fully generic Three non-generic solutions replaced with universal heuristics: 1. Cell-OCR fallback: instead of restricting to column_en/column_de, now checks pixel density (>2% dark pixels) for ANY column type. Truly empty cells are skipped without running Tesseract. 2. Example-sentence detection: instead of checking for example-column text (worksheet-specific), now uses sentence heuristics (>=4 words or ends with sentence punctuation). Short EN text without DE is kept as a vocab entry (OCR may have missed the translation). 3. Comma-split: re-enabled with singular/plural detection. Pairs like "mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like "break, broke, broken" are still split into individual entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 09:27:30 +01:00
Benjamin Admin	6bca3370e0	fix(ocr-pipeline): fix vocab post-processing destroying correct cell results Three bugs in the post-processing pipeline were overwriting correct streaming results with wrong ones: 1. _split_comma_entries was splitting "Maus, Mäuse" into two separate entries. Disabled — word forms belong together. 2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise due to `len(de) > 2` threshold. Lowered to `len(de) > 1`. 3. _attach_example_sentences wrongly classified rows with EN text but no DE (like "stand ...") as example sentences, merging them into the previous entry. Now only treats rows as examples if they also have no text in the example column. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 09:16:50 +01:00
Benjamin Admin	befc44d2dd	perf(ocr-pipeline): limit cell-OCR fallback to EN/DE columns only Skip Tesseract fallback for column_example cells which are often legitimately empty. This reduces ~48 Tesseract calls to ~10, cutting Step 5 fallback time from ~13s to ~3s. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 09:01:08 +01:00
Benjamin Admin	6db3c02db4	fix(admin-lehrer): force unique build ID to bust browser caches Next.js was producing the same chunk hash across builds, causing browsers to serve stale cached JS even after redeployment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 08:54:05 +01:00
Benjamin Admin	8f2c2e8f68	feat(ocr-pipeline): hybrid word-lookup with cell-OCR fallback Word-lookup from full-page Tesseract is fast but can miss small or isolated words (e.g. "Ei"). Now falls back to per-cell Tesseract OCR for cells that remain empty after word-lookup. The ocr_engine field reports 'cell_ocr_fallback' for cells that needed the fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 08:21:12 +01:00
Benjamin Admin	50ad06f43a	fix(ocr-pipeline): always run fresh word detection, skip stale cache Word-lookup is now ~0.03s (vs seconds with per-cell Tesseract), so always re-run detection when entering Step 5 instead of showing potentially stale cached word_result from the session DB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 08:05:13 +01:00
Benjamin Admin	2c4160e4c4	fix(ocr-pipeline): exclusive word-to-column assignment prevents duplicates Replace per-cell word filtering (which allowed the same word to appear in multiple columns due to padded overlap) with exclusive nearest-center assignment. Each word is assigned to exactly one column per row. Also use row height as Y-tolerance for text assembly so words within the same row (e.g. "Maus, Mäuse") are always grouped on one line. Fixes: words leaking into wrong columns, missing words, duplicate words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 07:54:45 +01:00
Benjamin Admin	9bbde1c03e	fix(ocr-pipeline): re-populate row.words for word-lookup in Step 5 The row_result stored in DB excludes words to keep payload small. When Step 5 reconstructs RowGeometry from DB, words were empty, causing word-lookup to find nothing and return blank cells. Now re-populates row.words from cached _word_dicts (or re-runs detect_column_geometry if cache is cold) before cell grid building. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 07:38:33 +01:00
Benjamin Admin	77869e32f4	feat(ocr-pipeline): use word-lookup instead of cell-OCR for cell grid Replace per-cell Tesseract re-runs with lookup of pre-existing full-page words from row.words. Words are filtered by X-overlap with column bounds. This fixes phantom rows with garbage text, missing last words, and incomplete example text by using the more reliable full-page OCR results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 07:24:46 +01:00
Benjamin Admin	89b5f49918	fix(ocr-pipeline): filter phantom rows with word_count=0 from cell grid Rows in inter-line whitespace gaps have no Tesseract words assigned but were still processed by build_cell_grid, producing garbage OCR output. Filter these phantom rows using the word_count field set during Step 4. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 18:40:13 +01:00
Benjamin Admin	7f27783008	feat(ocr-pipeline): add SSE streaming for word recognition (Step 5) Cells now appear one-by-one in the UI as they are OCR'd, with a live progress bar, instead of waiting for the full result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:54:20 +01:00
Benjamin Admin	a666e883da	fix(ocr-pipeline): exclude header/footer/page_ref from cell grid columns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:33:48 +01:00
Benjamin Admin	27b895a848	feat(ocr-pipeline): generic cell-grid with optional vocab mapping Extract build_cell_grid() as layout-agnostic foundation from build_word_grid(). Step 5 now produces a generic cell grid (columns x rows) and auto-detects whether vocab layout is present. Frontend dynamically switches between vocab table (EN/DE/Example) and generic cell table based on layout type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:22:56 +01:00
Benjamin Admin	3bcb7aa638	fix(ocr-pipeline): remove overzealous grid row count validation The validation that rejected word-center grid when it produced more rows than gap-based detection was causing fallback to gap-based rows (large boxes). The word-center grid regularization works correctly after the center-based grouping and cluster merging fixes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 13:01:27 +01:00
Benjamin Admin	c4f2e6554e	fix(ocr-pipeline): prevent grid from producing more rows than gap-based Two fixes: 1. Grid validation: reject word-center grid if it produces MORE rows than gap-based detection (more rows = lines were split = worse). Falls back to gap-based rows in that case. 2. Words overlay: draw clean grid cells (column × row intersections) instead of padded entry bboxes. Eliminates confusing double lines. OCR text labels are placed inside the grid cells directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:52:41 +01:00
Benjamin Admin	8e861e5a4d	fix(ocr-pipeline): use gap-based row height for cluster tolerance The y_tolerance for word-center clustering was based on median word height (21px → 12px tolerance), which was too small. Words on the same line can have centers 15-20px apart due to different heights. Now uses 40% of the gap-based median row height as tolerance (e.g. 40px row → 16px tolerance), and 30% for merge threshold. This produces correct cluster counts matching actual text lines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:34:15 +01:00
Benjamin Admin	4970ca903e	fix(ocr-pipeline): invalidate downstream results when steps are re-run When columns change (Step 3), invalidate row_result and word_result. When rows change (Step 4), invalidate word_result. This ensures Step 5 always uses the latest row boundaries instead of showing stale cached word_result from a previous run. Applies to both auto-detection and manual override endpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:24:44 +01:00
Benjamin Admin	97d4355aa9	fix(ocr-pipeline): group words by vertical center, merge close clusters Fix half-height rows caused by tall special characters (brackets, IPA symbols) being split into separate line clusters: - Group words by vertical CENTER instead of TOP position, so tall characters on the same line stay in one cluster - Filter outlier-height words (>2× median) when computing letter_h so brackets/IPA don't skew the row height - Merge clusters closer than 0.4× median word height (definitely same text line despite slight center differences) - Increased y_tolerance from 0.5× to 0.6× median word height - Enhanced logging with cluster merge count and row height range Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:14:42 +01:00
Benjamin Admin	8ad5823fd8	feat(ocr-pipeline): word-center grid with section-break detection Replace rigid uniform grid with bottom-up approach that derives row boundaries from word vertical centers: - Group words into line clusters, compute center_y per cluster - Compute pitch (distance between consecutive centers) - Detect section breaks where gap > 1.8× median pitch - Place row boundaries at midpoints between consecutive centers - Per-section local pitch adapts to heading/paragraph spacing - Validate ≥85% word placement, fallback to gap-based rows Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:04:08 +01:00
Benjamin Admin	ec47045c15	feat(ocr-pipeline): uniform grid regularization for row detection (Step 7) Replace _split_oversized_rows() with _regularize_row_grid(). When ≥60% of content rows have consistent height (±25% of median), overlay a uniform grid with the standard row height over the entire content area. This leverages the fact that books/vocab lists use constant row heights. Validates grid by checking ≥85% of words land in a grid row. Falls back to gap-based rows if heights are too irregular or words don't fit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:50:50 +01:00
Benjamin Admin	ba65e47654	feat(ocr-pipeline): move oversized row splitting from Step 5 to Step 4 Implement _split_oversized_rows() in detect_row_geometry() (Step 7) to split content rows >1.5× median height using local horizontal projection. This produces correctly-sized rows before word OCR runs, instead of working around the issue in Step 5 with sub-cell splitting hacks. Removed Step 5 workarounds: _split_oversized_entries(), sub-cell splitting in build_word_grid(), and median_row_h calculation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:46:18 +01:00
Benjamin Admin	8507e2e035	fix(ocr-pipeline): split oversized cells before OCR to capture all text For cells taller than 1.5× median row height, split vertically into sub-cells and OCR each separately. This fixes RapidOCR losing text at the bottom of tall cells (e.g. "floor/Fußboden" below "egg/Ei" in a merged row). Generic fix — works for any oversized cell. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:32:10 +01:00
Benjamin Admin	854d8b431b	feat(rag-qa): add 14 missing PDF mappings for EDPB, ENISA, EDPS, TMG, UrhG Adds entries for all regulation codes in REGULATIONS_IN_RAG that were missing from RAG_PDF_MAPPING, fixing "Kein PDF-Mapping" messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:10:09 +01:00
Benjamin Admin	f2521d2b9e	feat(ocr-pipeline): British/American IPA pronunciation choice - Integrate Britfone dictionary (MIT, 15k British English IPA entries) - Add pronunciation parameter: 'british' (default) or 'american' - British uses Britfone (Received Pronunciation), falls back to CMU - American uses eng_to_ipa/CMU, falls back to Britfone - Frontend: dropdown to switch pronunciation, default = British - API: ?pronunciation=british|american query parameter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:08:52 +01:00
Benjamin Admin	954d21e469	fix: use local Inter font to avoid Google Fonts timeout in Docker build The Docker container cannot reach Google Fonts, causing build failures. Switch to bundled local font file using next/font/local. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:26:34 +01:00
Benjamin Admin	010616be5a	fix(ocr-pipeline): generic example attachment + cell padding 1. Semantic example matching: instead of attaching example sentences to the immediately preceding entry, find the vocab entry whose English word(s) appear in the example. "a broken arm" → matches "broken" via word overlap, not "egg/Ei". Uses stem matching for word form variants (break/broken share stem "bro"). 2. Cell padding: add 8px padding to each cell region so words at column/row edges don't get clipped by OCR (fixes "er wollte" missing at cell boundaries). 3. Treat very short DE text (≤2 chars) as OCR noise, not real translation — prevents false positives in example detection. All fixes are generic and deterministic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:24:28 +01:00
Benjamin Admin	e3aa8e899e	feat(rag-qa): add fullscreen mode for split-view chunk browser Allows viewing chunks side-by-side with original PDF in fullscreen mode for large screen QA review. Toggle via button or close with Escape key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:23:32 +01:00
Benjamin Admin	266b9dfad3	Fix PDF 404: default to bp_compliance_ce collection, add PDF existence check Default collection changed from bp_compliance_gesetze (DE/AT/CH laws where PDFs need manual download) to bp_compliance_ce (EU regulations where PDFs are auto-downloaded). Added HEAD request check so missing PDFs show a clear "PDF nicht vorhanden" message instead of a 404 in the iframe. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:13:26 +01:00
Benjamin Admin	ab294d5a6f	feat(ocr-pipeline): deterministic post-processing pipeline Add 4 post-processing steps after OCR (no LLM needed): 1. Character confusion fix: I/1/l/| correction using cross-language context (if DE has "Ich", EN "1" → "I") 2. IPA dictionary replacement: detect [phonetics] brackets, look up correct IPA from eng_to_ipa (MIT, 134k words) — replaces OCR'd phonetic symbols with dictionary-correct transcription 3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen" → 3 individual entries when part counts match 4. Example sentence attachment: rows with EN but no DE translation get attached as examples to the preceding vocab entry All fixes are deterministic and generic — no hardcoded word lists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:00:09 +01:00
Benjamin Admin	b48cd8bb46	Fix ChunkBrowserQA layout: proper height constraints, remove bottom nav duplication - Root container uses calc(100vh - 220px) for fixed viewport height - All flex children use min-h-0 to enable proper overflow scrolling - Removed duplicate bottom nav buttons (Zurueck/Weiter) that appeared in the middle of the chunk text — navigation is only in the header now - Chunk text panel scrolls internally with fixed header - Added prominent article/section badges in header and panel header - Added chunk length quality indicator (warns on very short/long chunks) - Structural metadata keys (article, section, pages) sorted first - Sidebar shows regulation name instead of code for better readability - PDF viewer uses pages metadata from payload when available Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 20:24:50 +01:00
Benjamin Admin	d481e0087b	deps: add eng-to-ipa for IPA dictionary lookup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 20:23:40 +01:00
Benjamin Admin	f7e0f2bb4f	feat(ocr-pipeline): line breaks, hyphen rejoin & oversized row splitting - Preserve \n between visual lines within cells (instead of joining with space) - Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden) - Split oversized rows (>1.5× median height) into sub-entries when EN/DE line counts match — deterministic fix for missed Step 4 row boundaries - Frontend: render \n as <br/>, use textarea for multiline editing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:49:28 +01:00
Benjamin Admin	e7fb9d59f1	Fix ChunkBrowserQA: use regulation_id from Qdrant payload instead of regulation_code The Qdrant collections use regulation_id (e.g. eu_2016_679) as the filter key, not regulation_code (e.g. GDPR). Updated rag-constants.ts with correct qdrant_id mappings from actual Qdrant data, fixed API to filter on regulation_id, and updated ChunkBrowserQA to pass qdrant_id values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:22:12 +01:00
Benjamin Admin	859342300e	fix(ocr-pipeline): configure RapidOCR for German + tighter word detection - Switch to PP-OCRv5 Latin model (supports ä, ö, ü, ß) - Use SERVER model for better accuracy - Lower Det.unclip_ratio 1.6→1.3 to reduce word merging - Raise Det.box_thresh 0.5→0.6 for stricter detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 18:17:49 +01:00
Benjamin Admin	8c42fefa77	feat(rag): add QA Split-View Chunk-Browser for ingestion verification New ChunkBrowserQA component replaces inline chunk browser with: - Document sidebar with live chunk counts per regulation (batched Qdrant count API) - Sequential chunk navigation with arrow keys (1/N through all chunks of a document) - Overlap display showing previous/next chunk boundaries (amber-highlighted) - Split-view with original PDF via iframe (estimated page from chunk index) - Adjustable chunks-per-page ratio for PDF page estimation Extracts REGULATIONS_IN_RAG and REGULATION_INFO to shared rag-constants.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:46:11 +01:00
Benjamin Admin	984dfab975	fix(ocr-pipeline): add libgl1 for RapidOCR OpenCV dependency RapidOCR pulls in full opencv-python which requires libGL.so.1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:30:12 +01:00
Benjamin Admin	45435f226f	feat(ocr-pipeline): line grouping fix + RapidOCR integration Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to correctly order words in multi-line cells (fixes word reordering bug). RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX Runtime, native ARM64). Engine selectable via dropdown in UI or ?engine= query param. Auto mode prefers RapidOCR when available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:13:58 +01:00
Benjamin Admin	4ec7c20490	feat(ocr-pipeline): add rapidocr + onnxruntime to requirements RapidOCR uses PaddleOCR models on ONNX Runtime, works natively on ARM64. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 17:08:21 +01:00
Benjamin Admin	17604b8eb2	test: add tests for API proxy scroll/collection-count and Chunk-Browser logic 42 tests covering: - Qdrant scroll endpoint proxy (offset, limit, filters, text search) - Collection-count endpoint - REGULATION_SOURCES URL validation (IFRS, EFRAG, ENISA, NIST, OECD) - Chunk-Browser collections, text search filtering, pagination state Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 16:46:42 +01:00
Benjamin Admin	f39314fb27	docs: add Chunk-Browser documentation - Document Chunk-Browser tab functionality and API - Cover scroll endpoint, text search, pagination - Document Originalquelle links and low-chunk warnings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 09:50:36 +01:00
Benjamin Admin	356d39d6ee	fix(ocr-pipeline): use PSM 6 (block) for multi-line cell OCR in word grid PSM 7 (single line) missed the second line in cells with two lines. PSM 6 handles multi-line content. Also fix sort order to Y-then-X for correct reading order. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 09:40:04 +01:00
Benjamin Admin	491df4e1b0	feat: add Chunk-Browser tab to RAG page - New 'Chunk-Browser' tab for sequential chunk browsing - Qdrant scroll API proxy (scroll + collection-count actions) - Pagination with prev/next through all chunks in a collection - Text search filter with highlighting - Click to expand chunk and see all metadata - 'In Chunks suchen' button now navigates to Chunk-Browser with correct collection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 09:35:52 +01:00
Benjamin Admin	954103cdf2	feat(ocr-pipeline): add Step 5 word recognition (grid from columns × rows) Backend: build_word_grid() intersects column regions with content rows, OCRs each cell with language-specific Tesseract, and returns vocabulary entries with percent-based bounding boxes. New endpoints: POST /words, GET /image/words-overlay, ground-truth save/retrieve for words. Frontend: StepWordRecognition with overview + step-through labeling modes, goToStep callback for row correction feedback loop. MkDocs: OCR Pipeline documentation added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 02:18:29 +01:00
Benjamin Admin	47dc2e6f7a	feat(rag): source URLs, low-chunk warnings & IFRS/EFRAG entries - Add REGULATION_SOURCES map with 88 original document URLs for all regulations (EUR-Lex, gesetze-im-internet.de, RIS, Fedlex, etc.) - Render "Originalquelle →" link in regulation detail panel - Add amber warning indicator for suspiciously low chunk counts (<10) - Add EU_IFRS_DE, EU_IFRS_EN, EFRAG_ENDORSEMENT to RAG tracking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 01:56:09 +01:00
Benjamin Admin	203b3c0e2d	fix(ocr-pipeline): mask out images in row detection horizontal projection Build a word-coverage mask so only pixels near Tesseract word bounding boxes contribute to the horizontal projection. Image regions (high ink but no words) are treated as white, preventing illustrations from merging multiple vocabulary rows into one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 01:39:20 +01:00
Benjamin Admin	b58aecd081	feat(ocr-pipeline): add Step 4 row detection UI in admin frontend Insert rows step between columns and words in the pipeline wizard. Shows overlay image, row list with type badges, and ground truth controls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 01:28:05 +01:00

197 Commits