breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	ea09fc75df	fix: resolve circular import with lazy import for _build_reference_snapshot Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:18:21 +01:00
Benjamin Admin	410d36f3de	feat: save automatic grid snapshot before manual edits for GT comparison - build-grid now saves the automatic OCR result as ground_truth.auto_grid_snapshot - mark-ground-truth includes a correction_diff comparing auto vs corrected - New endpoint GET /correction-diff returns detailed diff with per-col_type accuracy breakdown (english, german, ipa, etc.) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:16:44 +01:00
Benjamin Admin	40815dafd1	feat(ocr-pipeline): add page-split endpoint for double-page book spreads Each page of a double-page scan tilts differently due to the book spine. The new POST /page-split endpoint detects spreads after orientation and creates sub-sessions that go through the full pipeline (deskew, dewarp, crop, etc.) individually, so each page gets its own deskew correction. Also fixes border-strip filter incorrectly removing German translation words by adding a decorative-strip validation check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 10:53:06 +01:00
Benjamin Admin	65f4ce1947	feat: ImageLayoutEditor, arrow-key nav, multi-select bold, wider columns - New ImageLayoutEditor: SVG overlay on original scan with draggable column dividers, horizontal guidelines (margins/header/footer), double-click to add columns, x-button to delete - GridTable: MIN_COL_WIDTH 40→80px for better readability - Arrow up/down keys navigate between rows in the grid editor - Ctrl+Click for multi-cell selection, Ctrl+B to toggle bold on selection - getAdjacentCell works for cells that don't exist yet (new rows/cols) - deleteColumn now merges x-boundaries correctly - Session restore fix: grid_editor_result/structure_result in session GET - Footer row 3-state cycle, auto-create cells for empty footer rows - Grid save/build/GT-mark now advance current_step=11 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 07:45:39 +01:00
Benjamin Admin	4a44ad7986	fix: hard-filter OCR words inside detected graphic regions Run detect_graphic_elements() in the grid pipeline after image loading and remove ALL words whose centroids fall inside detected graphic regions, regardless of confidence. Previously only low-confidence words (conf < 50) were removed, letting artifacts like "Tr", "Su" survive. Changes: - grid_editor_api.py: Import and call detect_graphic_elements() at Step 3a, passing only significant words (len >= 3) to avoid short artifacts fooling the text-vs-graphic heuristic. Hard-filter all words in graphic regions. - cv_graphic_detect.py: Lower density threshold from 20% to 5% for large regions (>100x80px) — photos/illustrations have low color saturation. Raise page-spanning limit from 50% to 60% width/height. Tested: 5 ground-truth sessions pass regression (079cd0d9, d8533a2c, 2838c7a7, 4233d7e3, 5997b635). Session 5997 now detects 2 graphic regions and removes 29 artifact words including "Tr" and "Su". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 10:18:23 +01:00
Benjamin Admin	7b3319be2e	fix: merge syllable-split word_boxes + keep dictionary guide words OCR splits words at syllable marks into overlapping word_boxes (e.g. "zu" + "tiefst" with 52% x-overlap). Step 5i previously removed the lower-confidence box, losing the prefix. Now: when both boxes are alphabetic text with 20-75% overlap, MERGE them into one word_box ("zutiefst") instead of removing. Also relaxed artifact cell filter: 2-char alphabetic text like "Zw" (dictionary guide word) is no longer removed. Only non-alphabetic short text like "a=" is filtered. Results for session 5997: "tiefst"→"zutiefst", "zu"→"zuständig", "Zu die Zuschüsse"→"Zuschuss, die Zuschüsse", "Zw" restored. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 08:21:00 +01:00
Benjamin Admin	882b177fc3	fix: remove image-area artifacts + fix heading false positive for dictionary entries Three fixes for dictionary page session 5997: 1. Heading detection: column_1 cells with article words (die/der/das) now count as content cells, preventing "die Zuschrift, die Zuschriften" from being falsely merged into a spanning heading cell. 2. Step 5j-pre: new artifact cell filter removes short garbled text from OCR on image areas (e.g. "7 EN", "Tr", "\\", "PEE", "a="). Cells survive earlier filters because their rows have real content in other columns. Also cleans up empty rows after removal. 3. Footer "PEE" auto-fixed: artifact filter removes the noise cell, empty row gets cleaned up, footer detection no longer sees it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 07:59:24 +01:00
Benjamin Admin	1fae39dbb8	fix: lower secondary column threshold + strip pipe chars from word_boxes Dictionary pages have 2 dictionary columns, each with article + headword sub-columns. The right article column (die/der at x≈626) had only 14.3% row coverage — below the 20% secondary threshold. Lowered to 12% so dictionary article columns qualify. Also strip pipe characters from individual word_box text (not just cell text) to remove OCR syllable separation marks (e.g. "zu|trau|en" → "zutrauen"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 07:44:03 +01:00
Benjamin Admin	46c8c28d34	fix: border strip pre-filter + 3-column detection for vocabulary tables The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly removed base words along with edge artifacts. Now uses a two-stage approach: 1. _filter_border_strip_words() pre-filters raw words BEFORE column detection, scanning from the page edge inward to find the FIRST significant gap (>30px) 2. Step 4e runs as fallback only when pre-filter didn't apply Session 4233 now correctly detects 3 columns (base word | oder | synonyms) instead of 2. Threshold raised from 15% to 20% to handle pages with many edge artifacts. All 4 ground-truth sessions pass regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 21:01:43 +01:00
Benjamin Admin	4000110501	fix: extend tiny symbol filter to all non-black colors, raise area to 200 Step 5i rule (a) only caught blue tiny symbols. Graphic fragments from page illustrations (e.g. orange quote mark from man illustration) were missed. Now filters any non-black colored word_box with area < 200 and confidence < 85. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 18:05:31 +01:00
Benjamin Admin	c0e1118870	feat: detect and remove page-border decoration strip artifacts (Step 4e) Textbooks with decorative alphabet strips along page edges produce OCR artifacts (scattered colored letters at x<150 while real content starts at x>=179). Step 4e detects a significant x-gap (>30px) between a small cluster (<15% of total word_boxes) near the page edge and the main content, then removes the border-strip word_boxes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 17:20:45 +01:00
Benjamin Admin	f31a7175a2	fix: normalize word_box order to reading order for frontend display (Step 5j) The frontend renders colored cells from the word_boxes array order, not from cell.text. After post-processing steps (5i bullet removal etc), word_boxes could remain in their original insertion order instead of left-to-right reading order. Step 5j now explicitly sorts word_boxes using _group_words_into_lines before the result is built. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 19:21:37 +01:00
Benjamin Admin	bacbfd88f1	Fix word ordering in cell text rebuild (Steps 4c, 4d, 5i) Cell text was rebuilt using naive (top, left) sorting after removing word_boxes in Steps 4c/4d/5i. This produced wrong word order when words on the same visual line had slightly different top values (1-6px). Now uses _words_to_reading_order_text() which groups words into visual lines by y-tolerance before sorting by x within each line, matching the initial cell text construction in _build_cells. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:45:33 +01:00
Benjamin Admin	2c63beff04	Fix bullet overlap disambiguation + raise red threshold to 90 Step 5i: For word_boxes with >90% x-overlap and different text, use IPA dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not). Red threshold raised from 80 to 90 to catch remaining scanner artifacts like "tight" and "5" that were still misclassified as red. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:21:00 +01:00
Benjamin Admin	82433b4bad	Step 5i: Remove blue bullet/artifact and overlapping duplicate word_boxes Dictionary pages have small blue square bullets before entries that OCR reads as text artifacts. Three detection rules: a) Tiny blue symbols (area < 150, conf < 85): catches ©, e, * etc. b) X-overlapping word_boxes (>40%): remove lower confidence one c) Duplicate blue text with gap < 6px: remove one copy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:17:07 +01:00
Benjamin Admin	45b83560fd	Vertical zone split: detect divider lines and create independent sub-zones Pages with two side-by-side vocabulary columns separated by a vertical black line are now split into independent sub-zones before row/column detection. Each sub-zone gets its own rows, preventing misalignment from different heading rhythms. - _detect_vertical_dividers(): finds pipe word_boxes at consistent x positions spanning >50% of zone height - _split_zone_at_vertical_dividers(): creates left/right PageZone objects with layout_hint and vsplit_group metadata - Column union skips vsplit zones (independent column sets) - Frontend renders vsplit zones side by side via flex layout - PageZone gets layout_hint + vsplit_group fields Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 16:38:12 +01:00
Benjamin Admin	76ba83eecb	Tighten tertiary column detection: require 4+ rows and 5% coverage Prevents false narrow columns from text overflow at page edges. Session 355f3c84 had a 3-row/4% tertiary cluster creating a spurious third column from right-column text overflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:50:03 +01:00
Benjamin Admin	04092a0a66	Fix Step 5h: reject grammar patterns in slash-IPA, convert trailing variants - Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up) - Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation) - Validate standalone /ipa/ at start against same reject pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:40:28 +01:00
Benjamin Admin	7fafd297e7	Step 5h: convert slash-delimited IPA to bracket notation with dict lookup Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/). Step 5h detects these patterns, looks up the headword in the IPA dictionary for proper Unicode IPA, and falls back to OCR text when not found. Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:36:08 +01:00
Benjamin Admin	7ac09b5941	Filter pipe-character word_boxes from OCR column divider artifacts Step 4d removes "|" and "||" word_boxes that OCR produces when reading physical vertical divider lines between columns. Also strips stray pipe chars from cell text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:09:50 +01:00
Benjamin Admin	a579c31ddb	Fix IPA continuation: skip words with inline IPA, recover emptied cells Three fixes: 1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly. 2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → ""). 3. Added 2 tests for inline IPA behavior (35 total). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:31:54 +01:00
Benjamin Admin	0f9c0d2ad0	Keep footer rows in table, mark with is_footer + col_type=footer Footer rows like "two hundred and twelve" are no longer removed from the grid. Instead they stay in cells/rows and get tagged so the frontend can render them differently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:08:25 +01:00
Benjamin Admin	278067fe20	Fix page_ref extraction: only extract cells matching page-ref pattern Column_1 cells like "to" (infinitive markers) were incorrectly extracted as page_refs. Now only cells matching p.70, ,.65, or bare digits are treated as page references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:55:55 +01:00
Benjamin Admin	d76fb2a9c8	Fix page_ref + footer extraction: extract individual cells, skip IPA footers Step 5g now extracts column_1 cells individually as page_refs (instead of requiring the whole row to be column_1-only), and footer detection skips rows containing real IPA Unicode symbols to avoid false positives on IPA continuation rows like [sˈiː] – [sˈɔː] – [sˈiːn]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:47:39 +01:00
Benjamin Admin	9681fcbd05	Strip IPA from headings + extract page_refs and footer from table - Step 5f: Remove dictionary IPA from headings detected after IPA correction (e.g. "Theme [θˈiːm]" → "Theme") - Step 5g: Extract page_ref rows (column_1 only, e.g. "p.70") and footer rows (last single-cell row, e.g. page number "212") from the vocabulary table into zone-level metadata (page_refs, footer) so the frontend can render them separately Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:42:53 +01:00
Benjamin Admin	4290f70885	Fix unbracketed IPA continuations: detect garbled IPA in single-cell rows Step 5d now also processes IPA continuations without brackets (e.g. "ska:f – ska:vz", "'sekandarr sku:l") when the row has only 1 content cell and the text is pure-ASCII garbled IPA (no real IPA Unicode symbols). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:30:44 +01:00
Benjamin Admin	5c935eec23	Refine garbled IPA filter: skip only pure-ASCII garbled text, not text with real IPA "Theme [θˈiːm]" contains real IPA symbols (θ, ˈ) and should NOT be filtered. Only filter text that has garbled IPA markers (:, ') but no real Unicode IPA chars. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:15:51 +01:00
Benjamin Admin	c4a5cd2d8a	Skip garbled IPA text in single-cell heading detection Unbracketed IPA continuations like "ska:f – ska:vz" were falsely detected as headings. Now _text_has_garbled_ipa() filters them out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:11:02 +01:00
Benjamin Admin	bc5ab29c06	Fix false positive: exclude first/last rows from single-cell heading detection Page numbers like "two hundred and twelve" in the last row were falsely detected as headings. Now first and last non-header rows are excluded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:06:05 +01:00
Benjamin Admin	7c5d95b858	Fix heading col_index + detect black single-cell headings like "Theme" - Color headings now preserve actual starting col_index instead of hardcoded 0 - New _detect_heading_rows_by_single_cell: detects rows with only 1 content cell (excl. page_ref) as headings — catches black headings like "Theme" that have normal color/height but are alone in their row - Runs after Step 5d (IPA continuation) to avoid false positives - 5 new tests (32 total) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:00:06 +01:00
Benjamin Admin	58c9565ba5	Fix en_col_type detection: use bracket IPA count instead of longest avg text The previous heuristic picked the column with the longest average text as the English headword column. In layouts with long example sentences, this picked the wrong column (examples instead of headwords). Now counts cells with bracket patterns per column — the column with the most brackets is the headword column where IPA needs fixing. Fixes garbled OCR-IPA like "change [tfeind3]" → "change [tʃˈeɪndʒ]". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 06:50:47 +01:00
Benjamin Admin	92a7b85c2d	Fix IPA continuation: only process fully-bracketed cells, keep phrasal verb particles Two fixes: 1. Step 5d now only treats cells as continuation when text is entirely inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets (e.g. "employee [im'ploi:]") are no longer overwritten. 2. fix_ipa_continuation_cell no longer skips grammar words like "down" — they are part of the headword in phrasal verbs like "close sth. down". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 00:43:51 +01:00
Benjamin Admin	5f89913a9a	Fix IPA continuation to check all columns, not just en_col_type The en_col_type heuristic (longest avg text) picks the example column, missing IPA continuation cells in the actual headword column. Now Step 5d checks all column_* cells for garbled IPA patterns independently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:34:41 +01:00
Benjamin Admin	6bfa9eed86	Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1] - Detect bracketed text without real IPA symbols as garbled OCR phonetics - Allow IPA continuation fix even when other columns have content (for rows where EN cell is clearly garbled bracketed IPA) - Strip parenthetical grammar annotations like (no pl) from headword before IPA lookup in fix_ipa_continuation_cell Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:28:00 +01:00
Benjamin Admin	7750b2a05f	Fix ghost filter for borderless boxes + remove oversized graphic artifacts 1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics have no border lines to produce OCR artifacts like |, I) 2. Remove individual word_boxes with height > 3x zone median (OCR from graphics like a huge "N" from a map image below text) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:04:00 +01:00
Benjamin Admin	e3395ae8cf	Fix overlay word leak, ghost filter false positive, merged zone header 1. Filter words inside image_overlays (removes OCR from images) 2. Ghost filter: only remove single-char border artifacts, not multi-char like (= which is real content 3. Skip first-row header detection for zones with image_overlays (merged geometry creates artificial gaps) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 13:56:04 +01:00
Benjamin Admin	df30d4eae3	Add zone merging across images + heading detection by color/height Zone merging: content zones separated by box zones (images) are merged into a single zone with image_overlays, so split tables reconnect. Heading detection: after color annotation, rows where all words are non-black and taller than 1.2x median are merged into spanning heading cells. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 12:22:11 +01:00
Benjamin Admin	fc0ab84e40	Fix garbled IPA in continuation rows using headword lookup IPA continuation rows (phonetic transcription that wraps below the headword) now get proper IPA by looking up headwords from the row above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:28:14 +01:00
Benjamin Admin	050d410ba0	Preserve IPA continuation rows in grid output Stop removing rows that contain only phonetic transcription below the headword. These rows are valid content that users need to see. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 10:22:58 +01:00
Benjamin Admin	432eee3694	Auto-filter decorative margin strips and header junk - _filter_decorative_margin: Phase 2 now also removes short words (<=3 chars) in the same narrow x-range as the detected single-char strip, catching multi-char OCR artifacts like "Vv" from alphabet graphics. - _filter_header_junk: New filter detects the content start (first row with 3+ high-confidence words) and removes low-conf short fragments above it that are OCR artifacts from header illustrations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 09:38:24 +01:00
Benjamin Admin	f9d71d50d1	Add exclude region marking in Structure step Users can now draw rectangles on the document image in the Structure Detection step to mark areas (e.g. header graphics, alphabet strips) that should be excluded from OCR results during grid building. - Backend: PUT/DELETE endpoints for exclude regions stored in structure_result - Backend: _build_grid_core() filters all words inside user-defined exclude regions - Frontend: Interactive rectangle drawing with visual overlay and delete buttons - Preserve exclude regions when re-running structure detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 09:08:30 +01:00
Benjamin Admin	f655db30e4	Add Ground Truth regression test system for OCR pipeline Extract _build_grid_core() from build_grid() endpoint for reuse. New ocr_pipeline_regression.py with endpoints to mark sessions as ground truth, list them, and run regression comparisons after code changes. Frontend button in StepGroundTruth.tsx to mark/update GT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 13:46:48 +01:00
Benjamin Admin	c894a0feeb	Improve IPA continuation row detection with phonetic heuristics Strip IPA brackets that fix_cell_phonetics may have added for short dictionary words (e.g. "si" → "[si]") before checking if the row is a garbled phonetic continuation. Detect phonetic text by presence of ':' (length marks), leading apostrophe (stress marks), or absence of any word with ≥3 letters. Fixes Row 39 ("si: [si] — So: - si:n") not being removed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 12:08:21 +01:00
Benjamin Admin	8ef4c089cf	Remove IPA continuation rows and support hyphenated word lookup - grid_editor_api: After IPA correction, detect rows containing only garbled phonetics in the English column (no German translation, no IPA brackets inserted). These are wrap-around lines where printed IPA extends to the line below the headword. Remove them since the headword row already has correct IPA. - cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form as fallback (e.g. "second-hand" → "secondhand") for dictionary lookup, fixing IPA insertion for compound words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 12:05:38 +01:00
Benjamin Admin	821e5481c2	Only apply IPA correction on vocabulary tables (≥3 columns) Single-column German text pages were getting IPA inserted for words that happen to exist in the English dictionary ("die" → [dˈaɪ], "Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3 columns, which is the minimum for a vocabulary table layout (English | article | German). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:50:03 +01:00
Benjamin Admin	f139d0903e	Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid - _merge_inline_marker_columns: skip merge when ≥50% of words are alphabetic (preserves "to", "in", "der" columns) - Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &") - IPA phonetics: map longest-avg-text column to column_en so fix_cell_phonetics runs in the grid editor - ocr_pipeline_overlays: add missing split_page_into_zones import Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:08:23 +01:00
Benjamin Admin	962bbbe9f6	Remove scattered debris rows and disable spanning header detection - Add Rule 3 to junk-row filter: rows where no word is longer than 2 chars are removed as scattered OCR debris from illustrations - Fully disable spanning-header detection which falsely flagged IPA transcriptions and vocabulary entries as spanning headers - First-row heuristic remains for genuine header detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 10:47:17 +01:00
Benjamin Admin	9da45c2a59	Fix false header detection and add decorative margin/footer filters - Remove all_colored spanning header heuristic that falsely flagged colored vocabulary entries (Scotland, secondary school) as headers - Add _filter_decorative_margin: removes vertical A-Z alphabet strips along page margins (single-char words in a compact vertical strip) - Add _filter_footer_words: removes page numbers in bottom 5% of page - Tighten spanning header rule: require ≥3 columns spanned + ≤3 words Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 10:38:20 +01:00
Benjamin Admin	00cbf266cb	Add oversized-stub filter for large page numbers/marks in grid rows Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median are removed as non-content elements (e.g. red page number "( 9"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 09:05:07 +01:00
Benjamin Admin	f9bad7beaa	Filter phantom rows from recovered color artifacts and low-conf OCR noise - Apply recovered-artifact filter to ALL zones (was box-zones only) - Filter any recovered word with text ≤ 2 chars (not just !?•·) - Add post-grid junk-row removal: rows where all word_boxes have conf < 50 and text ≤ 3 chars are dropped as OCR noise Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 09:00:43 +01:00
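
The junk-row rules described in commits f9bad7beaa, 00cbf266cb and 962bbbe9f6 can be sketched roughly as follows. This is an illustration only, not the repository's code: the `Word` dataclass, the function names, and the 0-100 confidence scale are assumptions, and the "text ≤ 3 chars" condition is read here as applying to each word. The thresholds are the ones quoted in those commit messages; f139d0903e later widened the oversized-stub rule to ≤3 words / ≤5 chars, which the parameters below would allow.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for an OCR word box."""
    text: str
    conf: float    # OCR confidence, assumed to be on a 0-100 scale
    height: float  # bounding-box height in pixels


Row = List[Word]


def is_low_conf_noise(row: Row) -> bool:
    # Every word has conf < 50 and at most 3 characters.
    return bool(row) and all(w.conf < 50 and len(w.text) <= 3 for w in row)


def is_oversized_stub(row: Row, median_height: float,
                      max_words: int = 2, max_chars: int = 3) -> bool:
    # Few words, very little text, and at least one word much taller
    # than the page median (e.g. a large page number such as "( 9").
    total_chars = sum(len(w.text) for w in row)
    return (
        0 < len(row) <= max_words
        and total_chars <= max_chars
        and any(w.height > 1.8 * median_height for w in row)
    )


def is_scattered_debris(row: Row) -> bool:
    # No word in the row is longer than 2 characters.
    return bool(row) and all(len(w.text) <= 2 for w in row)


def filter_junk_rows(rows: List[Row]) -> List[Row]:
    """Return the rows that match none of the three junk rules."""
    heights = [w.height for row in rows for w in row]
    if not heights:
        return list(rows)
    med = median(heights)
    return [
        row for row in rows
        if not (
            is_low_conf_noise(row)
            or is_oversized_stub(row, med)
            or is_scattered_debris(row)
        )
    ]
```

Keeping each rule as a separate predicate makes it straightforward to adjust or disable one threshold without touching the others, which matches how the rules were introduced and tuned across separate commits.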

71 Commits