breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	46c8c28d34	fix: border strip pre-filter + 3-column detection for vocabulary tables The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly removed base words along with edge artifacts. Now uses a two-stage approach: 1. _filter_border_strip_words() pre-filters raw words BEFORE column detection, scanning from the page edge inward to find the FIRST significant gap (>30px) 2. Step 4e runs as fallback only when pre-filter didn't apply Session 4233 now correctly detects 3 columns (base word | oder | synonyms) instead of 2. Threshold raised from 15% to 20% to handle pages with many edge artifacts. All 4 ground-truth sessions pass regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 21:01:43 +01:00
Benjamin Admin	2acf8696bf	fix: correct border strip test data to avoid false internal gaps Content word_boxes in test used x-spacing (i%3)*100 which created internal gaps larger than the border-to-content gap. Changed to (i%2)*51 so content words overlap and the border gap remains dominant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 17:24:33 +01:00
Benjamin Admin	c0e1118870	feat: detect and remove page-border decoration strip artifacts (Step 4e) Textbooks with decorative alphabet strips along page edges produce OCR artifacts (scattered colored letters at x<150 while real content starts at x>=179). Step 4e detects a significant x-gap (>30px) between a small cluster (<15% of total word_boxes) near the page edge and the main content, then removes the border-strip word_boxes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 17:20:45 +01:00
Benjamin Admin	f31a7175a2	fix: normalize word_box order to reading order for frontend display (Step 5j) The frontend renders colored cells from the word_boxes array order, not from cell.text. After post-processing steps (5i bullet removal etc), word_boxes could remain in their original insertion order instead of left-to-right reading order. Step 5j now explicitly sorts word_boxes using _group_words_into_lines before the result is built. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 19:21:37 +01:00
Benjamin Admin	2c63beff04	Fix bullet overlap disambiguation + raise red threshold to 90 Step 5i: For word_boxes with >90% x-overlap and different text, use IPA dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not). Red threshold raised from 80 to 90 to catch remaining scanner artifacts like "tight" and "5" that were still misclassified as red. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:21:00 +01:00
Benjamin Admin	82433b4bad	Step 5i: Remove blue bullet/artifact and overlapping duplicate word_boxes Dictionary pages have small blue square bullets before entries that OCR reads as text artifacts. Three detection rules: a) Tiny blue symbols (area < 150, conf < 85): catches ©, e, * etc. b) X-overlapping word_boxes (>40%): remove lower confidence one c) Duplicate blue text with gap < 6px: remove one copy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:17:07 +01:00
Benjamin Admin	d889a6959e	Fix red false-positive in color detection for scanned black text Scanner artifacts on black text produce slight warm tint (hue ~0, sat ~60) that was misclassified as red. Now requires median_sat >= 80 specifically for red classification, since genuine red text always has high saturation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 17:18:44 +01:00
Benjamin Admin	04092a0a66	Fix Step 5h: reject grammar patterns in slash-IPA, convert trailing variants - Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up) - Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation) - Validate standalone /ipa/ at start against same reject pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:40:28 +01:00
Benjamin Admin	7fafd297e7	Step 5h: convert slash-delimited IPA to bracket notation with dict lookup Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/). Step 5h detects these patterns, looks up the headword in the IPA dictionary for proper Unicode IPA, and falls back to OCR text when not found. Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:36:08 +01:00
Benjamin Admin	7ac09b5941	Filter pipe-character word_boxes from OCR column divider artifacts Step 4d removes "|" and "||" word_boxes that OCR produces when reading physical vertical divider lines between columns. Also strips stray pipe chars from cell text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:09:50 +01:00
Benjamin Admin	ef5aed6a98	Preserve grammar annotations (pl), (no pl) and skip articles in IPA Two fixes: 1. Add pl, sg, no, also, ae, be etc. to _GRAMMAR_BRACKET_WORDS so annotations like (pl) and (no pl) are not replaced with IPA. 2. Skip articles (the, a, an) in fix_ipa_continuation_cell; they never get IPA in vocabulary books. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:42:44 +01:00
Benjamin Admin	a579c31ddb	Fix IPA continuation: skip words with inline IPA, recover emptied cells Three fixes: 1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly. 2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → ""). 3. Added 2 tests for inline IPA behavior (35 total). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:31:54 +01:00
Benjamin Admin	bc5ab29c06	Fix false positive: exclude first/last rows from single-cell heading detection Page numbers like "two hundred and twelve" in the last row were falsely detected as headings. Now first and last non-header rows are excluded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:06:05 +01:00
Benjamin Admin	7c5d95b858	Fix heading col_index + detect black single-cell headings like "Theme" - Color headings now preserve actual starting col_index instead of hardcoded 0 - New _detect_heading_rows_by_single_cell: detects rows with only 1 content cell (excl. page_ref) as headings; catches black headings like "Theme" that have normal color/height but are alone in their row - Runs after Step 5d (IPA continuation) to avoid false positives - 5 new tests (32 total) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:00:06 +01:00
Benjamin Admin	92a7b85c2d	Fix IPA continuation: only process fully-bracketed cells, keep phrasal verb particles Two fixes: 1. Step 5d now only treats cells as continuation when text is entirely inside brackets (e.g. "[n, nn]"). Cells with headwords outside brackets (e.g. "employee [im'ploi:]") are no longer overwritten. 2. fix_ipa_continuation_cell no longer skips grammar words like "down"; they are part of the headword in phrasal verbs like "close sth. down". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 00:43:51 +01:00
Benjamin Admin	3c7fc43f43	Fix test expectation: valid IPA in brackets also triggers detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:30:24 +01:00
Benjamin Admin	6bfa9eed86	Fix garbled IPA detection for bracket-notation like [n, nn] and [1uedtX,1] - Detect bracketed text without real IPA symbols as garbled OCR phonetics - Allow IPA continuation fix even when other columns have content (for rows where EN cell is clearly garbled bracketed IPA) - Strip parenthetical grammar annotations like (no pl) from headword before IPA lookup in fix_ipa_continuation_cell Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:28:00 +01:00
Benjamin Admin	7750b2a05f	Fix ghost filter for borderless boxes + remove oversized graphic artifacts 1. Skip ghost filtering for boxes with border_thickness=0 (images/graphics have no border lines to produce OCR artifacts like |, I) 2. Remove individual word_boxes with height > 3x zone median (OCR from graphics like a huge "N" from a map image below text) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 23:04:00 +01:00
Benjamin Admin	e3395ae8cf	Fix overlay word leak, ghost filter false positive, merged zone header 1. Filter words inside image_overlays (removes OCR from images) 2. Ghost filter: only remove single-char border artifacts, not multi-char like (= which is real content 3. Skip first-row header detection for zones with image_overlays (merged geometry creates artificial gaps) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 13:56:04 +01:00
Benjamin Admin	df30d4eae3	Add zone merging across images + heading detection by color/height Zone merging: content zones separated by box zones (images) are merged into a single zone with image_overlays, so split tables reconnect. Heading detection: after color annotation, rows where all words are non-black and taller than 1.2x median are merged into spanning heading cells. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 12:22:11 +01:00
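
The border-strip pre-filter described in 46c8c28d34 (scan inward from the page edge, stop at the FIRST x-gap above 30px, drop the edge cluster only if it is small) could look roughly like the sketch below. This is not the repository's `_filter_border_strip_words()`: the function name `filter_left_border_strip`, the `left`/`width` keys on each word dict, and the restriction to the left edge are assumptions made for illustration. Only the 30px gap and the 20% ratio come from the commit messages.

```python
from typing import Dict, List

GAP_PX = 30             # minimum x-gap treated as significant (from the commit message)
MAX_STRIP_RATIO = 0.20  # strip may hold at most 20% of all words (from the commit message)


def filter_left_border_strip(words: List[Dict]) -> List[Dict]:
    """Drop a small word cluster at the left page edge, if one exists.

    Hypothetical word shape: {"text": str, "left": int, "width": int}.
    """
    if len(words) < 2:
        return words

    ordered = sorted(words, key=lambda w: w["left"])
    # Right-most x reached by the words scanned so far.
    running_right = ordered[0]["left"] + ordered[0]["width"]

    for i in range(1, len(ordered)):
        gap = ordered[i]["left"] - running_right
        if gap > GAP_PX:
            # First significant gap found: ordered[:i] is the candidate strip.
            if i / len(ordered) < MAX_STRIP_RATIO:
                strip_ids = {id(w) for w in ordered[:i]}
                return [w for w in words if id(w) not in strip_ids]
            return words  # too many words left of the gap: not a border strip
        running_right = max(running_right, ordered[i]["left"] + ordered[i]["width"])

    return words  # no significant gap at all
```

Stopping at the first gap rather than the largest one is the point of the fix: a wide gap between real columns further right never gets considered, so base words are not removed together with the edge artifacts.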

20 Commits
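
The slash-to-bracket IPA conversion from Step 5h (7fafd297e7, 04092a0a66) can be sketched as follows. This is a minimal illustration under assumptions, not the pipeline's code: `slash_ipa_to_brackets`, the two regexes, and the plain `dict` used as the IPA dictionary are all hypothetical names. It covers the first pass only (headword followed by `/ipa/`, with the reject rule for spaces, parentheses, and commas); the second pass for a trailing `/ipa2/` after `[ipa1]` is left out.

```python
import re
from typing import Dict

# A headword followed by /.../ ; the slash content may not contain another slash.
_SLASH_IPA = re.compile(r"(\b[A-Za-z][A-Za-z'-]*)\s*/([^/]+)/")
# Grammar patterns such as "sb/sth up" contain spaces, parentheses, or commas.
_REJECT = re.compile(r"[\s(),]")


def slash_ipa_to_brackets(text: str, ipa_dict: Dict[str, str]) -> str:
    """Rewrite 'word /ipa/' as 'word [ipa]', preferring the dictionary's IPA."""

    def repl(match: "re.Match[str]") -> str:
        headword, raw = match.group(1), match.group(2)
        if _REJECT.search(raw):
            return match.group(0)  # looks like a grammar pattern: leave untouched
        # Fall back to the OCR text when the headword is not in the dictionary.
        ipa = ipa_dict.get(headword.lower(), raw)
        return f"{headword} [{ipa}]"

    return _SLASH_IPA.sub(repl, text)
```

For example, `slash_ipa_to_brackets("tiger /'taiga/", {"tiger": "ˈtaɪɡə"})` yields `tiger [ˈtaɪɡə]`, while with an empty dictionary the OCR text is kept inside the brackets.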