breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	2a21127f01	fix(ocr-pipeline): improve page crop spine detection and cell assignment 1. page_crop: Score all dark runs by center-proximity × darkness × narrowness instead of picking the widest. Fixes ad810209 where a wide dark area at 35% was chosen over the actual spine at 50%. 2. cv_words_first: Replace x-center-only word→column assignment with overlap-based three-pass strategy (overlap → midpoint-range → nearest). Fixes truncated German translations like "Schal" instead of "Schal - die Schals" in session 079cd0d9. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 09:23:30 +01:00
Benjamin Admin	aae8a96aa2	fix: sort word_boxes in reading order (Y-grouped, then X-sorted) Words on the same visual line can have slightly different top values (1-6px). Sorting by (top, left) produced wrong word order in the frontend display. Now uses _group_words_into_lines to group by Y proximity first, then sort by X within each line. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:41:30 +01:00
Benjamin Admin	febd0a2f84	fix: border ghost filter + row overlap fix for box zones 1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts like | sitting on box borders before row/column clustering. The tall | (h=55) was inflating row 0's y_max, causing row overlap. 2. Fix _assign_word_to_row() to prefer closest y_center when rows overlap, instead of always returning the first matching row. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 09:54:50 +01:00
Benjamin Admin	0340204c1f	feat: box-aware column detection (exclude box content from global columns) - Enrich column geometries with original full-page words (box-filtered) so _detect_sub_columns() finds narrow sub-columns across box boundaries - Add inline marker guard: bullet points (1., 2., •) are not split into sub-columns (minimum gap check: 1.2× word height or 20px) - Add box_rects parameter to build_grid_from_words(): words inside boxes are excluded from X-gap column clustering - Pass box rects from zones to words_first grid builder - Add 9 tests for box-aware column detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 18:42:46 +01:00
Benjamin Admin	1f527fcd49	fix: split PaddleOCR boxes at leading ! for overlay word positioning When PaddleOCR returns "!Betonung" as a single word box, the overlay positions text starting at the "!" instead of the actual word. Split such boxes into ["!", "Betonung"] with proportional position splitting, matching the existing IPA bracket splitting logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 17:46:17 +01:00
Benjamin Admin	3e65b14b83	fix: split PaddleOCR boxes at IPA brackets for overlay positioning PaddleOCR returns "badge[bxd3]" without space, but the IPA fixer produces "badge [bˈædʒ]" with space, creating a token count mismatch between cell.text and word_boxes. Now also split at "[" boundaries so each IPA bracket gets its own sub-box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:08:17 +01:00
Benjamin Admin	40ac593d28	fix: split PaddleOCR phrase boxes into per-word boxes for overlay slide PaddleOCR returns phrase-level bounding boxes (e.g. "competition [kompa'tifn]" as one box) but the overlay slide mechanism expects one box per word for accurate positioning. Multi-word boxes are now split proportionally by character count with small gaps between words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:00:06 +01:00
Benjamin Admin	ea69239e06	fix: word_boxes in words_first use absolute pixels (consistent with v2 grid) words_first was storing word_boxes in percent coordinates while cv_cell_grid.py uses absolute pixel coordinates. The overlay slide mechanism divides by imgW to get percentages, so percent-in-percent caused positions near zero. Now both grid builders use the same format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:04:04 +01:00
Benjamin Admin	ced5bb3dd3	feat: Words-First Grid Builder (bottom-up alternative to cell_grid_v2) New algorithm in cv_words_first.py: clusters Tesseract word_boxes directly into columns (X-gap) and rows (Y-proximity) and builds cells at their intersections. No separate column/row detection needed. - cv_words_first.py: _cluster_columns, _cluster_rows, _build_cells, build_grid_from_words - ocr_pipeline_api.py: grid_method parameter (v2|words_first) in the /words endpoint - StepWordRecognition.tsx: dropdown toggle for the grid method - OCR-Pipeline.md: docs v4.3.0 with the Words-First algorithm - 15 unit tests for cv_words_first Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 06:46:05 +01:00

9 Commits
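
The reading-order fix in aae8a96aa2 (group words by Y proximity, then sort by X within each line) can be illustrated with a minimal sketch. This is not the repository's `_group_words_into_lines`, whose code is not shown in this log. The function name `sort_reading_order`, the word dict shape (`text`, `left`, `top`), and the 8 px tolerance are assumptions chosen for illustration.

```python
from typing import Any, Dict, List

# Hypothetical word shape: {"text": str, "left": int, "top": int}
Word = Dict[str, Any]


def sort_reading_order(words: List[Word], y_tolerance: int = 8) -> List[Word]:
    """Group words into visual lines by Y proximity, then sort each line by X.

    y_tolerance is an assumed value; the commit only states that tops on the
    same visual line differ by roughly 1-6 px.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Compare against the first word of the current line so that small
        # offsets cannot chain into an ever-growing line.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    ordered: List[Word] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda w: w["left"]))
    return ordered


words = [
    {"text": "die", "left": 100, "top": 101},
    {"text": "Schal", "left": 10, "top": 100},
    {"text": "Schals", "left": 160, "top": 104},
    {"text": "-", "left": 60, "top": 106},
    {"text": "Hut", "left": 10, "top": 140},
]
reading_order = [w["text"] for w in sort_reading_order(words)]
naive_order = [w["text"] for w in sorted(words, key=lambda w: (w["top"], w["left"]))]
```

With this sample, a plain `(top, left)` sort places "-" after "Schals" because its top value is a few pixels larger, while the grouped sort keeps the left-to-right order within the line.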