Commit Graph

12 Commits

Author SHA1 Message Date
Benjamin Admin
92a52a3199 fix: apply column union when total_cols >= max (not just >)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Zone 4 found 4 columns incl. page_ref, union also yields 4.
The strict > check prevented union from applying to Zone 0.
Changed to >= so all content zones get the merged column set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 00:14:59 +01:00
Benjamin Admin
427fecdce0 fix: union column detection across all content zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Instead of propagating columns from the largest content zone only
(which missed narrow columns like page_ref), collect column split
points from ALL content zones and merge them. This way a column
found in any zone (e.g. page_ref at x=132 in the zone below boxes)
is available everywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 23:02:33 +01:00
Benjamin Admin
9fb3229270 fix: lower tertiary gap threshold for narrow margin column detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref
columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:56:03 +01:00
Benjamin Admin
91625a2646 fix: add tertiary tier for narrow margin columns (page refs, markers)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
Page references (p.55, p.57) and marker columns (!) appear in very few
rows (< 12% coverage) but sit at the far left/right margin with a clear
gap to the main content.  Add a third detection tier that catches these
narrow margin columns when they have >= 2 distinct rows and are within
15% of the content edge with >= 40px gap to the nearest main column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:40:40 +01:00
Benjamin Admin
02ae6249ca fix: propagate columns from largest content zone instead of global detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 21s
Global column detection diluted narrow sub-columns (page refs, markers)
because they appeared in too few rows relative to the total.  Instead,
detect columns per zone independently, then propagate the best columns
(from the content zone with the most words) to smaller content zones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:30:15 +01:00
Benjamin Admin
cf995f2d52 fix: global column detection across content zones in Kombi grid builder
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m3s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 26s
Content zones (above/between/below boxes) now share the same column
structure: columns are detected once from ALL content-zone words, then
applied to each content zone.  Box zones still detect columns independently.

This fixes the issue where narrow columns (page refs like p.55) were not
detected in small content zones above boxes, even though the same column
existed in the larger content zone below the box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 22:04:17 +01:00
Benjamin Admin
bcd55e12d7 fix: run color annotation on final cell word_boxes, not pre-grid words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
_build_cells() creates new word_box dicts, so color fields set before
grid building were lost. Now detect_word_colors() runs after cells
are built, on the final word_boxes. Recovery still runs before grid
building so recovered words participate in column/row detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:53:04 +01:00
Benjamin Admin
2bd63ec402 feat: add color detection for OCR word boxes
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-nodejs-website (push) Has been cancelled
CI / test-python-agent-core (push) Has been cancelled
New cv_color_detect.py module:
- detect_word_colors(): annotates existing words with text color (HSV analysis)
- recover_colored_text(): finds colored text regions missed by standard OCR
  (e.g. red ! markers) using HSV masks + contour detection

Integrated into build-grid: words get color/color_name fields, recovered
colored regions are merged into the word list before grid building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:50:09 +01:00
Benjamin Admin
39a4d8564c chore: add per-cluster debug logging for column alignment detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:18:28 +01:00
Benjamin Admin
1162eac7b4 fix: use group-start positions for column detection, not all word left-edges
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
Only cluster left-edges of words that begin a new group within their row
(first word or preceded by a large gap). This filters out mid-phrase
word positions (IPA transcriptions, second words in multi-word entries)
that were causing too many false columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:10:29 +01:00
Benjamin Admin
28352f5bab feat: replace gap-based column detection with left-edge alignment algorithm
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 17s
Column detection now clusters word left-edges by X-proximity and filters
by row coverage (Y-coverage), matching the proven approach from cv_layout.py
but using precise OCR word positions instead of ink-based estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 00:03:58 +01:00
Benjamin Admin
c3f1547e32 feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Backend: new grid_editor_api.py with build-grid endpoint that detects
bordered boxes, splits page into zones, clusters columns/rows per zone
from Kombi word positions. New DB column grid_editor_result JSONB.

Frontend: GridEditor component with editable HTML tables per zone,
column bold toggle, header row toggle, undo/redo, keyboard navigation
(Tab/Enter/Arrow), image overlay verification, and save/load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:41:03 +01:00