12 Commits

Author SHA1 Message Date
Benjamin Admin
65f4ce1947 feat: ImageLayoutEditor, arrow-key nav, multi-select bold, wider columns
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
- New ImageLayoutEditor: SVG overlay on original scan with draggable
  column dividers, horizontal guidelines (margins/header/footer),
  double-click to add columns, x-button to delete
- GridTable: MIN_COL_WIDTH 40→80px for better readability
- Arrow up/down keys navigate between rows in the grid editor
- Ctrl+Click for multi-cell selection, Ctrl+B to toggle bold on selection
- getAdjacentCell works for cells that don't exist yet (new rows/cols)
- deleteColumn now merges x-boundaries correctly
- Session restore fix: grid_editor_result/structure_result in session GET
- Footer row 3-state cycle, auto-create cells for empty footer rows
- Grid save/build/GT-mark now advance current_step=11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 07:45:39 +01:00
Benjamin Admin
0340204c1f feat: box-aware column detection — exclude box content from global columns
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m4s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
- Enrich column geometries with original full-page words (box-filtered)
  so _detect_sub_columns() finds narrow sub-columns across box boundaries
- Add inline marker guard: bullet points (1., 2., •) are not split into
  sub-columns (minimum gap check: 1.2× word height or 20px)
- Add box_rects parameter to build_grid_from_words() — words inside boxes
  are excluded from X-gap column clustering
- Pass box rects from zones to words_first grid builder
- Add 9 tests for box-aware column detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 18:42:46 +01:00
Benjamin Admin
e3ee1de790 Revert "fix: Zeilen-Regularisierung im Overlay ueberspringen (generisch fuer gemischte Inhalte)"
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m2s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 24s
This reverts commit b91f799ccf.
2026-03-11 08:44:07 +01:00
Benjamin Admin
b91f799ccf fix: Zeilen-Regularisierung im Overlay ueberspringen (generisch fuer gemischte Inhalte)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 49s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 26s
Seiten mit Info-Boxen (andere Zeilenhoehe) fuehren dazu, dass _regularize_row_grid
die Zeilenpositionen verzerrt. Neuer skip_regularize Parameter nutzt stattdessen
die gap-basierten Zeilen, die der tatsaechlichen Seitengeometrie folgen.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 08:29:06 +01:00
Benjamin Admin
2716495250 fix: Sub-Session Zeilenerkennung — Tesseract+inv im Spalten-Schritt cachen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 20s
Bisher wurden _word_dicts, _inv und _content_bounds fuer Sub-Sessions
nicht gecacht, sodass detect_rows auf detect_column_geometry() zurueckfiel.
Das konnte bei kleinen Box-Bildern mit <5 Woertern fehlschlagen.

Jetzt laeuft Tesseract + Binarisierung direkt im Pseudo-Spalten-Block,
und die Intermediates werden gecacht. Zusaetzlich ausfuehrliche Kommentare
zur Zeilenerkennung (detect_row_geometry, _regularize_row_grid).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 08:43:26 +01:00
Benjamin Admin
4610137ecc fix: Box-Bereiche aus Bild entfernen statt pro Zone separat Spalten erkennen
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Content-Streifen oberhalb/unterhalb von Boxen werden zu einem Bild zusammengefügt,
Spaltenerkennung läuft einmal auf dem kombinierten Bild. Entfernt Step 5c
(suspicion-based gap alignment), da der neue Ansatz das Problem an der Wurzel löst.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 17:03:05 +01:00
Benjamin Admin
fb46450802 fix: Alignment-Validierung nur fuer verdaechtige Gaps (>2x Median-Breite)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 20s
Vorher wurden alle internen Gaps geprueft, was echte Spaltentrennungen
(EN→DE) faelschlicherweise entfernte. Jetzt werden nur Gaps geprueft,
die eine unverhaeltnismaessig breite rechte Spalte erzeugen wuerden
(>2x Median-Spaltenbreite). Schwelle auf 15% gesenkt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 16:27:14 +01:00
Benjamin Admin
11126c4436 fix: UnboundLocalError edge_tolerance in Step 5c
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Variable wurde vor ihrer Definition in Step 7 referenziert.
Eigene margin_thresh Variable fuer Step 5c eingefuehrt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 16:18:47 +01:00
Benjamin Admin
7a0ded7562 fix: Left-Edge-Alignment-Validierung fuer Spalten-Gaps
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m7s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 19s
Interiore Gaps werden jetzt geprueft: rechts des Gaps muessen
mindestens 25% der Woerter eine gemeinsame linke Kante teilen.
Verhindert falsche Spaltentrennungen innerhalb breiter Spalten
(z.B. Example-Spalte mit kurzen und langen Eintraegen).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 16:11:58 +01:00
Benjamin Admin
cf9dde9876 fix: _group_words_into_lines nach cv_ocr_engines.py verschieben
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m4s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 21s
Funktion war nur in cv_review.py definiert, wurde aber auch in
cv_ocr_engines.py und cv_layout.py benutzt — NameError zur Laufzeit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 15:24:56 +01:00
Benjamin Admin
7005b18561 feat: generische Box-Erkennung fuer zonenbasierte Spaltenerkennung
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
- Neue Datei cv_box_detect.py: 2-Stufen-Algorithmus (Linien + Farbe)
- DetectedBox/PageZone Dataclasses in cv_vocab_types.py
- detect_column_geometry_zoned() in cv_layout.py
- API-Endpoints erweitert: zones/boxes_detected im column_result
- Overlay-Funktionen zeichnen Box-Grenzen als gestrichelte Rechtecke
- Fix: numpy array or-Verknuepfung an 7 Stellen in ocr_pipeline_api.py
- 12 Unit-Tests fuer Box-Erkennung und Zone-Splitting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 15:06:23 +01:00
Benjamin Admin
9a5a35bff1 refactor: cv_vocab_pipeline.py in 6 Module aufteilen (8163 → 6 + Fassade)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Monolithische 8163-Zeilen-Datei aufgeteilt in fokussierte Module:
- cv_vocab_types.py (156 Z.): Dataklassen, Konstanten, IPA, Feature-Flags
- cv_preprocessing.py (1166 Z.): Bild-I/O, Orientierung, Deskew, Dewarp
- cv_layout.py (3036 Z.): Dokumenttyp, Spalten, Zeilen, Klassifikation
- cv_ocr_engines.py (1282 Z.): OCR-Engines, Vocab-Postprocessing, Text-Cleaning
- cv_cell_grid.py (1510 Z.): Cell-Grid v2+Legacy, Vocab-Konvertierung
- cv_review.py (1184 Z.): LLM/Spell Review, Pipeline-Orchestrierung

cv_vocab_pipeline.py ist jetzt eine Re-Export-Fassade (35 Z.) —
alle bestehenden Imports bleiben unveraendert.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 23:46:47 +01:00