fix: skip row regularization in overlay (generic for mixed content)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 49s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 26s

On pages with info boxes (different line height), _regularize_row_grid
distorts the row positions. The new skip_regularize parameter uses the
gap-based rows instead, which follow the actual page geometry.
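
The effect described above can be illustrated with a minimal sketch. It is not the code from this commit: `regularize`, `detect_rows`, and the `Row` tuple are hypothetical stand-ins, and the toy regularizer simply snaps rows to the median pitch, whereas the real `_regularize_row_grid` also handles section breaks. Only the `skip_regularize` flag name is taken from the commit.

```python
from statistics import median
from typing import List, Tuple

Row = Tuple[int, int]  # (top_y, bottom_y); hypothetical stand-in for RowGeometry


def regularize(rows: List[Row]) -> List[Row]:
    """Toy regularizer (assumption): snap all rows onto a median-pitch grid."""
    if len(rows) < 2:
        return list(rows)
    tops = [top for top, _ in rows]
    pitch = int(median(b - a for a, b in zip(tops, tops[1:])))
    height = int(median(bottom - top for top, bottom in rows))
    return [(tops[0] + i * pitch, tops[0] + i * pitch + height)
            for i in range(len(rows))]


def detect_rows(gap_rows: List[Row], skip_regularize: bool = False) -> List[Row]:
    """Return the gap-based rows unchanged when skip_regularize is set."""
    if skip_regularize:
        return list(gap_rows)
    return regularize(gap_rows)


# Three text lines, then an info box with a taller line, then text again.
gap_rows = [(0, 15), (20, 35), (40, 55), (70, 100), (110, 125)]
regularized = detect_rows(gap_rows)
faithful = detect_rows(gap_rows, skip_regularize=True)
```

In this toy input the regularized rows drift away from the info box (its row no longer starts at y=70), while `skip_regularize=True` returns the gap-based rows exactly as detected.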

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-11 08:29:06 +01:00
parent 2df2a01a8b
commit b91f799ccf
4 changed files with 23 additions and 9 deletions

@@ -1525,6 +1525,7 @@ def detect_row_geometry(
     word_dicts: List[Dict],
     left_x: int, right_x: int,
     top_y: int, bottom_y: int,
+    skip_regularize: bool = False,
 ) -> List['RowGeometry']:
     """Detect row geometry using horizontal whitespace-gap analysis.
@@ -1789,8 +1790,13 @@ def detect_row_geometry(
     # and evenly-spaced rows than the gap-based approach alone.
     # Also detects section breaks (headings, paragraphs) where the pitch
     # exceeds 1.8× the median, and handles each section independently.
-    rows = _regularize_row_grid(rows, word_dicts, left_x, right_x, top_y,
-                                content_w, content_h, inv)
+    #
+    # skip_regularize=True: Keep gap-based rows as-is. Useful for full-page
+    # overlay rendering where mixed content (info boxes, different line
+    # spacings) must preserve original geometry faithfully.
+    if not skip_regularize:
+        rows = _regularize_row_grid(rows, word_dicts, left_x, right_x, top_y,
+                                    content_w, content_h, inv)
     type_counts = {}
     for r in rows: