Fix IPA continuation: skip words with inline IPA, recover emptied cells

Three fixes:

1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly.
2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → "").
3. Added 2 tests for inline IPA behavior (35 total).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep footer rows in table, mark with is_footer + col_type=footer
2026-03-20 09:31:54 +01:00 · 2026-03-20 09:08:25 +01:00 · 2026-03-20 08:55:55 +01:00 · 2026-03-20 08:47:39 +01:00
3 changed files with 108 additions and 44 deletions
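The first fix in the commit message can be read as a small splitting step on the headword string. The sketch below is a standalone illustration of that idea, not the code from `cv_ocr_engines.py`: the helper names `split_inline_ipa_headword` and `uncovered_words` are hypothetical, and the real function additionally looks up IPA for each remaining word.

```python
import re
from typing import List, Optional, Set, Tuple


def split_inline_ipa_headword(headword: str) -> Tuple[Set[str], Optional[str], str]:
    """Hypothetical helper: return (covered_words, inline_ipa, tail).

    covered_words: lowercased words before the first "[...]" bracket.
    inline_ipa:    the last bracket, if nothing word-like follows it.
    tail:          text after the last bracket that may still need IPA.
    """
    covered: Set[str] = set()
    if not re.search(r"\[[^\]]*\]", headword):
        # No inline IPA at all: the whole headword needs IPA.
        return covered, None, headword

    first_open = headword.index("[")
    for w in headword[:first_open].split():
        clean = re.sub(r"[^a-zA-Z'-]", "", w).lower()
        if len(clean) >= 2:
            covered.add(clean)

    last_close = headword.rfind("]")
    tail = headword[last_close + 1:].strip()
    if not re.search(r"[a-zA-Z]{2,}", tail):
        # Bracket sits at the end: reuse the inline IPA as-is.
        last_open = headword.rfind("[")
        return covered, headword[last_open:last_close + 1], ""
    return covered, None, tail


def uncovered_words(headword: str) -> List[str]:
    """Hypothetical helper: words that would still need continuation IPA."""
    covered, inline_ipa, tail = split_inline_ipa_headword(headword)
    if inline_ipa is not None:
        return []
    words = [re.sub(r"[^a-zA-Z'-]", "", w) for w in re.split(r"[,\s]+", tail)]
    return [w for w in words if len(w) >= 2 and w.lower() not in covered]


print(uncovered_words("beat [bˈiːt] , beat, beaten"))
print(split_inline_ipa_headword("the Highlands [ˈhaɪləndz]")[1])
```

Under these assumptions, only "beaten" is left for lookup in the first headword, and the second headword yields its own bracket unchanged, so the article "the" never receives IPA.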
--- a/klausur-service/backend/cv_ocr_engines.py
+++ b/klausur-service/backend/cv_ocr_engines.py
@@ -1250,6 +1250,32 @@ def fix_ipa_continuation_cell(
    if not IPA_AVAILABLE or not garbled_text or not headword_text:
        return garbled_text

+    # If headword already has inline IPA like "beat [bˈiːt] , beat, beaten",
+    # only generate continuation IPA for words NOT already covered.
+    covered_words: set = set()
+    has_inline_ipa = bool(re.search(r'\[[^\]]*\]', headword_text))
+    if has_inline_ipa:
+        # Words before the first bracket already have their IPA shown
+        first_bracket = headword_text.index('[')
+        pre_bracket = headword_text[:first_bracket].strip()
+        for w in pre_bracket.split():
+            clean = re.sub(r'[^a-zA-Z\'-]', '', w).lower()
+            if clean and len(clean) >= 2:
+                covered_words.add(clean)
+
+        last_bracket_end = headword_text.rfind(']')
+        tail = headword_text[last_bracket_end + 1:].strip()
+
+        if not tail or not re.search(r'[a-zA-Z]{2,}', tail):
+            # Bracket is at the end (e.g. "the Highlands [ˈhaɪləndz]")
+            # — return the inline IPA directly (continuation duplicates it)
+            last_bracket_start = headword_text.rfind('[')
+            inline_ipa = headword_text[last_bracket_start:last_bracket_end + 1]
+            return inline_ipa
+
+        # Only the tail words need continuation IPA
+        headword_text = tail
+
    # Strip existing IPA brackets and parenthetical grammar annotations
    # like "(no pl)", "(sth)", "(sb)" from headword text
    clean_hw = re.sub(r'\[[^\]]*\]', '', headword_text)
@@ -1270,6 +1296,7 @@ def fix_ipa_continuation_cell(
    # Do NOT skip grammar words here — they are integral parts of the
    # headword (e.g. "close down", "the United Kingdom").  Grammar
    # annotations like "(sth)", "(no pl)" are already stripped above.
+    # Skip words that already have inline IPA in the headword row.
    ipa_parts: List[str] = []
    for part in parts:
        # A part may be multi-word like "secondary school"
@@ -1279,6 +1306,8 @@ def fix_ipa_continuation_cell(
            clean_w = re.sub(r'[^a-zA-Z\'-]', '', w)
            if not clean_w or len(clean_w) < 2:
                continue
+            if covered_words and clean_w.lower() in covered_words:
+                continue  # Already has IPA inline in the headword
            ipa = _lookup_ipa(clean_w, pronunciation)
            if ipa:
                word_ipas.append(ipa)
--- a/klausur-service/backend/grid_editor_api.py
+++ b/klausur-service/backend/grid_editor_api.py
@@ -1798,7 +1798,13 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
                        continue
                    cell_text = (cell.get("text") or "").strip()
                    if not cell_text:
-                        continue
+                        # Step 5c may have emptied garbled IPA cells like
+                        # "[n, nn]" — recover text from word_boxes.
+                        wb_texts = [w.get("text", "")
+                                    for w in cell.get("word_boxes", [])]
+                        cell_text = " ".join(wb_texts).strip()
+                        if not cell_text:
+                            continue

                    is_bracketed = (
                        cell_text.startswith('[') and cell_text.endswith(']')
@@ -1877,11 +1883,15 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
            if stripped and stripped != text:
                cell["text"] = stripped

-    # 5g. Extract page_ref rows and footer rows from content zones.
-    # Page references (column_1 cells like "p.70") and footer lines
-    # (e.g. "two hundred and twelve" = page number) should not be part
-    # of the vocabulary table.  Move them to zone-level metadata so the
-    # frontend can display them separately.
+    # 5g. Extract page_ref cells and footer rows from content zones.
+    # Page references (column_1 cells like "p.70") sit in rows that
+    # also contain vocabulary — extract them as zone metadata without
+    # removing the row.  Footer lines (e.g. "two hundred and twelve"
+    # = page number at bottom) are standalone rows that stay in the
+    # table but are tagged (is_footer, col_type=footer) for the frontend.
+    _REAL_IPA_CHARS_SET = set("ˈˌəɪɛɒʊʌæɑɔʃʒθðŋ")
+    # Page-ref pattern: "p.70", "P.70", ",.65" (garbled "p"), or bare "70"
+    _PAGE_REF_RE = re.compile(r'^[pP,]?\s*\.?\s*\d+$')
    for z in zones_data:
        if z.get("zone_type") != "content":
            continue
@@ -1890,53 +1900,61 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
        if not rows:
            continue

+        # Extract column_1 cells that look like page references
        page_refs = []
-        footer_rows = []
-
-        # Detect page_ref rows: rows where the ONLY cell is column_1
-        # (just a page number like "p.65", "p.70")
-        for row in rows:
-            if row.get("is_header"):
+        page_ref_cell_ids = set()
+        for cell in cells:
+            if cell.get("col_type") != "column_1":
                continue
-            ri = row["index"]
-            row_cells = [c for c in cells if c.get("row_index") == ri]
-            if (len(row_cells) == 1
-                    and row_cells[0].get("col_type") == "column_1"):
-                page_refs.append({
-                    "row_index": ri,
-                    "text": (row_cells[0].get("text") or "").strip(),
-                    "bbox_pct": row_cells[0].get("bbox_pct", {}),
-                })
+            text = (cell.get("text") or "").strip()
+            if not text:
+                continue
+            if not _PAGE_REF_RE.match(text):
+                continue
+            page_refs.append({
+                "row_index": cell.get("row_index"),
+                "text": text,
+                "bbox_pct": cell.get("bbox_pct", {}),
+            })
+            page_ref_cell_ids.add(cell.get("cell_id"))

-        # Detect footer: last non-header row if it has only 1 content
-        # cell and no column_1 page_ref (standalone text like page num)
+        # Remove page_ref cells from the table (but keep their rows)
+        if page_ref_cell_ids:
+            z["cells"] = [c for c in z["cells"]
+                          if c.get("cell_id") not in page_ref_cell_ids]
+
+        # Detect footer: last non-header row if it has only 1 cell
+        # and the text is NOT IPA (no real IPA Unicode symbols).
+        # This catches page numbers like "two hundred and twelve".
+        footer_rows = []
        non_header_rows = [r for r in rows if not r.get("is_header")]
        if non_header_rows:
            last_row = non_header_rows[-1]
            last_ri = last_row["index"]
-            last_cells = [c for c in cells if c.get("row_index") == last_ri]
-            content_last = [
-                c for c in last_cells
-                if c.get("col_type", "").startswith("column_")
-                and c.get("col_type") != "column_1"
-            ]
-            if len(content_last) == 1 and len(last_cells) == 1:
-                footer_rows.append({
-                    "row_index": last_ri,
-                    "text": (content_last[0].get("text") or "").strip(),
-                    "bbox_pct": content_last[0].get("bbox_pct", {}),
-                })
+            last_cells = [c for c in z["cells"]
+                          if c.get("row_index") == last_ri]
+            if len(last_cells) == 1:
+                text = (last_cells[0].get("text") or "").strip()
+                # Not IPA (no real IPA symbols) and not a heading
+                has_real_ipa = any(c in _REAL_IPA_CHARS_SET for c in text)
+                if text and not has_real_ipa and last_cells[0].get("col_type") != "heading":
+                    footer_rows.append({
+                        "row_index": last_ri,
+                        "text": text,
+                        "bbox_pct": last_cells[0].get("bbox_pct", {}),
+                    })

-        # Remove page_ref and footer cells/rows from the table
-        remove_ris = set()
-        for pr in page_refs:
-            remove_ris.add(pr["row_index"])
-        for fr in footer_rows:
-            remove_ris.add(fr["row_index"])
+        # Mark footer rows (keep in table, just tag for frontend)
+        if footer_rows:
+            footer_ris = {fr["row_index"] for fr in footer_rows}
+            for r in z["rows"]:
+                if r["index"] in footer_ris:
+                    r["is_footer"] = True
+            for c in z["cells"]:
+                if c.get("row_index") in footer_ris:
+                    c["col_type"] = "footer"

-        if remove_ris:
-            z["cells"] = [c for c in cells if c.get("row_index") not in remove_ris]
-            z["rows"] = [r for r in rows if r["index"] not in remove_ris]
+        if page_refs or footer_rows:
            logger.info(
                "Extracted %d page_refs + %d footer rows from zone %d",
                len(page_refs), len(footer_rows), z.get("zone_index", 0),
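The page-reference extraction in Step 5g above can be tried in isolation. This is a minimal sketch that reuses the regular expression from the diff; the function name `extract_page_refs` and the sample cells are assumptions made for illustration. Unlike the diff, which filters by `cell_id`, the sketch separates the cells in a single pass.

```python
import re
from typing import Dict, List, Tuple

# Same pattern as in the diff: "p.70", "P.70", ",.65" (garbled "p"), or bare "70".
PAGE_REF_RE = re.compile(r"^[pP,]?\s*\.?\s*\d+$")


def extract_page_refs(cells: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical helper: split cells into (page_refs, remaining_cells).

    Only column_1 cells whose text matches the page-ref pattern are
    extracted; every other cell, and every row, stays in the table.
    """
    page_refs: List[Dict] = []
    kept: List[Dict] = []
    for cell in cells:
        text = (cell.get("text") or "").strip()
        if cell.get("col_type") == "column_1" and text and PAGE_REF_RE.match(text):
            page_refs.append({
                "row_index": cell.get("row_index"),
                "text": text,
                "bbox_pct": cell.get("bbox_pct", {}),
            })
        else:
            kept.append(cell)
    return page_refs, kept


sample = [
    {"col_type": "column_1", "text": "p.70", "row_index": 3},
    {"col_type": "column_1", "text": "to", "row_index": 4},    # infinitive marker
    {"col_type": "column_2", "text": "70", "row_index": 4},    # not column_1
]
refs, remaining = extract_page_refs(sample)
print([r["text"] for r in refs], len(remaining))
```

In this sample the infinitive marker "to" stays in the table because it does not match the pattern, and the bare "70" stays because it is not in `column_1`.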
--- a/klausur-service/backend/tests/test_grid_editor_api.py
+++ b/klausur-service/backend/tests/test_grid_editor_api.py
@@ -510,6 +510,23 @@ class TestGarbledIpaDetection:
        assert "klˈəʊs" in fixed   # close IPA
        assert "dˈaʊn" in fixed    # down IPA — must NOT be skipped

+    def test_continuation_skips_words_with_inline_ipa(self):
+        """'beat [bˈiːt] , beat, beaten' → continuation only for 'beaten'."""
+        fixed = fix_ipa_continuation_cell(
+            "[bi:tan]", "beat [bˈiːt] , beat, beaten", pronunciation="british",
+        )
+        # Should only have IPA for "beaten", NOT for "beat" (already inline)
+        assert "bˈiːtən" in fixed
+        assert fixed.count("bˈiːt") == 0 or fixed == "[bˈiːtən]"
+
+    def test_continuation_bracket_at_end_returns_inline(self):
+        """'the Highlands [ˈhaɪləndz]' → return inline IPA, not IPA for 'the'."""
+        fixed = fix_ipa_continuation_cell(
+            "'hailandz", "the Highlands [ˈhaɪləndz]", pronunciation="british",
+        )
+        assert fixed == "[ˈhaɪləndz]"
+        assert "ðə" not in fixed  # "the" must NOT get IPA
+
    def test_headword_with_brackets_not_continuation(self):
        """'employee [im'ploi:]' has a headword outside brackets → not garbled.
Author	SHA1	Message	Date
Benjamin Admin	a579c31ddb	Fix IPA continuation: skip words with inline IPA, recover emptied cells [CI: some checks failed; test-python-klausur failing after 1m46s] Three fixes: 1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly. 2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → ""). 3. Added 2 tests for inline IPA behavior (35 total). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:31:54 +01:00
Benjamin Admin	0f9c0d2ad0	Keep footer rows in table, mark with is_footer + col_type=footer Footer rows like "two hundred and twelve" are no longer removed from the grid. Instead they stay in cells/rows and get tagged so the frontend can render them differently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:08:25 +01:00
Benjamin Admin	278067fe20	Fix page_ref extraction: only extract cells matching page-ref pattern Column_1 cells like "to" (infinitive markers) were incorrectly extracted as page_refs. Now only cells matching p.70, ,.65, or bare digits are treated as page references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:55:55 +01:00
Benjamin Admin	d76fb2a9c8	Fix page_ref + footer extraction: extract individual cells, skip IPA footers Step 5g now extracts column_1 cells individually as page_refs (instead of requiring the whole row to be column_1-only), and footer detection skips rows containing real IPA Unicode symbols to avoid false positives on IPA continuation rows like [sˈiː] – [sˈɔː] – [sˈiːn]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 08:47:39 +01:00
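The footer handling described in the commits above (detect a single-cell last row without real IPA symbols, then tag it instead of removing it) can be sketched as follows. The function name `tag_footer_row` and the sample rows and cells are hypothetical; the IPA character set is copied from the diff.

```python
from typing import Dict, List

# Characters treated as "real IPA" in the diff; a row containing any of
# them is assumed to be an IPA continuation row, not a footer.
REAL_IPA_CHARS = set("ˈˌəɪɛɒʊʌæɑɔʃʒθðŋ")


def tag_footer_row(rows: List[Dict], cells: List[Dict]) -> bool:
    """Hypothetical helper: tag the last non-header row as footer in place.

    Returns True if a footer was detected and tagged.
    """
    body_rows = [r for r in rows if not r.get("is_header")]
    if not body_rows:
        return False
    last_row = body_rows[-1]
    last_cells = [c for c in cells if c.get("row_index") == last_row["index"]]
    if len(last_cells) != 1:
        return False
    cell = last_cells[0]
    text = (cell.get("text") or "").strip()
    if not text or cell.get("col_type") == "heading":
        return False
    if any(ch in REAL_IPA_CHARS for ch in text):
        return False
    # Keep the row and cell in the table; only mark them for the frontend.
    last_row["is_footer"] = True
    cell["col_type"] = "footer"
    return True


rows = [{"index": 0, "is_header": True}, {"index": 1}, {"index": 2}]
cells = [
    {"row_index": 1, "col_type": "column_2", "text": "beat"},
    {"row_index": 1, "col_type": "column_3", "text": "schlagen"},
    {"row_index": 2, "col_type": "column_2", "text": "two hundred and twelve"},
]
print(tag_footer_row(rows, cells), rows[2], cells[2]["col_type"])
```

Tagging rather than deleting keeps row indices stable for the rest of the grid, which is presumably why the later commit switched from removal to `is_footer` and `col_type=footer`.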