breakpilot-lehrer

Files

Benjamin Admin 46c8c28d34 fix: border strip pre-filter + 3-column detection for vocabulary tables

The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly
removed base words along with edge artifacts. Now uses a two-stage approach:
1. _filter_border_strip_words() pre-filters raw words BEFORE column detection,
   scanning from the page edge inward to find the FIRST significant gap (>30px)
2. Step 4e runs as fallback only when pre-filter didn't apply

Session 4233 now correctly detects 3 columns (base word | oder | synonyms)
instead of 2. Threshold raised from 15% to 20% to handle pages with many
edge artifacts. All 4 ground-truth sessions pass regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-21 21:01:43 +01:00
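
The pre-filter described in the commit message above can be illustrated with a minimal sketch. It is not the repository's `_filter_border_strip_words()`: the word shape (`left`, `width` keys), the function and parameter names, and the reading of the 20% threshold as "the strip may hold at most 20% of all words" are assumptions made for illustration. The idea shown is scanning inward from each page edge, stopping at the first gap wider than 30px, and dropping the words before that gap only if they are few enough to be edge artifacts.

```python
from typing import Dict, List, Set, Tuple

Word = Dict[str, object]  # hypothetical shape: {"text": str, "left": int, "width": int}


def _edge_strip(spans: List[Tuple[int, int]], min_gap: int) -> Set[int]:
    """Return indices of spans lying before the FIRST gap wider than min_gap.

    Spans are (start, end) intervals measured so that smaller values are
    closer to the page edge being scanned. Returns an empty set if no gap
    is found.
    """
    ordered = sorted((start, end, idx) for idx, (start, end) in enumerate(spans))
    reach = ordered[0][1]  # furthest extent covered so far
    for pos in range(1, len(ordered)):
        start, end, _ = ordered[pos]
        if start - reach > min_gap:
            return {idx for _, _, idx in ordered[:pos]}
        reach = max(reach, end)
    return set()


def filter_border_strip_words(
    words: List[Word], min_gap: int = 30, max_fraction: float = 0.20
) -> Tuple[List[Word], bool]:
    """Drop words in a border strip at the left or right page edge.

    Returns (kept_words, applied). `applied` is False when nothing was
    removed, so a caller could fall back to a later step such as Step 4e.
    """
    if len(words) < 2:
        return list(words), False

    left_spans = [(w["left"], w["left"] + w["width"]) for w in words]
    # Mirror the x-axis so the same scan also works from the right edge.
    right_spans = [(-end, -start) for start, end in left_spans]

    drop: Set[int] = set()
    for spans in (left_spans, right_spans):
        strip = _edge_strip(spans, min_gap)
        # Assumed meaning of the threshold: only small strips are artifacts.
        if strip and len(strip) <= max_fraction * len(words):
            drop |= strip

    kept = [w for i, w in enumerate(words) if i not in drop]
    return kept, bool(drop)
```

Because the scan stops at the first gap from the edge rather than the largest gap on the page, a wide gap between two real columns further inward is never reached. The fraction check additionally keeps a full outer column from being mistaken for a strip.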

data

feat(ocr-pipeline): British/American IPA pronunciation choice

2026-03-01 11:08:52 +01:00

mail

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

migrations

feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)

2026-03-14 23:41:03 +01:00

models

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

policies

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

routes

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

services

fix: increase PaddleOCR remote timeout to 120s for large scans

2026-03-12 13:41:39 +01:00

tests

fix: border strip pre-filter + 3-column detection for vocabulary tables

2026-03-21 21:01:43 +01:00

admin_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

config.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

country_metadata.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

cv_box_detect.py

feat: run shading-based box detection alongside line detection

2026-03-16 08:12:52 +01:00

cv_cell_grid.py

fix: also store word_boxes for wide columns (Full-Page OCR)

2026-03-11 20:41:29 +01:00

cv_color_detect.py

Fix bullet overlap disambiguation + raise red threshold to 90

2026-03-20 18:21:00 +01:00

cv_graphic_detect.py

fix: robust colored-text detection in graphic filter

2026-03-17 18:09:16 +01:00

cv_layout.py

feat: box-aware column detection — exclude box content from global columns

2026-03-16 18:42:46 +01:00

cv_ocr_engines.py

Fix grammar bracket detection: split on spaces too, not just slashes

2026-03-20 11:45:35 +01:00

cv_preprocessing.py

fix: swap 90°/270° rotation direction in orientation detection

2026-03-17 16:39:15 +01:00

cv_review.py

fix: move _group_words_into_lines to cv_ocr_engines.py

2026-03-09 15:24:56 +01:00

cv_vocab_pipeline.py

feat: Words-First Grid Builder (bottom-up alternative to cell_grid_v2)

2026-03-12 06:46:05 +01:00

cv_vocab_types.py

Vertical zone split: detect divider lines and create independent sub-zones

2026-03-20 16:38:12 +01:00

cv_words_first.py

fix: sort word_boxes in reading order (Y-grouped, then X-sorted)

2026-03-17 10:41:30 +01:00

dsfa_corpus_ingestion.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

dsfa_rag_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

eh_pipeline.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

eh_templates.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

embedding_client.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

full_compliance_pipeline.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

full_reingestion.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

github_crawler.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

grid_editor_api.py

fix: border strip pre-filter + 3-column detection for vocabulary tables

2026-03-21 21:01:43 +01:00

handwriting_htr_api.py

feat(klausur): handwriting removal + exam HTR (Klausur-HTR) implemented

2026-03-03 12:04:26 +01:00

hybrid_search.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

hybrid_vocab_extractor.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

hyde.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

legal_corpus_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

legal_corpus_ingestion.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

legal_corpus_robust.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

legal_templates_ingestion.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

main.py

feat: add Excel-like grid editor for OCR overlay (Kombi mode step 6)

2026-03-14 23:41:03 +01:00

metrics_db.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

migrate_rag_chunks.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

minio_storage.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

nibis_ingestion.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

nru_worksheet_generator.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

ocr_labeling_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

ocr_pipeline_api.py

Add Ground Truth regression test system for OCR pipeline

2026-03-18 13:46:48 +01:00

ocr_pipeline_auto.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_common.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_geometry.py

Invalidate grid_editor_result when exclude regions change

2026-03-19 09:19:09 +01:00

ocr_pipeline_ocr_merge.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_overlays.py

Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid

2026-03-18 11:08:23 +01:00

ocr_pipeline_postprocess.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_regression.py

Add GT button to OCR overlay, prominent category picker, track pipeline

2026-03-18 14:49:02 +01:00

ocr_pipeline_rows.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_session_store.py

Add Ground Truth regression test system for OCR pipeline

2026-03-18 13:46:48 +01:00

ocr_pipeline_sessions.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

ocr_pipeline_words.py

refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules

2026-03-18 08:42:00 +01:00

orientation_crop_api.py

feat: auto-detect multi-page spreads and split into sub-sessions

2026-03-17 16:34:06 +01:00

page_crop.py

Fix spine shadow false positives: require dark valley, brightness rise, trim convolution edges

2026-03-19 08:23:50 +01:00

pdf_export.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

pdf_extraction.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

pipeline_checkpoints.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

pyproject.toml

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

qdrant_service.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

rag_evaluation.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

rbac.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

requirements.txt

feat: OCR pipeline v2.1 – narrow column OCR, dewarp automation, Fabric.js editor

2026-03-03 22:44:14 +01:00

reranker.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

self_rag.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

storage.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

template_sources.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

tesseract_vocab_extractor.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

training_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

training_export_service.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

trocr_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

upload_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

vocab_session_store.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

vocab_worksheet_api.py

fix: ignore edge gaps in _split_broad_columns + return tuple on empty result

2026-03-07 22:16:29 +01:00

worksheet_cleanup_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

worksheet_editor_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

zeugnis_api.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

zeugnis_crawler.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

zeugnis_models.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00

zeugnis_seed_data.py

Initial commit: breakpilot-lehrer - Lehrer KI Platform

2026-02-11 23:47:26 +01:00