Benjamin_Boenisch/breakpilot-core

Benjamin Admin ddad58f607 fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts

HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
but each § was preceded by a tag, so it never matched → 0% section metadata for HTML docs.

Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
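
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names (`looks_like_html`, `strip_html`, `prepare_for_chunking`), the list of block-level tags, and the `SECTION_RE` pattern are all assumptions standing in for whatever the rag-service and the legal chunker actually use.

```python
import html
import re

# Hypothetical list of block-level tags that should become line breaks.
BLOCK_TAGS = r"(?:p|div|br|li|tr|h[1-6]|section|article|table|ul|ol|dd|dt|blockquote)"

HTML_HINT_RE = re.compile(r"<\s*(?:html|body|div|p|span|br|table)\b", re.IGNORECASE)
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
BLOCK_RE = re.compile(rf"</?\s*{BLOCK_TAGS}\b[^>]*>", re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]+>")

# Assumed stand-in for the legal chunker's pattern: § at the start of a line.
SECTION_RE = re.compile(r"^§\s*\d+\w*", re.MULTILINE)


def looks_like_html(text: str) -> bool:
    """Cheap heuristic: does the text contain a common HTML opening tag?"""
    return bool(HTML_HINT_RE.search(text))


def strip_html(text: str) -> str:
    """Turn block elements into newlines, drop other tags, decode entities."""
    text = SCRIPT_STYLE_RE.sub("", text)   # remove script/style bodies entirely
    text = BLOCK_RE.sub("\n", text)        # block elements become line breaks
    text = ANY_TAG_RE.sub("", text)        # drop remaining inline tags
    text = html.unescape(text)             # &sect; -> §, &uuml; -> ü, ...
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def prepare_for_chunking(text: str) -> str:
    """Strip tags only when the content looks like HTML; pass plain text through."""
    return strip_html(text) if looks_like_html(text) else text


if __name__ == "__main__":
    raw = (
        "<html><body><div><p>§ 1 Anwendungsbereich</p>"
        "<p>Dieses Gesetz gilt f&uuml;r alle.</p>"
        "<p>&sect; 2 Begriffe</p></div></body></html>"
    )
    print(SECTION_RE.findall(raw))                        # no § at a line start
    print(SECTION_RE.findall(prepare_for_chunking(raw)))  # § now starts its line
```

Tags are stripped before entities are decoded so that escaped markup such as `&lt;div&gt;` in the source text survives as literal characters instead of being removed as a tag. A production version would more likely use a real HTML parser than regular expressions; the regex approach is kept here only to keep the sketch short.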

Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.

27 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-02 08:18:25 +02:00

[split-required] [guardrail-change] Enforce 500 LOC budget across all services

2026-04-27 00:09:30 +02:00

.gitea/workflows

feat: git SHA version badge in admin, fix finanzplan caching, drop gitea remote

2026-04-17 10:47:51 +02:00

[split-required] [guardrail-change] Enforce 500 LOC budget across all services

2026-04-27 00:09:30 +02:00

Remove re-export shim from keycloak_auth.py, update consumer imports

2026-04-27 00:13:30 +02:00

billing-service

Initial commit: breakpilot-core - Shared Infrastructure

2026-02-11 23:47:13 +01:00

Initial commit: breakpilot-core - Shared Infrastructure

2026-02-11 23:47:13 +01:00

consent-service

[split-required] [guardrail-change] Enforce 500 LOC budget across all services

2026-04-27 00:09:30 +02:00

control-pipeline

fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts

2026-05-02 08:18:25 +02:00

docs: replace all Coolify references with Orca across core repo

2026-04-17 10:39:47 +02:00

document-templates

feat: DSFA Generator, FISA 702 risks with US cloud providers

2026-04-15 00:47:21 +02:00

embedding-service

feat(pipeline): structural metadata end-to-end (Blocks D2-D4)

2026-05-01 20:34:00 +02:00

Initial commit: breakpilot-core - Shared Infrastructure

2026-02-11 23:47:13 +01:00

fix: Ensure public/ dir exists in Docker build for levis-holzbau

2026-03-11 10:06:54 +01:00

fix: add proxy_read_timeout 300s to admin-compliance location block

2026-04-29 11:23:02 +02:00

night-scheduler

Initial commit: breakpilot-core - Shared Infrastructure

2026-02-11 23:47:13 +01:00

paddleocr-service

fix: downgrade to PaddleOCR 2.x — 3.x uses too much RAM on CPU

2026-03-13 19:13:33 +01:00

Merge remote-tracking branch 'gitea/main'

2026-04-27 13:14:54 +02:00

fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts

2026-05-02 08:18:25 +02:00

[split-required] [guardrail-change] Enforce 500 LOC budget across all services

2026-04-27 00:09:30 +02:00

feat: Add DevSecOps tools, Woodpecker proxy, Vault persistent storage, pitch-deck annex slides

2026-02-17 15:42:43 +01:00

[split-required] [guardrail-change] Enforce 500 LOC budget across all services

2026-04-27 00:09:30 +02:00

.env.coolify.example

Add QDRANT_API_KEY support to rag-service

2026-03-13 10:16:59 +01:00

.env.example

chore: remove Woodpecker CI, Gitea Actions only

2026-03-05 23:05:08 +01:00

.gitignore

docs(mcp-server): add README + gitignore .mcp.json

2026-04-13 10:36:54 +02:00

docker-compose.coolify.yml

feat(pitch-deck): admin UI for investor + financial-model management (#3)

2026-04-07 10:36:16 +00:00

docker-compose.hetzner.yml

fix(qdrant): Increase ulimits for RocksDB (Too many open files)

2026-03-11 22:31:16 +01:00

docker-compose.yml

fix(docker): re-enable healthcheck after dedup completion

2026-05-01 08:39:57 +02:00

mkdocs.yml

feat: Add DevSecOps tools, Woodpecker proxy, Vault persistent storage, pitch-deck annex slides

2026-02-17 15:42:43 +01:00

update_pitch_deck_maschinenbau.sql

feat(pitch-deck): pivot to mechanical and plant engineering (Maschinen- und Anlagenbau) target market

2026-02-17 21:42:29 +01:00

Languages

Python 38.3%

TypeScript 37.8%

Go 18.9%

HTML 3.2%

Shell 0.7%

Other 1.1%