Compare commits
6 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 989d9f6f91 | | |
| | 4c99773fa1 | | |
| | b83c3e6e00 | | |
| | a1f425d43a | | |
| | 23c6ac6f32 | | |
| | d82f86fc95 | | |
+3
-2
@@ -130,10 +130,11 @@ rsync -avz --exclude node_modules --exclude .next --exclude .git \

**breakpilot-core MUST be running!** This project uses Core services:
- Valkey (session cache)
- Vault (secrets)
- RAG service (vector search for compliance documents)
- Nginx (reverse proxy)

Secrets live in Infisical (`secrets.meghsakha.com`); the project link is in `.infisical.json`. Start locally with `infisical run --env=dev -- docker compose up` (or `make dev`). `.env`/`.env.local` are no longer used.

**External services (production):**
- PostgreSQL 17 (sslmode=require), schemas: `compliance`, `public`
- Qdrant @ `qdrant-dev.breakpilot.ai` (HTTPS, API key)
@@ -316,7 +317,7 @@ ssh macmini "/usr/local/bin/docker compose -f /Users/benjaminadmin/Projekte/brea

### 5. Sensitive files
**NEVER modify or commit:**
- `.env`, `.env.local`, Vault tokens, SSL certificates
- `.env`, `.env.local`, Infisical tokens, SSL certificates
- `*.pdf`, `*.docx`, compiled binaries, large media

---

@@ -92,7 +92,7 @@ Wenn Hochrisiko:

- [ ] **Transit:** TLS 1.3 for all connections
- [ ] **At rest:** database encryption
- [ ] **Secrets:** Vault for credentials
- [ ] **Secrets:** Infisical (`secrets.meghsakha.com`) for credentials

### Access controls

@@ -136,12 +136,14 @@ jobs:
    runs-on: docker
    needs: detect-changes
    if: github.event_name == 'pull_request' && needs.detect-changes.outputs.sdk == 'true'
    container: golangci/golangci-lint:v1.62-alpine
    container: golangci/golangci-lint:v1.64.8-alpine
    steps:
      - name: Checkout
        run: |
          apk add --no-cache git
          git clone --depth 1 --branch ${GITHUB_HEAD_REF:-${GITHUB_REF_NAME}} ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}.git .
          # Full clone so `main` is a local ref: new-from-merge-base needs the merge base.
          git clone ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}.git .
          git checkout ${GITHUB_HEAD_REF:-${GITHUB_REF_NAME}}
      - name: Lint ai-compliance-sdk
        run: |
          [ -d "ai-compliance-sdk" ] || exit 0

@@ -0,0 +1,5 @@
{
  "workspaceId": "996bda36-9e01-4071-ae8d-69a9f9ff5a23",
  "defaultEnvironment": "",
  "gitBranchToEnvironmentMapping": null
}
@@ -0,0 +1,157 @@
# Infisical Setup for Local Development

This is the per-developer onboarding for accessing the `breakpilot-compliance` secrets while developing locally. Once this is done, **everything you launch through `make dev` (or `infisical run …`) gets the dev secrets injected as environment variables**, including any Claude Code session that spawns those commands.

Secrets live in the self-hosted Infisical instance at **`secrets.meghsakha.com`**. The project link is committed in `.infisical.json`, so you don't need to know the project ID.

---

## 1. Install the Infisical CLI

**macOS (recommended):**

```bash
brew install infisical/get-cli/infisical
```

**Other platforms / manual install:**

See <https://infisical.com/docs/cli/overview>. Verify with:

```bash
infisical --version
# infisical version 0.43.x (or newer)
```

---

## 2. Log in to the self-hosted instance

```bash
infisical login --domain https://secrets.meghsakha.com
```

This opens a browser for SSO. The login is persisted to your OS keychain, so you only do this once per machine.

Sanity check:

```bash
cd ~/projects/breakpilot-compliance   # wherever you cloned the repo
infisical --domain https://secrets.meghsakha.com secrets --env=dev
```

You should see a table of secret names + values. If you get an auth error, re-run `infisical login`.

---

## 3. Verify the project link

The repo already contains `.infisical.json` pointing at the `breakpilot-compliance` project:

```bash
cat .infisical.json
# { "workspaceId": "996bda36-9e01-4071-ae8d-69a9f9ff5a23", ... }
```

If the file is missing (rare, only if you reset the repo), recreate it:

```bash
infisical init --domain https://secrets.meghsakha.com
```

Pick the `breakpilot-compliance` project from the picker.

---

## 4. Launch the stack

```bash
make dev
```

This runs `infisical run --env=dev -- docker compose up`. Every service in the compose stack sees its secrets as normal env vars; no `.env` file ever touches disk.

Other targets:

| Target | What it does |
|--------|--------------|
| `make dev-build` | Same as `make dev` but rebuilds images first |
| `make dev-down` | Stop the stack (no secrets needed) |
| `make dev-logs` | Tail logs |
| `make dev-ps` | List running containers |
| `make secrets` | Print all secrets in `dev` (read-only) |
| `make secrets-set KEY=FOO VALUE=bar` | Add or update a secret in `dev` |

To target a different environment:

```bash
make dev ENV=staging
make secrets ENV=prod
```

---

## 5. Using secrets from Claude Code

When Claude Code runs commands in this repo via its Bash tool, the commands inherit your shell's environment. Two patterns:

**Pattern A: let Claude launch the stack normally**

Claude just runs `make dev`. The Infisical CLI inside that command resolves secrets at run time and passes them to docker compose. Claude doesn't see plaintext secrets in its context, but the running services do.

**Pattern B: let Claude run a one-off script with secrets**

If Claude needs to execute a Python/Go script that requires secrets, wrap the command:

```bash
infisical run --env=dev -- python scripts/some_one_off.py
```

This works for any subprocess: pytest, alembic, go run, npm scripts. If Claude proposes a command that reads env vars and runs raw, ask it to wrap it in `infisical run --env=dev --` first.

**What Claude should not do:**

- `infisical export --env=dev > .env`: this defeats the whole point, even though `.gitignore` still keeps the file out of commits.
- `infisical secrets get KEY --env=dev --raw` and pasting the value into a code edit: secrets must stay out of the repo.

If you want Claude to never accidentally dump secrets, add this to your `.claude/settings.json` permissions (project-level or user-level):

```json
{
  "permissions": {
    "deny": [
      "Bash(infisical export*)",
      "Bash(infisical secrets get*)"
    ]
  }
}
```

---

## Troubleshooting

| Symptom | Fix |
|---------|-----|
| `please either run infisical init or pass --projectId` | `.infisical.json` is missing or unreadable; re-run `infisical init` |
| `unauthorized` / `please log in` | Re-run `infisical login --domain https://secrets.meghsakha.com` |
| `make dev` says secret is empty | Check the name in `make secrets` matches what docker-compose expects, then update the service config or rename the secret in Infisical |
| Browser SSO doesn't open | Use `infisical login --domain https://secrets.meghsakha.com --method=user` and paste the URL manually |

---

## What the dev env contains

Run `make secrets` to see the live list. As of this writing the dev env includes (at minimum):

- `BREAKPILOT_DB_PASSWORD`
- `BREAKPILOT_QDRANT_API_KEY`
- `LITELLM_API_KEY`

Every other variable in `.env.example` either has a sane default in `docker-compose.yml` or needs to be added to Infisical. To add one:

```bash
make secrets-set KEY=ANTHROPIC_API_KEY VALUE=sk-ant-xxxx
```

Or via the web UI: <https://secrets.meghsakha.com>.
@@ -0,0 +1,57 @@
# breakpilot-compliance: developer workflow
#
# Secrets are managed in Infisical (secrets.meghsakha.com). The project
# link lives in .infisical.json. To get started:
#   1) infisical login --domain https://secrets.meghsakha.com   (once per machine)
#   2) make dev
#
# .env / .env.local are NOT used in this repo anymore. Anything that needs
# secrets MUST be launched through `infisical run` so the values come from
# the secrets store instead of disk.

INFISICAL ?= infisical
INFISICAL_DOMAIN ?= https://secrets.meghsakha.com
ENV ?= dev

INFISICAL_RUN := $(INFISICAL) --domain $(INFISICAL_DOMAIN) run --env=$(ENV) --
INFISICAL_SECRETS := $(INFISICAL) --domain $(INFISICAL_DOMAIN) secrets --env=$(ENV)

.PHONY: help dev dev-build dev-down dev-logs dev-ps secrets secrets-set check-loc

help:
	@echo "Targets:"
	@echo "  dev          Start the full compose stack with secrets injected from Infisical"
	@echo "  dev-build    Same as dev, but force a rebuild first"
	@echo "  dev-down     Stop the compose stack (no secrets needed)"
	@echo "  dev-logs     Tail logs from all services"
	@echo "  dev-ps       Show running containers"
	@echo "  secrets      List all secrets in the current env ($(ENV))"
	@echo "  secrets-set  Set a secret (KEY=... VALUE=...)"
	@echo "  check-loc    Run the 500-line LOC guard"

dev:
	$(INFISICAL_RUN) docker compose up

dev-build:
	$(INFISICAL_RUN) docker compose up --build

dev-down:
	docker compose down

dev-logs:
	docker compose logs -f

dev-ps:
	docker compose ps

secrets:
	$(INFISICAL_SECRETS)

secrets-set:
	@if [ -z "$(KEY)" ] || [ -z "$(VALUE)" ]; then \
		echo "Usage: make secrets-set KEY=MY_KEY VALUE=my_value"; exit 1; \
	fi
	$(INFISICAL) --domain $(INFISICAL_DOMAIN) secrets set $(KEY)=$(VALUE) --env=$(ENV)

check-loc:
	bash scripts/check-loc.sh
@@ -42,23 +42,26 @@ All containers share the external `breakpilot-network` Docker network and depend

## Quick Start

**Prerequisites:** Docker, Go 1.24+, Python 3.12+, Node.js 20+
**Prerequisites:** Docker, Go 1.24+, Python 3.12+, Node.js 20+, [Infisical CLI](https://infisical.com/docs/cli/overview)

```bash
git clone ssh://git@gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-compliance.git
cd breakpilot-compliance

# Copy and populate secrets (never commit .env)
cp .env.example .env
# One-time per machine: log in to the self-hosted Infisical instance
infisical login --domain https://secrets.meghsakha.com

# Start all services
docker compose up -d
# Start the full stack with secrets injected from Infisical (env=dev)
make dev
```

Secrets are pulled from Infisical (`secrets.meghsakha.com`) at runtime; `.env` files are not used. See [INFISICAL_SETUP.md](./INFISICAL_SETUP.md) for full onboarding, and `make help` for the rest of the targets (`dev-build`, `dev-down`, `secrets`, `secrets-set`).

For the Orca/Hetzner production target (x86_64), use the override:

```bash
docker compose -f docker-compose.yml -f docker-compose.hetzner.yml up -d
make dev ENV=prod   # or:
infisical run --env=prod -- docker compose -f docker-compose.yml -f docker-compose.hetzner.yml up -d
```

---

@@ -46,6 +46,28 @@ export interface CorpusOverview {
  totals: { documents: number; catalog_sources: number }
}

// --- Ingested legal-corpus structure (from the vector store, via the Go SDK).
// Shows WHAT each eur-lex act consists of (articles/annexes/recitals), so the
// ingested corpus is not a black box for developers. ---
export interface LegalActStructure {
  regulation_short: string
  regulation_name: string
  articles: number
  annexes: number
  recitals: number
  chunks: number
}

export interface LegalCorpus {
  regulations: LegalActStructure[]
  totals: {
    regulations: number
    articles: number
    annexes: number
    recitals: number
  }
}

// --- Corpus documents: group by kind (law/guideline/standard/ruling)
// + publisher family (DSK, EDPB, OWASP, NIST …). Deterministic, pure. ---
interface DocCat {

@@ -3,6 +3,7 @@ import Link from 'next/link'
import {
  type UseCaseRow,
  type CorpusOverview,
  type LegalCorpus,
  licenseTierBadgeClass,
  commercialBadgeClass,
  groupUseCases,
@@ -11,28 +12,46 @@ import {

const BACKEND_URL =
  process.env.COMPLIANCE_BACKEND_URL || 'http://backend-compliance:8002'
// The legal-corpus structure comes from the Go SDK (it owns the vector store).
const SDK_URL = process.env.SDK_URL || 'http://ai-compliance-sdk:8090'

export const dynamic = 'force-dynamic'

// Fetched from the SDK and isolated in its own try/catch so a vector-store
// hiccup degrades to "no structure shown" instead of blanking the whole page.
async function fetchLegalCorpus(): Promise<LegalCorpus | null> {
  try {
    const res = await fetch(`${SDK_URL}/sdk/v1/rag/legal-corpus`, {
      cache: 'no-store',
    })
    return res.ok ? await res.json() : null
  } catch {
    return null
  }
}

async function getData(): Promise<{
  useCases: UseCaseRow[]
  corpus: CorpusOverview | null
  legalCorpus: LegalCorpus | null
}> {
  try {
    const [ucRes, corpusRes] = await Promise.all([
    const [ucRes, corpusRes, legalCorpus] = await Promise.all([
      fetch(`${BACKEND_URL}/api/compliance/v1/controls/use-cases`, {
        cache: 'no-store',
      }),
      fetch(`${BACKEND_URL}/api/compliance/v1/controls/corpus`, {
        cache: 'no-store',
      }),
      fetchLegalCorpus(),
    ])
    return {
      useCases: ucRes.ok ? await ucRes.json() : [],
      corpus: corpusRes.ok ? await corpusRes.json() : null,
      legalCorpus,
    }
  } catch {
    return { useCases: [], corpus: null }
    return { useCases: [], corpus: null, legalCorpus: null }
  }
}

@@ -46,7 +65,7 @@ function Stat({ label, value }: { label: string; value: string | number }) {
}

export default async function CoveragePage() {
  const { useCases, corpus } = await getData()
  const { useCases, corpus, legalCorpus } = await getData()
  const groups = groupUseCases(useCases)
  const totalRelevant = useCases.reduce((s, u) => s + u.atom_relevant, 0)
  const totalAtoms = useCases.reduce((s, u) => s + u.atom_total, 0)
@@ -221,6 +240,67 @@ export default async function CoveragePage() {
        </div>
      </section>

      {legalCorpus?.regulations?.length ? (
        <section className="space-y-2">
          <h2 className="text-lg font-semibold text-gray-900">
            Ingestierter Rechtskorpus – Struktur ({legalCorpus.totals.regulations}{' '}
            Rechtsakte)
          </h2>
          <p className="text-xs text-gray-500">
            Woraus jeder ingestierte eur-lex-Rechtsakt tatsächlich besteht:
            Artikel (§), Anhänge, Erwägungsgründe und retrievbare Chunks — direkt
            aus dem Vektorspeicher, damit kein Black-Box-Korpus entsteht.
          </p>
          <div className="overflow-auto rounded-lg border border-gray-200">
            <table className="min-w-full divide-y divide-gray-200 text-sm">
              <thead className="bg-gray-50 text-left text-xs uppercase text-gray-500">
                <tr>
                  <th className="px-4 py-2">Rechtsakt</th>
                  <th className="px-4 py-2 text-right">Artikel (§)</th>
                  <th className="px-4 py-2 text-right">Anhänge</th>
                  <th className="px-4 py-2 text-right">Erwägungsgründe</th>
                  <th className="px-4 py-2 text-right">Chunks</th>
                </tr>
              </thead>
              <tbody className="divide-y divide-gray-100 bg-white">
                {legalCorpus.regulations.map((r) => (
                  <tr key={r.regulation_short}>
                    <td className="px-4 py-2 text-gray-900">
                      <span className="font-medium">{r.regulation_short}</span>
                      {r.regulation_name !== r.regulation_short ? (
                        <span className="ml-2 text-xs text-gray-500">
                          {r.regulation_name}
                        </span>
                      ) : null}
                    </td>
                    <td className="px-4 py-2 text-right font-semibold">
                      {r.articles.toLocaleString('de-DE')}
                    </td>
                    <td className="px-4 py-2 text-right">
                      {r.annexes > 0 ? (
                        r.annexes.toLocaleString('de-DE')
                      ) : (
                        <span className="text-gray-300">—</span>
                      )}
                    </td>
                    <td className="px-4 py-2 text-right text-gray-500">
                      {r.recitals > 0 ? (
                        r.recitals.toLocaleString('de-DE')
                      ) : (
                        <span className="text-gray-300">—</span>
                      )}
                    </td>
                    <td className="px-4 py-2 text-right text-gray-500">
                      {r.chunks.toLocaleString('de-DE')}
                    </td>
                  </tr>
                ))}
              </tbody>
            </table>
          </div>
        </section>
      ) : null}

      {corpus?.license_catalog?.length ? (
        <section className="space-y-2">
          <h2 className="text-lg font-semibold text-gray-900">

@@ -55,8 +55,7 @@ linters-settings:
    rules:
      - name: exported
        arguments:
          - checkPrivateReceivers: false
          - disableStutteringCheck: true
          - disableStutteringCheck
      - name: error-return
      - name: increment-decrement
      - name: var-declaration
@@ -83,6 +82,6 @@ issues:
  max-issues-per-linter: 50
  max-same-issues: 5

  # New code only: don't fail on pre-existing issues in files we haven't touched.
  # Remove this once a clean baseline is established.
  new: false
  # New code only: lint lines changed vs main, so pre-existing debt doesn't fail CI.
  # Needs the go-lint job to clone with a local `main` ref (see .gitea/workflows/ci.yaml).
  new-from-merge-base: main

@@ -75,9 +75,10 @@ func (h *RAGHandlers) Search(c *gin.Context) {
	}

	c.JSON(http.StatusOK, gin.H{
		"query":   req.Query,
		"results": results,
		"count":   len(results),
		"query":      req.Query,
		"results":    results,
		"count":      len(results),
		"assessment": ucca.Assess(results),
	})
}

@@ -206,3 +207,32 @@ func (h *RAGHandlers) HandleScrollChunks(c *gin.Context) {
		"total": len(chunks),
	})
}

// LegalCorpusStructure returns the composition (distinct articles, annexes,
// recitals + chunk count) of every ingested eur-lex legal act, so the coverage
// page can show WHAT was ingested instead of just the act name.
// GET /sdk/v1/rag/legal-corpus
func (h *RAGHandlers) LegalCorpusStructure(c *gin.Context) {
	acts, err := h.ragClient.CorpusStructure(c.Request.Context())
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to aggregate legal corpus: " + err.Error()})
		return
	}

	arts, anns, recs := 0, 0, 0
	for _, a := range acts {
		arts += a.Articles
		anns += a.Annexes
		recs += a.Recitals
	}

	c.JSON(http.StatusOK, gin.H{
		"regulations": acts,
		"totals": gin.H{
			"regulations": len(acts),
			"articles":    arts,
			"annexes":     anns,
			"recitals":    recs,
		},
	})
}
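
The response shape of `GET /sdk/v1/rag/legal-corpus` follows from the handler above: a `regulations` array plus a `totals` object. As a rough illustration of how a Go client might consume it, here is a minimal sketch. The `legalCorpus` type, `fetchLegalCorpus`, and `stubServer` are hypothetical names, and the stub serves made-up sample values rather than real corpus data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// legalCorpus mirrors the JSON keys written by the handler (subset of fields).
type legalCorpus struct {
	Regulations []struct {
		RegulationShort string `json:"regulation_short"`
		Articles        int    `json:"articles"`
		Chunks          int    `json:"chunks"`
	} `json:"regulations"`
	Totals struct {
		Regulations int `json:"regulations"`
		Articles    int `json:"articles"`
		Annexes     int `json:"annexes"`
		Recitals    int `json:"recitals"`
	} `json:"totals"`
}

// fetchLegalCorpus requests the endpoint under baseURL and decodes the body.
func fetchLegalCorpus(baseURL string) (*legalCorpus, error) {
	resp, err := http.Get(baseURL + "/sdk/v1/rag/legal-corpus")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out legalCorpus
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// stubServer stands in for the SDK; the payload is invented sample data.
func stubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"regulations":[{"regulation_short":"CRA","articles":2,"chunks":5}],`+
			`"totals":{"regulations":1,"articles":2,"annexes":2,"recitals":0}}`)
	}))
}

func main() {
	srv := stubServer()
	defer srv.Close()
	corpus, err := fetchLegalCorpus(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(corpus.Totals.Regulations, corpus.Regulations[0].RegulationShort)
}
```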
@@ -161,6 +161,7 @@ func registerRAGRoutes(v1 *gin.RouterGroup, h *handlers.RAGHandlers) {
		ragRoutes.GET("/corpus-status", h.CorpusStatus)
		ragRoutes.GET("/corpus-versions/:collection", h.CorpusVersionHistory)
		ragRoutes.GET("/scroll", h.HandleScrollChunks)
		ragRoutes.GET("/legal-corpus", h.LegalCorpusStructure)
	}
}

@@ -0,0 +1,167 @@
package ucca

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sort"
)

// LegalActStructure is the composition of one ingested eur-lex legal act: how
// many distinct articles, annexes and recitals it consists of (plus the raw
// chunk count). Backs the coverage page so the ingested corpus is not a black
// box: a developer SEES what each act actually contains, not only its name.
type LegalActStructure struct {
	RegulationShort string `json:"regulation_short"`
	RegulationName  string `json:"regulation_name"`
	Articles        int    `json:"articles"`
	Annexes         int    `json:"annexes"`
	Recitals        int    `json:"recitals"`
	Chunks          int    `json:"chunks"`
}

const eurlexSource = "eur-lex.europa.eu"

// legalStructureCollections hold the clean eur-lex legal corpus (chunks tagged
// with chunk_scope = section | annex | recital).
var legalStructureCollections = []string{"bp_compliance_ce", "bp_compliance_datenschutz"}

// chunkScopeBucket maps a Qdrant chunk_scope to the structure field it feeds.
var chunkScopeBucket = map[string]string{"section": "articles", "annex": "annexes", "recital": "recitals"}

// CorpusStructure scrolls the eur-lex legal corpus across the legal collections
// and aggregates the per-act composition. The source filter keeps it to a few
// hundred points regardless of total corpus size. Read-only; a collection that
// fails to scroll is skipped rather than failing the whole call.
func (c *LegalRAGClient) CorpusStructure(ctx context.Context) ([]LegalActStructure, error) {
	var all []qdrantScrollPoint
	for _, coll := range legalStructureCollections {
		pts, err := c.scrollLegalCorpus(ctx, coll)
		if err != nil {
			continue
		}
		all = append(all, pts...)
	}
	return aggregateStructure(all), nil
}

// aggregateStructure counts distinct article labels per (regulation, scope).
// Pure → unit-testable without a vector store.
func aggregateStructure(points []qdrantScrollPoint) []LegalActStructure {
	distinct := map[string]map[string]map[string]struct{}{}
	names := map[string]string{}
	chunks := map[string]int{}
	order := []string{}

	for _, pt := range points {
		reg := getString(pt.Payload, "regulation_short")
		if reg == "" {
			continue
		}
		if _, seen := names[reg]; !seen {
			name := getString(pt.Payload, "regulation_name_de")
			if name == "" {
				name = reg
			}
			names[reg] = name
			distinct[reg] = map[string]map[string]struct{}{}
			order = append(order, reg)
		}
		chunks[reg]++
		bucket, ok := chunkScopeBucket[getString(pt.Payload, "chunk_scope")]
		article := getString(pt.Payload, "article")
		if !ok || article == "" {
			continue
		}
		if distinct[reg][bucket] == nil {
			distinct[reg][bucket] = map[string]struct{}{}
		}
		distinct[reg][bucket][article] = struct{}{}
	}

	out := make([]LegalActStructure, 0, len(order))
	for _, reg := range order {
		out = append(out, LegalActStructure{
			RegulationShort: reg,
			RegulationName:  names[reg],
			Articles:        len(distinct[reg]["articles"]),
			Annexes:         len(distinct[reg]["annexes"]),
			Recitals:        len(distinct[reg]["recitals"]),
			Chunks:          chunks[reg],
		})
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Articles != out[j].Articles {
			return out[i].Articles > out[j].Articles
		}
		return out[i].RegulationShort < out[j].RegulationShort
	})
	return out
}

// scrollLegalCorpus pages through one collection, filtered to the eur-lex legal
// corpus, returning minimal-payload points (no text/vectors).
func (c *LegalRAGClient) scrollLegalCorpus(ctx context.Context, collection string) ([]qdrantScrollPoint, error) {
	var all []qdrantScrollPoint
	var offset interface{}
	for {
		points, next, err := c.scrollLegalPage(ctx, collection, offset)
		if err != nil {
			return nil, err
		}
		all = append(all, points...)
		if next == nil {
			break
		}
		offset = next
	}
	return all, nil
}

// scrollLegalPage fetches one page of the filtered scroll and returns the
// points plus the next-page offset (nil when exhausted).
func (c *LegalRAGClient) scrollLegalPage(ctx context.Context, collection string, offset interface{}) ([]qdrantScrollPoint, interface{}, error) {
	reqBody := map[string]interface{}{
		"limit":        500,
		"with_payload": map[string]interface{}{"include": []string{"regulation_short", "regulation_name_de", "chunk_scope", "article"}},
		"with_vectors": false,
		"filter": map[string]interface{}{
			"must": []map[string]interface{}{
				{"key": "source", "match": map[string]interface{}{"value": eurlexSource}},
			},
		},
	}
	if offset != nil {
		reqBody["offset"] = offset
	}
	jsonBody, err := json.Marshal(reqBody)
	if err != nil {
		return nil, nil, err
	}
	url := fmt.Sprintf("%s/collections/%s/points/scroll", c.qdrantURL, collection)
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(jsonBody))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if c.qdrantAPIKey != "" {
		req.Header.Set("api-key", c.qdrantAPIKey)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return nil, nil, err
	}
	defer func() { _ = resp.Body.Close() }()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, nil, fmt.Errorf("qdrant returned %d: %s", resp.StatusCode, string(body))
	}
	var scrollResp qdrantScrollResponse
	if err := json.NewDecoder(resp.Body).Decode(&scrollResp); err != nil {
		return nil, nil, err
	}
	return scrollResp.Result.Points, scrollResp.Result.NextPageOffset, nil
}
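
The scroll loop in this file follows a cursor pattern: request a page, append its points, and feed the returned offset into the next request until the offset comes back nil. The sketch below isolates that loop so it can be exercised without a vector store. `pageFunc`, `scrollAll`, and `fakePages` are hypothetical names introduced only for illustration, and the in-memory pager uses an integer index as its cursor, which is an assumption rather than what Qdrant returns.

```go
package main

import "fmt"

// pageFunc is a hypothetical stand-in for one scroll request: it returns the
// items of a page plus the cursor of the next page (nil when exhausted).
type pageFunc func(offset interface{}) (items []string, next interface{}, err error)

// scrollAll drains a cursor-paginated source by feeding each returned cursor
// back into the next request until the source reports a nil cursor.
func scrollAll(fetch pageFunc) ([]string, error) {
	var all []string
	var offset interface{}
	for {
		items, next, err := fetch(offset)
		if err != nil {
			return nil, err
		}
		all = append(all, items...)
		if next == nil {
			return all, nil
		}
		offset = next
	}
}

// fakePages serves a fixed slice in pages of the given size, using the start
// index as the cursor. It exists only to exercise scrollAll without a server.
func fakePages(data []string, size int) pageFunc {
	return func(offset interface{}) ([]string, interface{}, error) {
		start := 0
		if offset != nil {
			start = offset.(int)
		}
		end := start + size
		if end >= len(data) {
			return data[start:], nil, nil // last page: no further cursor
		}
		return data[start:end], end, nil
	}
}

func main() {
	got, err := scrollAll(fakePages([]string{"a", "b", "c", "d", "e"}, 2))
	fmt.Println(got, err)
}
```

Keeping the page fetcher behind a function value is one way to make the loop testable on its own; the source instead keeps `aggregateStructure` pure and leaves the HTTP paging untested.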
@@ -0,0 +1,50 @@
package ucca

import "testing"

func structPoint(reg, name, scope, article string) qdrantScrollPoint {
	return qdrantScrollPoint{Payload: map[string]interface{}{
		"regulation_short":   reg,
		"regulation_name_de": name,
		"chunk_scope":        scope,
		"article":            article,
	}}
}

func TestAggregateStructure_CountsDistinctPerScope(t *testing.T) {
	points := []qdrantScrollPoint{
		structPoint("CRA", "Cyber Resilience Act", "section", "13"),
		structPoint("CRA", "Cyber Resilience Act", "section", "13"), // duplicate article → still 1
		structPoint("CRA", "Cyber Resilience Act", "section", "14"),
		structPoint("CRA", "Cyber Resilience Act", "annex", "Anhang-I"),
		structPoint("CRA", "Cyber Resilience Act", "annex", "Anhang-VII"),
		structPoint("DORA", "", "section", "6"),  // first sighting has no name →
		structPoint("DORA", "", "section", "19"), // regulation_name falls back to short
		structPoint("DORA", "", "recital", ""),   // empty article → ignored for distinct
		structPoint("", "x", "section", "1"),     // missing regulation → skipped entirely
	}

	got := aggregateStructure(points)

	if len(got) != 2 {
		t.Fatalf("want 2 acts, got %d (%+v)", len(got), got)
	}
	// CRA has more articles → sorts first.
	cra := got[0]
	if cra.RegulationShort != "CRA" || cra.Articles != 2 || cra.Annexes != 2 || cra.Recitals != 0 || cra.Chunks != 5 {
		t.Errorf("CRA wrong: %+v", cra)
	}
	dora := got[1]
	if dora.RegulationShort != "DORA" || dora.Articles != 2 || dora.Chunks != 3 {
		t.Errorf("DORA wrong: %+v", dora)
	}
	if dora.RegulationName != "DORA" {
		t.Errorf("DORA name fallback failed: %q", dora.RegulationName)
	}
}

func TestAggregateStructure_Empty(t *testing.T) {
	if got := aggregateStructure(nil); len(got) != 0 {
		t.Errorf("want empty, got %+v", got)
	}
}
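
The test above pins down two separate counts: every point adds to the chunk total of its regulation, while only distinct non-empty labels add to the per-scope count. A stripped-down sketch of that rule, using simplified types instead of Qdrant payloads, might look as follows. `chunk`, `counts`, and `countDistinct` are hypothetical names, and unlike the source this version keys the result by the raw scope string rather than mapping it to a named field.

```go
package main

import "fmt"

// chunk is a simplified, hypothetical stand-in for a vector-store point payload.
type chunk struct {
	Regulation string
	Scope      string // "section", "annex" or "recital"
	Label      string // article or annex label; may be empty
}

// counts holds the number of distinct labels per scope plus the raw chunk total.
type counts struct {
	Distinct map[string]int
	Chunks   int
}

// countDistinct tallies, per regulation, every chunk and the number of
// distinct non-empty labels in each scope.
func countDistinct(chunks []chunk) map[string]counts {
	seen := map[string]map[string]map[string]struct{}{}
	total := map[string]int{}
	for _, c := range chunks {
		if c.Regulation == "" {
			continue // no regulation: skipped entirely
		}
		total[c.Regulation]++
		if c.Label == "" {
			continue // counted as a chunk, but not as a distinct unit
		}
		if seen[c.Regulation] == nil {
			seen[c.Regulation] = map[string]map[string]struct{}{}
		}
		if seen[c.Regulation][c.Scope] == nil {
			seen[c.Regulation][c.Scope] = map[string]struct{}{}
		}
		seen[c.Regulation][c.Scope][c.Label] = struct{}{}
	}
	out := map[string]counts{}
	for reg, n := range total {
		d := map[string]int{}
		for scope, labels := range seen[reg] {
			d[scope] = len(labels)
		}
		out[reg] = counts{Distinct: d, Chunks: n}
	}
	return out
}

func main() {
	got := countDistinct([]chunk{
		{"CRA", "section", "13"},
		{"CRA", "section", "13"}, // duplicate label: one distinct article, two chunks
		{"CRA", "annex", "Anhang-I"},
		{"DORA", "recital", ""}, // empty label: chunk only
	})
	fmt.Println(got["CRA"].Distinct["section"], got["CRA"].Chunks, got["DORA"].Chunks)
}
```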
@@ -0,0 +1,134 @@
|
||||
package ucca
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
)
|
||||
|
||||
const (
|
||||
assessConnectedCap = 12 // cap connected norms surfaced in the assessment
|
||||
assessCrossRegimeTopN = 5 // window over which "cross regime" is judged
|
||||
assessReviewMargin = 0.05 // a tighter winner gap → recommend human review
|
||||
)
|
||||
|
||||
// Assess builds the auditable explanation layer over a ranked result set:
|
||||
// primary norm, the norms it connects to (citation graph), cross-regime, a
|
||||
// human-review flag, the winner margin and a short reasoning string. Pure →
|
||||
// unit-testable. It EXPLAINS the ranking, it does not change it. Returns nil for
|
||||
// an empty result set.
|
||||
func Assess(results []LegalSearchResult) *LegalAssessment {
|
||||
if len(results) == 0 {
|
||||
return nil
|
||||
}
|
||||
// Norm-level view: collapse multiple chunks of the same article/annex so the
|
||||
// margin and cross-regime are judged between DISTINCT norms, not near-identical
|
||||
// chunks of one norm (which would make every winner margin ~0).
|
||||
norms := distinctNorms(results)
|
||||
p := norms[0]
|
||||
|
||||
primary := primaryLabel(p)
|
||||
connected := dedupStrings(p.ReferencesOut, p.ReferencesIn, p.CitationUnit)
|
||||
if len(connected) > assessConnectedCap {
|
||||
connected = connected[:assessConnectedCap]
|
||||
}
|
||||
|
||||
window := norms
|
||||
if len(window) > assessCrossRegimeTopN {
|
||||
window = window[:assessCrossRegimeTopN]
|
||||
}
|
||||
regimes := make(map[string]bool)
|
||||
for _, r := range window {
|
||||
if r.RegulationShort != "" {
|
||||
regimes[r.RegulationShort] = true
|
||||
}
|
||||
}
|
||||
crossRegime := len(regimes) > 1
|
||||
|
||||
margin := 0.0
|
||||
if len(norms) > 1 {
|
||||
margin = norms[0].Score - norms[1].Score
|
||||
}
|
||||
|
||||
primaryBinding := p.SourceClass == "binding_law"
|
||||
humanReview := margin < assessReviewMargin || crossRegime || !primaryBinding
|
||||
|
||||
return &LegalAssessment{
|
||||
PrimaryNorm: primary,
|
||||
PrimaryRegulation: p.RegulationShort,
|
||||
ConnectedNorms: connected,
|
||||
CrossRegime: crossRegime,
|
||||
HumanReviewFlag: humanReview,
|
||||
WinnerMargin: margin,
|
||||
ScoreReasoning: assessReasoning(p, margin, crossRegime, primaryBinding),
|
||||
}
|
||||
}
|
||||
|
||||
func primaryLabel(p LegalSearchResult) string {
	if p.CitationUnit != "" {
		return p.CitationUnit
	}
	if p.ArticleLabel != "" {
		return p.ArticleLabel
	}
	return strings.TrimSpace(p.RegulationShort + " " + p.Article)
}

// assessReasoning renders a short, human-readable justification (German).
func assessReasoning(p LegalSearchResult, margin float64, crossRegime, primaryBinding bool) string {
	label := primaryLabel(p)
	parts := make([]string, 0, 4)
	if primaryBinding {
		parts = append(parts, fmt.Sprintf("Primärtreffer %s: bindendes Recht (Autorität %d).", label, p.AuthorityWeight))
	} else {
		parts = append(parts, fmt.Sprintf("Primärtreffer %s ist keine bindende Norm (Leitlinie/Standard) — Quelle prüfen.", label))
	}
	if margin > 0 {
		parts = append(parts, fmt.Sprintf("Vorsprung %.2f vor #2.", margin))
	}
	if margin < assessReviewMargin {
		parts = append(parts, "Knapper Vorsprung — Alternativtreffer prüfen.")
	}
	if crossRegime {
		parts = append(parts, "Mehrere Regime betroffen — Querbezug prüfen.")
	}
	return strings.Join(parts, " ")
}

// distinctNorms collapses results that share a citation (multiple chunks of the
// same article/annex) to the first, i.e. highest-ranked, occurrence. Results
// without any citation identity are each kept, since they cannot be matched.
func distinctNorms(results []LegalSearchResult) []LegalSearchResult {
	seen := make(map[string]bool, len(results))
	out := make([]LegalSearchResult, 0, len(results))
	for _, r := range results {
		key := r.CitationUnit
		if key == "" {
			key = r.ArticleLabel
		}
		if key != "" {
			if seen[key] {
				continue
			}
			seen[key] = true
		}
		out = append(out, r)
	}
	return out
}

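To see why the margin is taken after collapsing chunks, the sketch below compares a chunk-level and a norm-level margin on the same ranked list. It is an illustration under assumptions: `hit`, `collapse` and `margin` are hypothetical stand-ins (a trimmed-down `LegalSearchResult` and simplified helpers), not the functions from this diff.

```go
package main

import "fmt"

// hit is a hypothetical, trimmed-down stand-in for LegalSearchResult.
type hit struct {
	Citation string
	Score    float64
}

// collapse keeps the first (highest-ranked) hit per citation. Hits without
// a citation are always kept because they cannot be matched to anything.
func collapse(hits []hit) []hit {
	seen := make(map[string]bool, len(hits))
	out := make([]hit, 0, len(hits))
	for _, h := range hits {
		if h.Citation != "" {
			if seen[h.Citation] {
				continue
			}
			seen[h.Citation] = true
		}
		out = append(out, h)
	}
	return out
}

// margin returns the score gap between the first two entries, or 0 when
// there are fewer than two.
func margin(hits []hit) float64 {
	if len(hits) < 2 {
		return 0
	}
	return hits[0].Score - hits[1].Score
}

func main() {
	hits := []hit{
		{"Art. 13 CRA", 1.050},
		{"Art. 13 CRA", 1.049}, // second chunk of the same norm
		{"Art. 14 CRA", 0.800},
	}
	fmt.Printf("chunk-level margin: %.3f\n", margin(hits))
	fmt.Printf("norm-level margin:  %.3f\n", margin(collapse(hits)))
}
```

On the raw list the gap is about 0.001, which would look like a near-tie; after collapsing, the gap between the two distinct norms is about 0.25.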
// dedupStrings concatenates out+in, drops empties and the excluded value, and
// returns a stable de-duplicated slice (insertion order preserved).
func dedupStrings(out, in []string, exclude string) []string {
	seen := map[string]bool{exclude: true}
	res := make([]string, 0, len(out)+len(in))
	for _, list := range [][]string{out, in} {
		for _, s := range list {
			if s == "" || seen[s] {
				continue
			}
			seen[s] = true
			res = append(res, s)
		}
	}
	return res
}
@@ -0,0 +1,112 @@
package ucca

import "testing"

func ares(reg, cu, sc string, score float64, weight int, out, in []string) LegalSearchResult {
	return LegalSearchResult{
		RegulationShort: reg, CitationUnit: cu, SourceClass: sc, Score: score,
		AuthorityWeight: weight, ReferencesOut: out, ReferencesIn: in,
	}
}

func TestAssess_Empty(t *testing.T) {
	if Assess(nil) != nil {
		t.Error("empty results → nil assessment")
	}
}

func TestAssess_BindingPrimary_NoReview(t *testing.T) {
	results := []LegalSearchResult{
		ares("CRA", "Art. 13 CRA", "binding_law", 1.05, 100,
			[]string{"CRA Anhang I", "Art. 14 CRA"}, []string{"Art. 12 CRA"}),
		ares("CRA", "Art. 14 CRA", "binding_law", 0.80, 100, nil, nil),
	}
	a := Assess(results)
	if a == nil {
		t.Fatal("nil assessment")
	}
	if a.PrimaryNorm != "Art. 13 CRA" || a.PrimaryRegulation != "CRA" {
		t.Errorf("primary wrong: %+v", a)
	}
	if len(a.ConnectedNorms) != 3 { // out(2) + in(1), self excluded, deduped
		t.Errorf("connected norms: %v", a.ConnectedNorms)
	}
	if a.CrossRegime {
		t.Error("single regime must not be cross-regime")
	}
	if a.WinnerMargin < 0.24 || a.WinnerMargin > 0.26 {
		t.Errorf("margin = %v, want ~0.25", a.WinnerMargin)
	}
	if a.HumanReviewFlag {
		t.Error("clean binding + healthy margin + single regime → no review")
	}
}

func TestAssess_CrossRegimeFlagsReview(t *testing.T) {
	a := Assess([]LegalSearchResult{
		ares("CRA", "Art. 13 CRA", "binding_law", 1.05, 100, nil, nil),
		ares("DORA", "Art. 6 DORA", "binding_law", 0.70, 100, nil, nil),
	})
	if !a.CrossRegime || !a.HumanReviewFlag {
		t.Errorf("cross-regime must flag review: %+v", a)
	}
}

func TestAssess_NonBindingFlagsReview(t *testing.T) {
	a := Assess([]LegalSearchResult{
		ares("ENISA", "ENISA SBOM", "supervisory_guidance", 0.90, 70, nil, nil),
		ares("ENISA", "ENISA X", "supervisory_guidance", 0.40, 70, nil, nil),
	})
	if !a.HumanReviewFlag {
		t.Error("non-binding primary → review")
	}
}

func TestAssess_TightMarginFlagsReview(t *testing.T) {
	a := Assess([]LegalSearchResult{
		ares("CRA", "Art. 13 CRA", "binding_law", 1.00, 100, nil, nil),
		ares("CRA", "Art. 14 CRA", "binding_law", 0.98, 100, nil, nil),
	})
	if a.WinnerMargin >= 0.05 || !a.HumanReviewFlag {
		t.Errorf("tight margin → review: %+v", a)
	}
}

func TestAssess_MarginIsNormLevelNotChunkLevel(t *testing.T) {
	// Two near-identical chunks of the SAME norm at the top, then a distinct norm.
	results := []LegalSearchResult{
		ares("CRA", "Art. 13 CRA", "binding_law", 1.050, 100, []string{"CRA Anhang I"}, nil),
		ares("CRA", "Art. 13 CRA", "binding_law", 1.049, 100, nil, nil), // same norm
		ares("CRA", "Art. 14 CRA", "binding_law", 0.800, 100, nil, nil),
	}
	a := Assess(results)
	if a.WinnerMargin < 0.24 || a.WinnerMargin > 0.26 { // Art.13 vs Art.14, not chunk vs chunk
		t.Errorf("margin must be norm-level (~0.25), got %v", a.WinnerMargin)
	}
	if a.HumanReviewFlag {
		t.Error("healthy norm-level margin → no review")
	}
}

func TestDistinctNorms(t *testing.T) {
	got := distinctNorms([]LegalSearchResult{
		{CitationUnit: "Art. 13 CRA"},
		{CitationUnit: "Art. 13 CRA"}, // duplicate norm → collapsed
		{CitationUnit: "Art. 14 CRA"},
		{CitationUnit: ""}, // no identity → kept
		{CitationUnit: ""}, // no identity → kept
	})
	if len(got) != 4 {
		t.Errorf("want 4 (2 distinct + 2 unidentified), got %d", len(got))
	}
}

func TestDedupStrings(t *testing.T) {
	got := dedupStrings([]string{"a", "b", "", "a"}, []string{"b", "c"}, "self")
	if len(got) != 3 || got[0] != "a" || got[1] != "b" || got[2] != "c" {
		t.Errorf("dedup: %v", got)
	}
	if len(dedupStrings([]string{"self"}, nil, "self")) != 0 {
		t.Error("excluded value must be dropped")
	}
}
@@ -20,6 +20,7 @@ type LegalRAGClient struct {
	httpClient       *http.Client
	textIndexEnsured map[string]bool
	hybridEnabled    bool
	graphEnabled     bool
}

// NewLegalRAGClient creates a new Legal RAG client using Ollama bge-m3 embeddings.
@@ -38,6 +39,11 @@ func NewLegalRAGClient() *LegalRAGClient {
	}

	hybridEnabled := os.Getenv("RAG_HYBRID_SEARCH") != "false"
	// Graph expansion is OPT-IN: no measured ranking benefit over the binding
	// augmentation, +1 Qdrant call per search, and a flooding risk via reverse edges.
	// It stays available as a recall safety net for later gaps (RAG_GRAPH_EXPANSION=true).
	// By default the graph edges are used in the response for reasoning/completeness,
	// not for pool expansion.
	graphEnabled := os.Getenv("RAG_GRAPH_EXPANSION") == "true"

	return &LegalRAGClient{
		qdrantURL: qdrantURL,
@@ -47,6 +53,7 @@ func NewLegalRAGClient() *LegalRAGClient {
		collection:       "bp_compliance_ce",
		textIndexEnsured: make(map[string]bool),
		hybridEnabled:    hybridEnabled,
		graphEnabled:     graphEnabled,
		httpClient: &http.Client{
			Timeout: 60 * time.Second,
		},
@@ -100,6 +107,13 @@ func (c *LegalRAGClient) searchInternal(ctx context.Context, collection string,
		hits = mergeDedupHits(hits, bindingHits)
	}

	// Graph augmentation: pull the norms that the top hits cite (references_out)
	// into the pool via the precise citation edge, e.g. Art. 13 CRA pulls in Anhang I
	// (the actual source of the obligation). Pool augmentation only; re-rank and topK
	// are unchanged.
	if c.graphEnabled {
		hits = c.expandViaGraph(ctx, collection, hits)
	}

	results := make([]LegalSearchResult, len(hits))
	for i, hit := range hits {
		// Legal metadata per rag_reingest_spec.md §2: prefer the normalized fields
@@ -131,6 +145,9 @@ func (c *LegalRAGClient) searchInternal(ctx context.Context, collection string,
			AuthorityWeight: getInt(hit.Payload, "authority_weight"),
			SourceClass:     getString(hit.Payload, "source_class"),
			Jurisdiction:    getString(hit.Payload, "jurisdiction"),
			CitationUnit:    getString(hit.Payload, "citation_unit"),
			ReferencesOut:   getStringSlice(hit.Payload, "references_out"),
			ReferencesIn:    getStringSlice(hit.Payload, "references_in"),
		}
	}

@@ -0,0 +1,162 @@
package ucca

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sort"
)

// Graph-augmented retrieval: when a top hit cites an annex/article (references_out),
// pull that cited norm into the candidate pool via the PRECISE citation graph
// instead of hoping semantic search surfaces it. E.g. a hit on CRA Art. 13 pulls
// in CRA Anhang I (the actual requirement). Reverse edges (references_in) are not
// expanded, see expandViaGraph. Pool-augmentation only: authority re-rank + topK
// slice still apply, so the response schema is unchanged.
const (
	graphSeedCount  = 5    // only the top hits seed the expansion
	graphMaxExpand  = 15   // cap connected norms pulled in (avoid pool explosion)
	graphHopPenalty = 0.05 // a one-hop neighbour ranks just below its seed
)

// expandViaGraph augments hits with the norms that the top seeds cite (forward
// references_out edges only). Best-effort: on any error (or nothing to expand)
// the original hits are returned unchanged.
func (c *LegalRAGClient) expandViaGraph(ctx context.Context, collection string, hits []qdrantSearchHit) []qdrantSearchHit {
	if len(hits) == 0 {
		return hits
	}
	present := make(map[string]bool, len(hits))
	for _, h := range hits {
		if cu := getString(h.Payload, "citation_unit"); cu != "" {
			present[cu] = true
		}
	}

	seeds := hits
	if len(seeds) > graphSeedCount {
		seeds = seeds[:graphSeedCount]
	}
	// Forward edges only (references_out = the detail a hit explicitly points to,
	// e.g. Art. 13 → Anhang I). Reverse (references_in) has high fan-out for popular
	// annexes (Anhang I is cited by 23 articles) → pool flooding; it is surfaced as
	// connected-norm metadata in the Phase 2 response instead of expanding the pool.
	want := make(map[string]float64) // connected citation_unit -> best seeding score
	for _, h := range seeds {
		for _, cu := range getStringSlice(h.Payload, "references_out") {
			if cu == "" || present[cu] {
				continue
			}
			if s, ok := want[cu]; !ok || h.Score > s {
				want[cu] = h.Score
			}
		}
	}
	if len(want) == 0 {
		return hits
	}

	units := topByScore(want, graphMaxExpand)
	fetched, err := c.fetchByCitationUnits(ctx, collection, units)
	if err != nil || len(fetched) == 0 {
		return hits
	}
	neighbours := make([]qdrantSearchHit, 0, len(fetched))
	for cu, pt := range fetched {
		neighbours = append(neighbours, qdrantSearchHit{ID: pt.ID, Score: want[cu] - graphHopPenalty, Payload: pt.Payload})
	}
	return mergeDedupHits(hits, neighbours)
}

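The scoring of pulled-in neighbours can be read as: a cited norm inherits the best score among the seeds that cite it, minus the hop penalty. The sketch below isolates that step under simplifying assumptions: `seed` and `neighbourScores` are hypothetical names, every hit is treated as a seed (no `graphSeedCount` cap), and no Qdrant fetch takes place.

```go
package main

import "fmt"

// hopPenalty uses the same value as graphHopPenalty in the diff.
const hopPenalty = 0.05

// seed is a hypothetical stand-in for a top hit with its forward references.
type seed struct {
	Unit  string
	Score float64
	Out   []string
}

// neighbourScores gives every cited norm that is not already in the pool the
// best score among the seeds citing it, minus the hop penalty.
func neighbourScores(seeds []seed) map[string]float64 {
	present := make(map[string]bool, len(seeds))
	for _, s := range seeds {
		present[s.Unit] = true
	}
	best := make(map[string]float64)
	for _, s := range seeds {
		for _, cu := range s.Out {
			if cu == "" || present[cu] {
				continue // empty reference or norm already in the pool
			}
			if cur, ok := best[cu]; !ok || s.Score > cur {
				best[cu] = s.Score
			}
		}
	}
	for cu := range best {
		best[cu] -= hopPenalty
	}
	return best
}

func main() {
	seeds := []seed{
		{Unit: "Art. 13 CRA", Score: 0.70, Out: []string{"CRA Anhang I", "Art. 14 CRA"}},
		{Unit: "Art. 14 CRA", Score: 0.60, Out: []string{"CRA Anhang I"}},
	}
	for cu, score := range neighbourScores(seeds) {
		fmt.Printf("%s -> %.2f\n", cu, score)
	}
}
```

Here only `CRA Anhang I` is added: `Art. 14 CRA` is already in the pool, and the annex takes the higher seed score (0.70) before the penalty, so it lands at roughly 0.65, just below the seed that cites it.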
// topByScore returns up to n keys with the highest values. Deterministic: ties
// broken by the key string so the cap is stable across runs.
func topByScore(m map[string]float64, n int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if m[keys[i]] != m[keys[j]] {
			return m[keys[i]] > m[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

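The tie-break matters because Go does not guarantee any map iteration order, so a cap applied to keys sorted by score alone could keep a different subset from run to run when scores are equal. A minimal sketch with hypothetical names (`topN`, the sample scores) that repeats the selection and checks that it is stable:

```go
package main

import (
	"fmt"
	"sort"
)

// topN returns up to n keys ordered by descending value. Equal values are
// ordered by key, so the result does not depend on map iteration order.
func topN(m map[string]float64, n int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if m[keys[i]] != m[keys[j]] {
			return m[keys[i]] > m[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

func main() {
	scores := map[string]float64{
		"Art. 14 CRA":  0.6,
		"CRA Anhang I": 0.6,
		"Art. 10 CRA":  0.6,
		"Art. 64 CRA":  0.7,
	}
	first := topN(scores, 2)
	stable := true
	for i := 0; i < 100; i++ {
		again := topN(scores, 2)
		if again[0] != first[0] || again[1] != first[1] {
			stable = false
		}
	}
	fmt.Println(first, stable) // [Art. 64 CRA Art. 10 CRA] true
}
```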
// fetchByCitationUnits loads one representative point (the first one the scroll
// returns) per citation_unit from the given collection.
func (c *LegalRAGClient) fetchByCitationUnits(ctx context.Context, collection string, units []string) (map[string]qdrantScrollPoint, error) {
	should := make([]map[string]interface{}, 0, len(units))
	for _, cu := range units {
		should = append(should, map[string]interface{}{"key": "citation_unit", "match": map[string]interface{}{"value": cu}})
	}
	reqBody := map[string]interface{}{
		"limit":        len(units) * 4,
		"with_payload": true,
		"with_vectors": false,
		"filter":       map[string]interface{}{"should": should},
	}
	jsonBody, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/collections/%s/points/scroll", c.qdrantURL, collection)
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(jsonBody))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if c.qdrantAPIKey != "" {
		req.Header.Set("api-key", c.qdrantAPIKey)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer func() { _ = resp.Body.Close() }()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("qdrant scroll returned %d: %s", resp.StatusCode, string(body))
	}
	var scrollResp qdrantScrollResponse
	if err := json.NewDecoder(resp.Body).Decode(&scrollResp); err != nil {
		return nil, err
	}
	out := make(map[string]qdrantScrollPoint, len(units))
	for _, pt := range scrollResp.Result.Points {
		cu := getString(pt.Payload, "citation_unit")
		if cu != "" {
			if _, seen := out[cu]; !seen {
				out[cu] = pt
			}
		}
	}
	return out, nil
}

// getStringSlice extracts a []string from a Qdrant payload list field
// (references_out / references_in are stored as JSON arrays of strings).
func getStringSlice(m map[string]interface{}, key string) []string {
	v, ok := m[key]
	if !ok {
		return nil
	}
	arr, ok := v.([]interface{})
	if !ok {
		return nil
	}
	out := make([]string, 0, len(arr))
	for _, item := range arr {
		if s, ok := item.(string); ok {
			out = append(out, s)
		}
	}
	return out
}
@@ -0,0 +1,89 @@
package ucca

import (
	"context"
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestGetStringSlice(t *testing.T) {
	m := map[string]interface{}{
		"refs": []interface{}{"a", "b", 3, "c"}, // non-strings are skipped
		"str":  "not-a-list",
	}
	got := getStringSlice(m, "refs")
	if len(got) != 3 || got[0] != "a" || got[2] != "c" {
		t.Errorf("refs: %v", got)
	}
	if getStringSlice(m, "missing") != nil {
		t.Error("missing key should be nil")
	}
	if getStringSlice(m, "str") != nil {
		t.Error("non-list should be nil")
	}
}

func TestTopByScore_DeterministicCap(t *testing.T) {
	m := map[string]float64{"x": 0.5, "y": 0.9, "z": 0.5, "w": 0.7}
	got := topByScore(m, 2)
	if len(got) != 2 || got[0] != "y" || got[1] != "w" {
		t.Errorf("want [y w], got %v", got)
	}
	all := topByScore(m, 10)
	if all[2] != "x" || all[3] != "z" { // tie 0.5 broken by key string
		t.Errorf("tie-break not deterministic: %v", all)
	}
}

func TestExpandViaGraph_NoSeedsOrRefs(t *testing.T) {
	c := &LegalRAGClient{} // nil httpClient → must not be called on these paths
	if out := c.expandViaGraph(context.Background(), "x", nil); out != nil {
		t.Error("empty hits should return nil")
	}
	hits := []qdrantSearchHit{{ID: 1, Score: 0.8, Payload: map[string]interface{}{"citation_unit": "Art. 1 CRA"}}}
	if out := c.expandViaGraph(context.Background(), "x", hits); len(out) != 1 {
		t.Errorf("no references → unchanged, got %d", len(out))
	}
}

func TestExpandViaGraph_PullsConnectedNorm(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		_ = json.NewEncoder(w).Encode(map[string]interface{}{
			"result": map[string]interface{}{
				"points": []map[string]interface{}{
					{"id": 99, "payload": map[string]interface{}{
						"citation_unit": "CRA Anhang I", "chunk_text": "Sicherheitsanforderungen",
						"source_class": "binding_law", "authority_weight": 100, "regulation_short": "CRA",
					}},
				},
				"next_page_offset": nil,
			},
		})
	}))
	defer srv.Close()

	c := &LegalRAGClient{qdrantURL: srv.URL, httpClient: srv.Client()}
	hits := []qdrantSearchHit{
		{ID: 1, Score: 0.70, Payload: map[string]interface{}{
			"citation_unit": "Art. 13 CRA", "references_out": []interface{}{"CRA Anhang I"},
		}},
	}
	out := c.expandViaGraph(context.Background(), "bp_compliance_ce", hits)
	if len(out) != 2 {
		t.Fatalf("want 2 hits (seed + connected annex), got %d", len(out))
	}
	var found *qdrantSearchHit
	for i := range out {
		if getString(out[i].Payload, "citation_unit") == "CRA Anhang I" {
			found = &out[i]
		}
	}
	if found == nil {
		t.Fatal("connected norm CRA Anhang I was not pulled into the pool")
	}
	if found.Score < 0.64 || found.Score > 0.66 { // 0.70 seed − 0.05 hop penalty
		t.Errorf("connected score = %v, want ~0.65", found.Score)
	}
}
@@ -27,6 +27,27 @@ type LegalSearchResult struct {
	AuthorityWeight int    `json:"-"`
	SourceClass     string `json:"-"`
	Jurisdiction    string `json:"-"`

	// Citation graph (Phase 2): internal, feeds only the assessment computation
	// (connected norms, reasoning). The per-result schema stays frozen.
	CitationUnit  string   `json:"-"`
	ReferencesOut []string `json:"-"`
	ReferencesIn  []string `json:"-"`
}

// LegalAssessment is the auditable explanation layer over a ranked result set:
// which norm is primary, which norms connect to it via the citation graph,
// whether the answer crosses regulatory regimes, and whether a human should
// review. Computed from the already-ranked results: it EXPLAINS retrieval, it
// does not change it (graph edges for reasoning/completeness, not pool-expansion).
type LegalAssessment struct {
	PrimaryNorm       string   `json:"primary_norm"`
	PrimaryRegulation string   `json:"primary_regulation"`
	ConnectedNorms    []string `json:"connected_norms"`
	CrossRegime       bool     `json:"cross_regime"`
	HumanReviewFlag   bool     `json:"human_review_flag"`
	WinnerMargin      float64  `json:"winner_margin"`
	ScoreReasoning    string   `json:"score_reasoning"`
}

// LegalContext represents aggregated legal context for an assessment.