feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""Crawl the OSHA Technical Manual: all chapters as HTML."""

import json
import logging
import time
from pathlib import Path

from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("osha-crawl")

OUTPUT_DIR = Path(__file__).parent / "otm_chapters"
BASE = "https://www.osha.gov"


def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    registry = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Step 1: Get all chapter URLs
        page.goto(f"{BASE}/otm", timeout=30000)
        time.sleep(5)

        links = page.query_selector_all('a[href*="/otm/"]')
        chapters = []
        seen = set()
        for link in links:
            href = link.get_attribute("href") or ""
            text = (link.inner_text() or "").strip()
            if href and "chapter" in href and href not in seen and text:
                seen.add(href)
                chapters.append({"url": href, "title": text})

        logger.info("Found %d chapters", len(chapters))

        # Step 2: Download each chapter
        for i, ch in enumerate(chapters):
            url = ch["url"] if ch["url"].startswith("http") else BASE + ch["url"]
            slug = ch["url"].replace("/otm/", "").replace("/", "_")
            outfile = OUTPUT_DIR / f"{slug}.html"

            logger.info("[%d/%d] %s", i + 1, len(chapters), ch["title"][:60])

            if outfile.exists():
                logger.info(" Already exists, skipping")
                ch["local_path"] = str(outfile)
                registry.append(ch)
                continue

            try:
                page.goto(url, timeout=30000)
                time.sleep(3)
                content = page.content()
                outfile.write_text(content, encoding="utf-8")  # explicit encoding, independent of platform default
                ch["local_path"] = str(outfile)
                logger.info(" Saved: %s (%.1f KB)", outfile.name, len(content) / 1024)
            except Exception as e:
                logger.error(" Failed: %s", e)
                ch["local_path"] = None

            registry.append(ch)
            time.sleep(1)

        browser.close()

    reg_file = Path(__file__).parent / "otm_registry.json"
    reg_file.write_text(json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8")
    ok = sum(1 for r in registry if r.get("local_path"))
    logger.info("Done: %d/%d chapters saved", ok, len(registry))


if __name__ == "__main__":
    main()
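A note on the slug derivation in Step 2: `ch["url"].replace("/otm/", "").replace("/", "_")` works for site-relative hrefs, but an absolute href would leave the scheme and host in the filename, and an href carrying a `#fragment` would produce a second file for the same page. Below is a minimal sketch of one way to normalize hrefs with the standard library before building `url` and `slug`. The helper names `absolute_url` and `slug_for` and the example hrefs are hypothetical and not part of this commit; whether the live /otm page actually emits absolute hrefs or fragments is an assumption that has not been verified here.

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.osha.gov"  # same constant as in the crawler


def absolute_url(href: str, base: str = BASE) -> str:
    """Resolve a relative or absolute href against the site root."""
    # urljoin returns absolute hrefs unchanged and resolves relative ones.
    return urljoin(base + "/", href)


def slug_for(href: str) -> str:
    """Build a filename slug from the URL path only (no host, query, fragment)."""
    parts = [p for p in urlparse(href).path.split("/") if p]
    # Drop the leading "otm" segment so slugs match the relative-href case.
    if parts and parts[0] == "otm":
        parts = parts[1:]
    return "_".join(parts) or "index"


if __name__ == "__main__":
    # Hypothetical hrefs, for illustration only.
    for href in ("/otm/section-2/chapter-1",
                 "https://www.osha.gov/otm/section-2/chapter-1#intro"):
        print(absolute_url(href), slug_for(href))
```

With helpers like these, relative and absolute forms of the same chapter link map to one output file. Deduplicating `seen` on the slug rather than the raw href would also avoid fetching the same page twice.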