Changelog
Recent Changes
May 2026
- Chunker coalesces short chunks — The min_chunk_size filter no longer drops sub-threshold chunks. It now coalesces them with a neighbor (merging into the next chunk that lifts the combination over the threshold) so natural-prose imports (dialogue, transitions, short paragraphs) keep all content reaching extraction. Fixes a W5 data-loss regression observed on war_and_peace.txt, where 80 chunks of real Tolstoy prose were being silently discarded. Default min_chunk_size lowered from 500 to 100 to keep the merging gentle on natural prose. The chunks_filtered_count counter (and the "Chunks coalesced" tile on the Data Quality tab) now records merge events, not drops.
- Upload-settings persistence — Every choice you make at upload time (auto_analyze, enable_normalization, enable_vision, content_filtering, filtering_mode) is now a real column on the source row. Recovery, retry, and re-extract reuse what you set without you having to re-pass it.
- Data Quality tab + 15 quality counters — Every silent-drop site in the pipeline (loader / cleaner / chunking / LLM / post-extraction / commit) now increments a typed counter on the source row. The Data Quality tab on the source detail page surfaces all 15 counters with plain-English explanations, distinct from the existing Quality grade. Counters reset on Re-extract so you can compare runs.
- Filtering modes 0–5 redesigned — The slider now produces distinct results at every level. Three previously-dead settings (loop_max_entity_count, semantic_dedup_threshold, minimum_alias_length) are now wired so each preset (unfiltered / minimal / lenient / balanced / strict / maximum) tunes the pipeline differently. Plain-English documentation at Filtering Modes.
- Production extraction parity — Cortex, the standalone CLI, and the MCP path all share one post-extraction helper (apply_structural_and_normalization). The same source produces the same graph regardless of which entry point ran the extraction.
- Normalization & chunking honesty — Operator's NormalizerSettings now actually reach the cleaners (previously they were silently ignored). min_chunk_size / max_chunk_size / respect_boundaries are wired through to the splitter. Zero-chunk sources raise ValidationError with an actionable hint instead of committing silently. The OCR cleaner is scoped to OCR-derived content via applies_to(metadata) so short identifiers like git / npm / K8s survive on plain text and HTML.
- Loader correctness — Shared detect_encoding() helper for all text-shaped loaders (UTF-8 strict → cp1252 strict → charset-normalizer → Latin-1, no silent errors="replace"). JSONL is parsed line-by-line with per-line error isolation. CSV uses a dialect sniffer. Scanned PDFs get specific errors. application/octet-stream removed from the default upload allowlist (operators who need it can opt back in).
- 6 new built-in loaders — HTML, RST, DOCX, XLSX, PPTX, EPUB. The EPUB loader is hand-rolled to avoid taking on an AGPL ebooklib dependency.
- LLM observability — finish_reason is populated by all 4 providers (Ollama, OpenAI, Anthropic, Gemini) and normalized to a stable vocabulary (stop / length / content_filter / tool_calls / error / unknown). The streaming line-buffer flushes the trailing partial line so the last entity isn't silently dropped. Chunk-level finish_reason and aborted_by_loop surface on the extraction-task API (migration 0022); chunk truncation and abort counters surface on the source row.
- Upload contract hardening — URL fetcher validates the upstream Content-Type against the allowlist, honors any charset=… parameter, and routes binary responses through the binary loader path. CLI fully matches the API contract: --vision / --no-vision, --content-filtering / --no-content-filtering, --normalize / --no-normalize, --filtering-mode, --skip-duplicates.
- Vector search visibility — New vector_indexing_status field with four states (pending, indexed, degraded, failed). New SearchStatusBadge UI component on the source list and detail page. The orphan-sweep worker drives the degraded → indexed retry and the degraded → failed retry-exhaustion transitions.
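
The chunk-coalescing behavior from the first May 2026 entry can be illustrated with a minimal sketch. This is not the project's chunker: the function name, the separator, and the handling of a trailing short chunk (folded back into the previous chunk) are assumptions made for illustration. Only the idea of merging a sub-threshold chunk forward until the combination clears min_chunk_size, and of counting merge events instead of drops, comes from the entry itself.

```python
from typing import List, Tuple


def coalesce_short_chunks(
    chunks: List[str],
    min_chunk_size: int = 100,  # default named in the changelog entry
    sep: str = "\n\n",          # hypothetical separator between merged pieces
) -> Tuple[List[str], int]:
    """Merge sub-threshold chunks forward instead of dropping them.

    Returns the merged chunks and the number of merge events.
    """
    merged: List[str] = []
    pending = ""       # short text still waiting for a neighbor
    merge_events = 0

    for chunk in chunks:
        if pending:
            # Fold the carried-over short text into this chunk.
            chunk = pending + sep + chunk
            merge_events += 1
            pending = ""
        if len(chunk) < min_chunk_size:
            pending = chunk        # still too short: keep carrying it forward
        else:
            merged.append(chunk)

    if pending:
        if merged:
            # No next neighbor exists, so fold into the previous chunk (assumption).
            merged[-1] = merged[-1] + sep + pending
            merge_events += 1
        else:
            # The whole input was below the threshold; keep it rather than lose it.
            merged.append(pending)

    return merged, merge_events
```

Whatever the threshold, every input character ends up in some output chunk, which is the property the regression fix is about.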
April 2026
- Content Filtering — Pre-extraction content filtering removes non-essential content (table of contents, changelogs, legal boilerplate, etc.) before entity extraction while keeping it searchable via RAG. 15 built-in categories with domain-specific exclusion rules. Enabled by default on upload, configurable per source.
- Domain Extraction Limits — Each extraction domain now defines hard caps on entity degree, same-pair relationships, total relationship ratio, and per-chunk entity count. Prevents runaway LLM generation and controls graph density per domain. Includes orphan protection to ensure isolated entities keep at least one connection.
- Container Logs & Diagnostics — Logs tab in the web UI with real-time merged logs from all services (Cortex, Neuron, Nginx, Valkey), color-coded rendering, and runtime log level selector with cross-process hot-reload via Valkey pub/sub. Diagnostic export bundles system info, database stats, sanitized settings, logs, queue stats, and service status into a ZIP file.
- Queue Cancellation — Running tasks can now be cancelled, not just queued tasks. Uses a Valkey flag that workers check between processing batches. UI updates immediately while the handler gracefully exits.
- Docker Startup Page — Friendly branded page shown instead of raw 502/503 errors while services start. Shows component health status, live log viewer with colored rendering, and auto-redirects when the app becomes ready.
- Docker Error Pages — Custom branded error pages for all common HTTP error codes (400, 403, 404, 408, 413, 429, 500, 504) with contextual messages and pre-filled GitHub issue templates.
- Security Hardening — Comprehensive security audit with SSRF protection, request body size limits, error message sanitization across all endpoints, CSP headers, exception type leak prevention, and temp file suffix sanitization.
- UI Redesign — Cyberpunk-themed interface overhaul with neon palette, glass effects, ghost components, constellation loading animation, immersive dashboard with ambient graph, omnibar command terminal, frosted glass sidebar, and graph visualization improvements (glow sprites, colored edges, mindmap layout).
- Alembic Migration Framework — Every schema change (columns, tables, constraints) now ships as an Alembic migration file in packages/core/src/chaoscypher_core/database/migrations/versions/. Cortex runs alembic upgrade head on startup to apply pending migrations. Replaces the earlier reflective auto-migrator, which was retired in April 2026 because it couldn't cover constraint / FK changes and made schema evolution inscrutable. An autogenerate-diff test in CI catches SQLModel changes that lack a matching migration.
- v7 Extraction Quality Scoring — Re-weighted grade formula (R 50% / E 35% / T 15%), bell-shaped density score so over-dense graphs are penalized (stops models padding edges for score), and a new structural penalty combining hub-skew and reciprocal-rate signals that catch a single entity being over-connected or the same relationship emitted in both directions.
- MCP Client-Driven Extraction — MCP server defaults to client-driven extraction with no server LLM required. Fixes for anyio deadlocks, status propagation, and processor queue bypass.
- CLI Embedding Config — The chaoscypher setup wizard now configures embedding providers with auto-default to Ollama.
- Valkey AOF Repair — All-in-one container automatically validates and repairs corrupted Valkey AOF files on startup. Falls back to clean slate if repair fails. Queue data is transient, so no permanent data is lost.
- Batch Embedding Processing — Concurrent embedding generation with per-chunk progress reporting and configurable batch sizes.
- Codebase Refactoring — Settings consolidation (deduplicated ChunkingSettings, EmbeddingSettings, MCPSettings, PathSettings into core), SourceStatus enum replacing raw strings, cross-package name collision fixes, 338 ruff + 239 mypy error resolutions.
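
The cooperative cancellation described under Queue Cancellation can be sketched as follows. This is an illustrative sketch, not the project's worker code: the flag store here is an in-memory stand-in for Valkey, and the key format, class, and function names are all hypothetical. The only behavior taken from the entry is that a worker checks a cancel flag between batches and exits gracefully when it is set.

```python
from typing import Callable, Iterable, List, Set


class InMemoryFlagStore:
    """In-memory stand-in for Valkey; a real worker would query the server."""

    def __init__(self) -> None:
        self._flags: Set[str] = set()

    def set_flag(self, key: str) -> None:
        self._flags.add(key)

    def is_set(self, key: str) -> bool:
        return key in self._flags


def cancel_key(task_id: str) -> str:
    # Hypothetical key layout, chosen only for this example.
    return f"task:{task_id}:cancel"


def run_task(
    task_id: str,
    batches: Iterable[List[int]],
    process_batch: Callable[[List[int]], None],
    flags: InMemoryFlagStore,
) -> str:
    """Process batches in order, checking the cancel flag before each one."""
    for batch in batches:
        if flags.is_set(cancel_key(task_id)):
            # Stop at a batch boundary so no batch is left half-processed.
            return "cancelled"
        process_batch(batch)
    return "completed"
```

Checking only at batch boundaries means a cancel request takes effect after the current batch finishes, which matches the entry's note that the UI updates immediately while the handler exits gracefully.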
March 2026
- MCP Server — Built-in Model Context Protocol server with 30 tools for AI assistants (Claude Desktop, Cursor, ChatGPT). Supports stdio transport (CLI) and Streamable HTTP (Cortex API). Read-only by default with optional write mode.
- Authentication System — Optional auth with setup wizard, login, user management, API keys, and TLS support.
- Template Visual Identity — Templates now support icon and color fields for visual identification across the graph, search results, and extraction views.
- Vision Processing — Optional vision model support for extracting content from images in PDFs and standalone image files. Includes image gallery on source detail pages.
- Embedding Provider System — Multi-provider embedding support (local CPU, Ollama, OpenAI, Gemini) with configurable model and provider settings.
- Search Index Rebuild — Rebuild search indexes from Settings UI or CLI (chaoscypher source rebuild-search), with auto-detection of embedding model changes.
- System Health Monitoring — Consolidated health check endpoint and UI status dropdown with subsystem diagnostics.
- Ollama Model Management — Pull, remove, and inspect Ollama models directly from the Settings UI.
- Settings Restructure — Settings reorganized into five tabs: General, Models, Search, Access, and Maintenance.
- Local CPU Embedding Service — Dedicated embedding pipeline using sentence-transformers (Qwen/Qwen3-Embedding-0.6B). No API keys or external services required for the default local mode.
- GraphRAG Search — Graph-enhanced retrieval that fuses knowledge graph traversal with vector search. Uses entity extraction from queries, Personalized PageRank, and Reciprocal Rank Fusion to answer multi-hop questions that pure vector RAG misses.
- DX Zero-Boilerplate Audit — Typed Pydantic return models for all Engine public methods, ChaosCypher convenience namespace, check_health() API, and documentation restructuring.
- Documentation site — MkDocs Material documentation site with landing page, user guide, API reference, CLI reference, architecture docs, and development guide.
- Workflow execution engine — LangGraph-based workflow orchestrator with step execution and state management.
- Visual workflow builder — ReactFlow-based drag-and-drop UI for designing workflows.
- Compose CLI commands — chaoscypher compose build/up/down/run for composition management.
- ADR-0001: Remove Discovery and Lenses — Removed discovery sessions and lenses features per architectural decision.
- ADR-0003: PyMuPDF replacement — Replaced PyMuPDF with alternative PDF processing.
- Scoped chat — Chat conversations can be scoped to specific sources or tags for focused AI interaction.
- Tag system redesign — Inline tag editor with tags displayed in the sources list.
- Source scope enforcement — All graph tools respect source scope filtering.
- Production readiness — Lint cleanup and production configuration fixes.
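
The Reciprocal Rank Fusion step named in the GraphRAG Search entry is a standard technique, and a minimal version looks like the sketch below. The function name, the input shape (one ranked list of IDs per retriever), and the constant k = 60 (the value commonly used in the RRF literature) are assumptions; the changelog does not say how the project configures or calls it.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one using Reciprocal Rank Fusion.

    Each item scores the sum of 1 / (k + rank) over the lists it appears in,
    with ranks starting at 1. Higher fused scores come first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Break ties on the ID so the output order is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Example: one list from graph traversal, one from vector search (made-up IDs).
graph_hits = ["c1", "c2", "c3"]
vector_hits = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([graph_hits, vector_hits]))
# → ['c1', 'c3', 'c2', 'c4']
```

Because RRF uses only ranks, it can combine graph-traversal and vector-search results without their raw scores needing to be on a comparable scale.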
Earlier
For detailed release notes, see the public package repositories and the project discussions.
This changelog covers notable feature additions and changes. For detailed technical changes, refer to individual commit messages in the repository.