Configuration

Chaos Cypher is configured through a YAML settings file and environment variables. Most settings can also be changed from the web UI Settings page at runtime.

Strict configuration validation

Unknown top-level keys in settings.yaml raise ConfigError at startup with a Levenshtein-based suggestion. For example, a typo like embedding_settings: instead of embedding: produces:

ConfigError: Unrecognized top-level setting(s) in /data/settings.yaml:
- embedding_settings (did you mean 'embedding'?)

This prevents misconfigured deployments from silently falling back to defaults.
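
To illustrate how such a suggestion can be computed, here is a minimal sketch of a Levenshtein-based lookup. It is not Chaos Cypher's actual implementation: the `levenshtein` and `suggest` helpers and the `KNOWN_KEYS` list (a subset of the groups named on this page) are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


# Hypothetical subset of valid top-level keys, taken from this page.
KNOWN_KEYS = ["llm", "embedding", "search", "chunking", "queue", "batching"]


def suggest(unknown: str, known=KNOWN_KEYS) -> str:
    """Return the known key with the smallest edit distance to `unknown`."""
    return min(known, key=lambda key: levenshtein(unknown, key))
```

With these assumed keys, `suggest("embedding_settings")` picks `embedding`, matching the hint in the error message above.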

Settings File

The primary configuration file is settings.yaml, located in the data directory. The location depends on how you run Chaos Cypher:

  • Docker (all-in-one): /data/settings.yaml (inside the container, persisted via volume mount)
  • Docker (multi-container): packages/docker/data/settings.yaml (created at first startup, gitignored)
  • Local / CLI: Platform-specific data directory (e.g., ~/.local/share/chaoscypher/settings.yaml on Linux, %LOCALAPPDATA%\chaoscypher\settings.yaml on Windows)

The file is auto-generated with sensible defaults on first startup. You can also configure settings from the web UI Settings page.

Settings follow a nested structure matching the settings groups below. Any setting not specified uses its default value.

# Example settings.yaml
current_database: default
dark_mode: true

llm:
  chat_provider: ollama
  ollama_chat_model: qwen3:30b-instruct

embedding:
  provider: local
  model: Qwen/Qwen3-Embedding-0.6B

search:
  enable_vector_search: true
  min_similarity_threshold: 0.55

chunking:
  small_chunk_size: 900
  small_chunk_overlap: 150
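
The rule that unspecified settings fall back to their defaults can be pictured as a recursive overlay of the user's file on a defaults tree. The sketch below is an assumption about the behavior, not the project's loader; `DEFAULTS` holds only a few values shown on this page and `merge_settings` is a hypothetical helper.

```python
import copy

# Hypothetical defaults; only values that appear on this page are used.
DEFAULTS = {
    "llm": {"chat_provider": "ollama", "ai_temperature": 0.3},
    "chunking": {"small_chunk_size": 900, "small_chunk_overlap": 150},
}


def merge_settings(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested group: merge key by key instead of replacing the group.
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A settings file that only changes the chat provider.
user_settings = {"llm": {"chat_provider": "openai"}}
effective = merge_settings(DEFAULTS, user_settings)
```

Here `effective` keeps `ai_temperature` and the whole `chunking` group from the defaults while `chat_provider` comes from the user's file.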

LLM Configuration

Controls which LLM provider is used for chat and extraction.

Provider Selection

llm:
  chat_provider: ollama # ollama | openai | anthropic | gemini

Ollama (Default)

Ollama is configured exclusively through the ollama_instances list. The backend always seeds a single default instance pointed at the Docker host, so a minimal config only needs the model name:

llm:
  chat_provider: ollama
  ollama_chat_model: qwen3:30b-instruct
  ollama_extraction_model: null # Uses chat model if null

To override the default URL (for example, to reach an Ollama server on another host), edit the seeded instance:

llm:
  chat_provider: ollama
  ollama_chat_model: qwen3:30b-instruct
  ollama_instances:
    - id: default
      name: Default
      base_url: http://my-ollama-host:11434

For multi-GPU setups, add additional instances and the load balancer will distribute requests across them:

llm:
  ollama_instances:
    - id: gpu-1
      name: Primary GPU
      base_url: http://192.168.1.10:11434
    - id: gpu-2
      name: Secondary GPU
      base_url: http://192.168.1.11:11434
  ollama_load_balancing: round_robin # round_robin | least_loaded | random

Each instance is a separate backend; the load balancer selects one per request.
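
As a rough picture of the round_robin strategy, the sketch below cycles through the configured instances in order. The `OllamaInstance` and `RoundRobinBalancer` classes are hypothetical and only mirror the fields shown in the YAML above; the real load balancer may track health or in-flight requests as well.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class OllamaInstance:
    """Mirrors one entry of the ollama_instances list (hypothetical type)."""
    id: str
    name: str
    base_url: str


class RoundRobinBalancer:
    """Hands out instances in a fixed rotating order, one per request."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._rotation = cycle(list(instances))

    def pick(self) -> OllamaInstance:
        return next(self._rotation)


instances = [
    OllamaInstance("gpu-1", "Primary GPU", "http://192.168.1.10:11434"),
    OllamaInstance("gpu-2", "Secondary GPU", "http://192.168.1.11:11434"),
]
balancer = RoundRobinBalancer(instances)
picked = [balancer.pick().id for _ in range(4)]
```

Four consecutive requests alternate between `gpu-1` and `gpu-2`.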

OpenAI

llm:
  chat_provider: openai
  openai_api_key: sk-...
  openai_chat_model: gpt-4.1

Anthropic

llm:
  chat_provider: anthropic
  anthropic_api_key: sk-ant-...
  anthropic_chat_model: claude-sonnet-4-5

Gemini

llm:
  chat_provider: gemini
  gemini_api_key: ...
  gemini_chat_model: gemini-2.5-pro

LLM Behavior

llm:
  ai_temperature: 0.3 # Chat temperature (0.0-1.0)
  extraction_temperature: 0.1 # Extraction temperature (lower = more deterministic)
  ai_max_tokens: 65536 # Max output tokens
  ai_context_window: 8192 # Context window size

Chunking

Controls how documents are split into chunks for indexing and extraction.

Three knobs cover most tuning needs:

chunking:
  small_chunk_size: 900 # Target chunk size in characters (~225 tokens)
  small_chunk_overlap: 150 # Overlap between consecutive chunks (~16%)
  group_size: 4 # Chunks per extraction group sent to the LLM

| Setting | Default | Description |
| --- | --- | --- |
| small_chunk_size | 900 | Target size of each chunk in characters. Larger chunks give the LLM more context per call; smaller chunks improve RAG retrieval precision. |
| small_chunk_overlap | 150 | Characters of overlap between consecutive chunks (~16%). Prevents entities from being split across chunk boundaries. |
| group_size | 4 | How many small chunks are grouped together for a single LLM extraction call. Higher = more context per call (better relationship discovery); lower = faster and cheaper. |

Advanced chunking knobs

chunking:
  min_chunk_size: 500 # Don't create chunks smaller than this
  max_chunk_size: 1100 # Hard upper limit per chunk
  respect_boundaries: true # Break at sentence/paragraph boundaries
  group_overlap: 1 # Overlap between consecutive groups (sliding window)

| Setting | Default | Description |
| --- | --- | --- |
| min_chunk_size | 500 | Minimum chunk size. Chunks smaller than this are merged with the next chunk. Prevents tiny trailing chunks that add noise. |
| max_chunk_size | 1100 | Hard upper limit. Chunks are split before this size even if it would break a sentence. |
| respect_boundaries | true | When true, the chunker tries to break at sentence or paragraph boundaries rather than mid-word. Recommended. |
| group_overlap | 1 | Groups use a sliding window with this many chunks of overlap. 0 = non-overlapping groups; 1 (default) = each group shares one chunk with the previous. |
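
To make the interaction between chunk overlap and group overlap concrete, the following sketch computes chunk offsets and group membership with the default values. It is a simplification, not the real chunker: it ignores respect_boundaries, min_chunk_size, and max_chunk_size, and the function names are hypothetical.

```python
def chunk_spans(text_len: int, size: int = 900, overlap: int = 150):
    """Return (start, end) character offsets for fixed-size overlapping chunks."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    spans = []
    start = 0
    while start < text_len:
        end = min(start + size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start += step  # each chunk starts `step` characters after the last
    return spans


def group_indices(n_chunks: int, group_size: int = 4, group_overlap: int = 1):
    """Return lists of chunk indices, one list per extraction group."""
    step = group_size - group_overlap
    if step <= 0:
        raise ValueError("group_overlap must be smaller than group_size")
    groups = []
    start = 0
    while start < n_chunks:
        end = min(start + group_size, n_chunks)
        groups.append(list(range(start, end)))
        if end == n_chunks:
            break
        start += step  # sliding window over chunk indices
    return groups
```

Under these assumptions a 3000-character document yields four chunks starting at offsets 0, 750, 1500, and 2250, and seven chunks form two groups that share chunk 3.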

Embeddings

By default, embeddings are generated locally on the CPU using sentence-transformers, requiring no API keys or network access. Alternative providers (Ollama, OpenAI, Gemini) can be configured for remote or cloud-based embedding.

embedding:
  provider: local # local | ollama | openai | gemini
  model: Qwen/Qwen3-Embedding-0.6B # HuggingFace model ID (local) or provider model name
  api_key: null # For cloud providers
  api_base: null # Custom endpoint override
  ollama_instance_id: default # Ollama instance for embedding
  max_text_length: 16000 # Max characters before truncation

| Setting | Default | Description |
| --- | --- | --- |
| provider | local | Embedding provider: local, ollama, openai, or gemini |
| model | Qwen/Qwen3-Embedding-0.6B | HuggingFace model ID (local) or provider model name |
| api_key | null | API key for cloud providers |
| api_base | null | Custom endpoint override |
| ollama_instance_id | default | Ollama instance to use for embedding |
| max_text_length | 16000 | Max characters before truncation |

The model downloads automatically on first use and is cached at data/models/embeddings/. Subsequent runs load from cache.

search:
  enable_vector_search: true # Enable semantic search
  vector_dimensions: 1024 # Embedding dimensions
  min_similarity_threshold: 0.55 # Minimum similarity for results
  max_search_results: 100 # Maximum results returned
  enable_rerank: true # Re-rank results for relevance
  rerank_model_name: Alibaba-NLP/gte-reranker-modernbert-base
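
The effect of min_similarity_threshold and max_search_results can be sketched as a filter-then-truncate step over scored candidates. The example assumes cosine similarity as the score and uses hypothetical helper names; the actual search pipeline (including re-ranking) is not shown.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_hits(query_vec, candidates, min_similarity=0.55, max_results=100):
    """Score candidates, drop those below the threshold, keep the best N."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates.items()]
    kept = [hit for hit in scored if hit[1] >= min_similarity]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:max_results]


# Toy 2-dimensional vectors; real embeddings have vector_dimensions entries.
candidates = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [1.0, 2.0]}
hits = filter_hits([1.0, 0.0], candidates)
```

Raising the threshold trades recall for precision: in this toy data only `a` and `b` clear 0.55.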

Source Processing

source_processing:
  auto_extract_entities: true # Auto-start extraction after indexing
  source_processing_analysis_depth: full # full | quick
  entity_deduplication_mode: semantic # exact | semantic
  entity_deduplication_similarity_threshold: 0.90
  relationship_confidence_threshold: 0.5
  # Per-source extraction quality overrides (max_relationship_ratio, etc.)

Filtering-mode knobs

The filtering mode you pick at upload time selects a preset bundle. The three knobs that actually move with the slider are wired through to the extraction pipeline:

| Knob | Range | Effect |
| --- | --- | --- |
| loop_max_entity_count | 25–200 | Aborts a chunk whose LLM stream emits more entity lines than the cap. Catches degenerate loops earlier in stricter modes. |
| semantic_dedup_threshold | 0.85–0.99 | Cosine-similarity bar for merging two entities semantically. Lower = more aggressive merging. |
| minimum_alias_length | 1–3 | Drops short aliases (AI, ML) in stricter modes to keep the alias index focused on full names. |

These were defined on the settings model for some time but were silently ignored pre-W4. As of May 2026 every preset's value reaches the extraction pipeline, so changing the filtering mode produces distinguishable extraction results.

You'll usually pick a filtering mode rather than override these knobs individually — see the Filtering Modes reference for the full preset matrix.

Deprecated: source_processing_max_file_size_gb

source_processing.source_processing_max_file_size_gb was the legacy file-upload cap. As of 2026-05-06 it is deprecated and no longer honored — the upload pipeline (file uploads, URL fetches, MCP) now reads batching.max_upload_bytes exclusively (see below). Remove the deprecated key from your settings.yaml to silence the startup warning; the cap is now uniform across entry paths.

Upload Limits

There are three upload caps, and the request is rejected by whichever fires first:

  • batching.max_upload_bytes (default 500 MB) — application-layer cap enforced during streaming for both file uploads (POST /sources) and URL fetches (POST /sources/url). This is the single source of truth for "how big can one upload be."
  • batching.max_request_body_mb (default 10240 MB / 10 GB) — outer HTTP request body limit; covers multipart overhead and metadata. Should be at or above max_upload_bytes.
  • nginx client_max_body_size (default 10g in nginx-http.conf / nginx-https.conf) — must also be at or above the application limit.

To raise the upload size, increase all three (the request is rejected by whichever layer has the lower limit):

1. Application layer (settings.yaml):

batching:
  max_upload_bytes: 524288000 # 500 MB (in bytes); unified file + URL cap
  max_request_body_mb: 10240 # 10 GB (in MB); outer HTTP body limit
  max_upload_files: 20 # Max files per batch upload

2. Nginx layer (Docker only — nginx-http.conf and nginx-https.conf):

client_max_body_size 10g;

Both layers must match

The request is rejected by whichever layer has the lower limit. If you increase the application limit but not Nginx, uploads will still fail at the Nginx layer.
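
One way to check which limit will actually apply is to convert all three values to bytes and take the minimum. The sketch below assumes that max_request_body_mb is counted in units of 1024 × 1024 bytes (consistent with 10240 MB being described as 10 GB) and that the Nginx value uses a k, m, or g suffix; it does not handle Nginx's special value 0, which disables the check. Both helper functions are hypothetical.

```python
NGINX_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_nginx_size(value: str) -> int:
    """Convert an Nginx size such as '10g' or '512m' to bytes."""
    value = value.strip().lower()
    if value and value[-1] in NGINX_UNITS:
        return int(value[:-1]) * NGINX_UNITS[value[-1]]
    return int(value)  # plain number: already in bytes


def effective_upload_limit(max_upload_bytes: int,
                           max_request_body_mb: int,
                           client_max_body_size: str):
    """Return (name, bytes) of the limit that rejects a request first."""
    limits = {
        "batching.max_upload_bytes": max_upload_bytes,
        "batching.max_request_body_mb": max_request_body_mb * 1024 ** 2,
        "nginx client_max_body_size": parse_nginx_size(client_max_body_size),
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


# Defaults from this page: the application cap is the binding limit.
default_limit = effective_upload_limit(524288000, 10240, "10g")
```

With the defaults the binding limit is batching.max_upload_bytes; if you raise it to 2 GB but set Nginx to `1g`, the Nginx value becomes the effective cap.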

Queue

Valkey connection for the background job queue.

queue:
  queue_host: valkey
  queue_port: 6379
  queue_database: 0

Logging

Runtime Log Level

The log level can be changed at runtime from the web UI or API — no restart required. The change propagates to all processes (Cortex and Neuron) via Valkey pub/sub.

Open Settings > General and select a log level from the dropdown.

Available levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
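
On the receiving side, each process has to validate the level name and apply it to its loggers. The sketch below shows that step only, using Python's standard logging module; the Valkey pub/sub subscription is omitted and `apply_log_level` is a hypothetical helper, not the project's handler.

```python
import logging
from typing import Optional

ALLOWED_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def apply_log_level(name: str, logger: Optional[logging.Logger] = None) -> int:
    """Validate a level name and set it on the given (or root) logger."""
    level_name = name.strip().upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {name!r}")
    level = getattr(logging, level_name)  # e.g. logging.DEBUG == 10
    target = logger if logger is not None else logging.getLogger()
    target.setLevel(level)
    return level
```

Rejecting unknown names up front keeps a malformed message from silently changing the logger to an unintended level.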

Container Logs

In the all-in-one container, the Logs tab (Settings page) shows real-time merged logs from all services — Cortex, Neuron, Nginx, and Valkey — with color-coded rendering and service labels.

Diagnostics Export

Export a diagnostic bundle for troubleshooting:

Open Settings and click Export Diagnostics.

The ZIP file includes system info, database statistics, sanitized settings (secrets masked), log files, queue stats, and service status.

Environment Variables

These environment variables are used by the Docker services:

Application:

| Variable | Default | Description |
| --- | --- | --- |
| PYTHONUNBUFFERED | 1 | Ensure proper log output |
| PYTHONPATH | /app | Module resolution |
| QUEUE_HOST | valkey | Valkey hostname |
| QUEUE_PORT | 6379 | Valkey port |
| QUEUE_DB | 0 | Valkey database number |
| QUEUE_PASSWORD | Auto-generated | Valkey password (auto-generated on first container start, stored in /data/.credentials) |
| VITE_API_URL | http://cortex:8080 | Frontend API target |
| USE_JSON_LOGGING | false | JSON logs for production |
| LOG_LEVEL | INFO | Logging level |
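
For scripts that need to reach the same queue, the QUEUE_* variables can be combined into a single connection URL. This sketch assumes a `redis://` URL, which Redis-protocol clients generally accept and Valkey is compatible with; `queue_url` is a hypothetical helper and the services themselves may assemble the connection differently.

```python
import os
from urllib.parse import quote


def queue_url(env=os.environ) -> str:
    """Build a connection URL from QUEUE_* variables, using the documented defaults."""
    host = env.get("QUEUE_HOST", "valkey")
    port = int(env.get("QUEUE_PORT", "6379"))
    db = int(env.get("QUEUE_DB", "0"))
    password = env.get("QUEUE_PASSWORD", "")
    # Percent-encode the password so characters like '@' or '/' stay unambiguous.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"
```

With no variables set this yields `redis://valkey:6379/0`.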

Infrastructure (all-in-one container only):

| Variable | Default | Description |
| --- | --- | --- |
| NGINX_LOGLEVEL | warn | Nginx error log level |
| NGINX_ACCESS_LOG | off | Nginx access log on/off |
| NGINX_RATE_LIMITING | on | Nginx rate limiting on/off |
| VALKEY_LOGLEVEL | warning | Valkey log level |
| SUPERVISOR_LOGLEVEL | warn | Supervisord log level |

Rate Limiting

Rate limiting is enabled by default. To disable it for local/single-user development, set NGINX_RATE_LIMITING=off in your docker-compose.yml or .env file:

environment:
  - NGINX_RATE_LIMITING=off

When enabled, the following rate limits apply:

| Zone | Path | Rate | Burst |
| --- | --- | --- | --- |
| Auth | /api/v1/auth/ | 5 r/s | 3 |
| Uploads | /api/v1/sources | 10 r/s | 5 |
| General API | / (catch-all) | 100 r/s | 50 |
| Static assets | /assets/ | No limit | |

Rate limits are applied per client IP. Keep rate limiting enabled if you expose Chaos Cypher to the internet or untrusted networks. A container restart is required after changing this setting.
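
To see which zone a given request path would fall into, the table can be read as a longest-prefix match. The sketch below assumes the paths are plain prefix locations; the real matching is done by Nginx and may differ if the shipped configuration uses exact or regex locations. `ZONES` and `zone_for` are names made up for this example.

```python
# (path prefix, zone name, requests per second, burst); None means no limit.
ZONES = [
    ("/api/v1/auth/", "Auth", 5, 3),
    ("/api/v1/sources", "Uploads", 10, 5),
    ("/assets/", "Static assets", None, None),
    ("/", "General API", 100, 50),
]


def zone_for(path: str):
    """Return the zone whose prefix is the longest match for `path`."""
    matches = [zone for zone in ZONES if path.startswith(zone[0])]
    if not matches:
        raise ValueError("path must start with '/'")
    return max(matches, key=lambda zone: len(zone[0]))
```

For example, a login request under `/api/v1/auth/` is limited by the Auth zone, while any API path without a more specific prefix falls through to General API.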

Worker Configuration

Workers can be configured separately via workers.yaml (optional, requires restart):

llm_worker:
  max_concurrent: 1 # Concurrent LLM jobs
  max_tries: 5 # Max attempts for LLM jobs
  timeout: 600 # Job timeout in seconds

operations_worker:
  max_concurrent: 8 # Concurrent operation jobs
  max_tries: 5 # Max attempts for operations
  timeout: 7200 # Job timeout in seconds

MCP Server

Settings for the built-in Model Context Protocol server that enables AI assistant integration.

mcp:
  mode: read # "read" or "write"
  auto_extract: false # Run entity extraction after document upload

| Setting | Default | Description |
| --- | --- | --- |
| mode | read | Tool access level. read exposes 19 search/query tools. write exposes all 30 tools, adding 11 for create, update, delete, and document upload. |
| auto_extract | false | Automatically run entity extraction after indexing documents uploaded via MCP. |

See MCP Server for setup and usage details.

HTTPS / TLS

The all-in-one container auto-detects TLS certificates and switches to HTTPS:

  1. Place your certificate and key at /data/certs/server.crt and /data/certs/server.key
  2. Restart the container
  3. Nginx automatically enables HTTPS with HTTP→HTTPS redirect

The container checks for certificates on every startup — no configuration flag needed. Remove the cert files to revert to HTTP.
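
The startup check can be pictured as a simple file-existence test. This sketch assumes that both files must be present for HTTPS to be enabled; the container's actual startup script is not shown on this page and `detect_scheme` is a hypothetical helper.

```python
from pathlib import Path


def detect_scheme(cert_dir: str = "/data/certs") -> str:
    """Return 'https' when both server.crt and server.key exist, else 'http'."""
    cert = Path(cert_dir) / "server.crt"
    key = Path(cert_dir) / "server.key"
    return "https" if cert.is_file() and key.is_file() else "http"
```

Running the same check outside the container (pointing `cert_dir` at the mounted data volume) is a quick way to confirm the files are where the container expects them before restarting.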

All Settings Groups

For a complete reference of all available settings and their defaults, see the settings class definitions in packages/cortex/src/chaoscypher_cortex/shared/config/__init__.py.

| Group | Key Settings |
| --- | --- |
| llm | Provider selection, model names, API keys, temperature, token limits |
| mcp | MCP server mode and document processing behavior |
| queue | Valkey connection details |
| chunking | Chunk sizes, overlap, boundary handling |
| search | Vector search, re-ranking, similarity thresholds |
| embedding | Embedding provider, model, dimensions, max text length |
| source_processing | Extraction behavior, deduplication, quality controls |
| export | Package metadata for CCX exports |
| lexicon | Lexicon Hub connection settings |
| paths | Data directory structure |
| timeouts | API, worker, and health check timeouts |
| ports | Service ports |
| batching | Upload limits, embedding batches, processing batches |
| pagination | Page sizes and limits |
| cors | Cross-origin request settings |
| auth | Authentication (enabled by default) |
| retries | Retry counts for various operations |
| backoff | Exponential backoff configuration |
| analysis | Graph analysis settings |
| chat_context | Chat context window and history limits |
| services | External service URLs |
| workers | Worker concurrency and timeout defaults |

Security defaults

By default, Cortex binds to 0.0.0.0. Read the self-hosted threat model before exposing the service beyond loopback.

See also

  • API reference: Settings — read and update configuration at runtime via the REST API; VRAM presets, Ollama model management, and reset operations