Sources

Sources are the foundation of Chaos Cypher. A source is any document — PDF, Word file, web page, image, audio, or video — that you import into the system for indexing, search, and knowledge extraction.

Uploading Sources

Single File Upload

  1. Navigate to Sources in the sidebar
  2. Click Upload
  3. Select a file or drag and drop it into the upload area
  4. The file enters the processing pipeline automatically

Screenshot: Sources page with upload dialog

Batch Upload

  1. Click Upload on the Sources page
  2. Select multiple files (up to 20 per batch) or drag and drop them

Screenshot: Add Source dialog with drag-and-drop area

URL Import

  1. Click Import URL on the Sources page
  2. Paste the URL of a web page
  3. The system fetches the page, extracts clean text content, and processes it like any other source

URLs must use HTTP or HTTPS, and the extracted content must be at least 50 characters long.
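
The two URL constraints can be checked on the client before submitting an import. The sketch below is illustrative only: the function names are hypothetical, stripping whitespace before counting is an assumption, and the server applies its own validation after fetching and cleaning the page.

```python
from urllib.parse import urlparse

MIN_CONTENT_CHARS = 50  # minimum extracted length stated in this guide


def is_importable_url(url: str) -> bool:
    """Return True if the URL uses http or https and names a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def has_enough_content(extracted_text: str) -> bool:
    """Return True if the cleaned text meets the 50-character minimum.

    Stripping surrounding whitespace first is an assumption of this sketch.
    """
    return len(extracted_text.strip()) >= MIN_CONTENT_CHARS
```

A URL such as ftp://example.com/file fails the first check, and a page whose cleaned text is only a short title fails the second.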

Supported File Types

| Category | Formats |
|---|---|
| Documents | PDF, DOCX (Word), EPUB |
| Office | XLSX (Excel), PPTX (PowerPoint) |
| Text | Plain text (.txt), Markdown (.md), Log files (.log) |
| Web / Markup | HTML (.html / .htm / .xhtml), reStructuredText (.rst) |
| Data | CSV, JSON, JSONL, NDJSON |
| Images | JPEG, PNG, GIF, WebP, TIFF, BMP (text extracted via OCR) |
| Audio | MP3, WAV, M4A, FLAC, OGG, WMA, AAC (transcribed to text) |
| Video | MP4, MKV, AVI, MOV, WebM, WMV, FLV (audio extracted and transcribed) |
| Archives | ZIP, TAR.GZ (files extracted and processed individually) |

File size limit

The HTTP upload limit is 10 GB per request by default (batching.max_request_body_mb). The processing pipeline supports files up to 100 GB (source_processing.max_file_size_gb). To change the upload limit, adjust both batching.max_request_body_mb in settings.yaml and client_max_body_size in your Nginx configuration.
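
As a rough illustration of how the two limits relate, the sketch below classifies a file by size. The helper name and return labels are hypothetical, and it assumes the 10 GB default is stored as 10240 MB; check your settings.yaml for the actual value.

```python
MB = 1024 * 1024
GB = 1024 * MB


def classify_upload(size_bytes: int,
                    max_request_body_mb: int = 10 * 1024,  # assumed default (10 GB)
                    max_file_size_gb: int = 100) -> str:
    """Report which documented limit, if any, a file of this size exceeds."""
    if size_bytes > max_file_size_gb * GB:
        return "exceeds_pipeline_limit"
    if size_bytes > max_request_body_mb * MB:
        return "exceeds_upload_limit"
    return "ok"
```

A file that only exceeds the upload limit can still be accepted once both batching.max_request_body_mb and the Nginx client_max_body_size are raised.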

Archive handling

Archives are automatically extracted and the system detects the documentation format — Sphinx HTML, Markdown docs (MkDocs, Docusaurus), OpenAPI specs, or mixed files. Each file is processed according to its detected type.

Vision Processing

When enable_vision is enabled during upload, images embedded in PDFs and standalone image files are processed with the configured vision model. The vision model generates textual descriptions of visual content — diagrams, charts, photographs, and other images — which are included in the document chunks. This improves search and RAG quality by making visual content discoverable through text queries.

Vision processing is optional and requires a vision model to be configured in Settings > Models. For sources processed with vision enabled, an image gallery is available on the source detail overview page, showing all extracted images alongside their generated descriptions.

The choice you make at upload time is persisted on the source row, so re-extract and recovery honor it without you having to re-pass the flag.

Upload settings are persistent

Every choice you make on the upload dialog is stored on the source row, not on the in-flight queue payload. That means:

  • Recovery after a worker restart re-reads your settings instead of resetting everything to defaults.
  • Retrying an errored source preserves your normalization, vision, and content-filtering choices.
  • Re-extract runs the new extraction with the same settings the source was uploaded with — unless you explicitly override one for the re-extract call.

The persisted fields are:

| Field | What it controls |
|---|---|
| auto_analyze | Whether the upload flow auto-queues entity extraction after indexing. |
| enable_normalization | Run the cleaner pipeline (encoding fixes, OCR cleanup, paragraph dedup). null means "use the file-type default." |
| enable_vision | Use the vision model on images and scanned PDFs. |
| content_filtering | Apply domain content-exclusion rules during extraction. Filtered text stays searchable via RAG. |
| filtering_mode | Strictness of post-extraction filters: unfiltered / minimal / lenient / balanced / strict / maximum. See Filtering Modes. |
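
The "same settings unless explicitly overridden" behavior for re-extract can be pictured as a merge of the stored values with any overrides. This is a minimal sketch, not the actual implementation: the class and function names are hypothetical, and the field types are assumptions based on the descriptions above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class UploadSettings:
    """Settings persisted on the source row (field types are assumed)."""
    auto_analyze: bool
    enable_normalization: Optional[bool]  # None means "use the file-type default"
    enable_vision: bool
    content_filtering: bool
    filtering_mode: str


def settings_for_reextract(stored: UploadSettings, **overrides) -> UploadSettings:
    """Start from the persisted settings and apply only explicit overrides."""
    return replace(stored, **overrides)
```

Calling settings_for_reextract(stored) returns the stored settings unchanged, while settings_for_reextract(stored, filtering_mode="strict") changes one field for that call and leaves the stored object as it was.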

Processing Pipeline

Every source goes through a three-stage pipeline. Stage 1 (indexing) runs automatically after upload. Stage 2 (entity extraction) is optional and uses an LLM. Stage 3 (committing) runs automatically once extraction completes.

Stage 1: Indexing

Runs automatically after upload. The document is chunked into segments and each chunk is embedded as a vector for semantic search.

| What happens | Output |
|---|---|
| Text extraction from source format | Raw text content |
| Content normalization (optional) | Clean, consistent text |
| Chunking into segments | Configurable size (default ~900 chars) |
| Vector embedding generation | One embedding per chunk |

Time: ~30 seconds for a 100-page PDF.
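
To give a feel for the chunking step, here is a minimal sketch that packs whitespace-separated words into segments of at most about 900 characters. It is an assumption-laden illustration: the function name is hypothetical, and the real chunker may split on sentence or section boundaries, use overlap, or count size differently.

```python
def chunk_text(text: str, max_chars: int = 900) -> list[str]:
    """Greedily pack words into chunks no longer than max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            # The next word would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then receive one embedding, which is what makes the source searchable once indexing finishes.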

After indexing, the source is searchable via RAG — you can chat about it and search it immediately, without waiting for extraction.

Stage 2: Entity Extraction (Optional)

Uses an LLM to extract entities (people, organizations, concepts, etc.) and relationships from each chunk.

| What happens | Output |
|---|---|
| Chunk groups sent to LLM | Entity + relationship extraction |
| Template matching | Consistent entity types |
| Deduplication | Merged duplicate entities |
| Relationship mapping | Connections between entities |
| Entity embeddings | Vector embeddings for graph entities |

Time: ~5 minutes for a 100-page PDF (depends on LLM speed).

Controls:

  • Extraction depth: full (comprehensive, default) or quick (faster, fewer entities)
  • Domain selection: auto-detect or force a specific domain (e.g., technical, medical, legal)
  • Cancel: extraction can be cancelled mid-process. Completed chunks are preserved and the source reverts to indexed status (RAG still works).

Stage 3: Committing

Automatically runs after extraction. Imports extracted entities and relationships into the knowledge graph.

| What happens | Output |
|---|---|
| Create graph nodes | One node per entity |
| Create graph edges | One edge per relationship |
| Create templates | Node type templates for the graph |
| Link to source | Document node with provenance |
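
The outputs of the commit stage can be sketched with plain dictionaries. Everything here is hypothetical (the function name, the data shapes, and the ID format); it only illustrates the one-node-per-entity, one-edge-per-relationship idea and the provenance link back to a document node, not the actual graph schema.

```python
def commit_to_graph(source_id, entities, relationships):
    """Build nodes, edges, and templates for one source.

    entities: list of dicts with "name" and "type" keys
    relationships: list of (subject_name, relation, object_name) tuples
    """
    document_node = {"id": f"doc:{source_id}", "type": "Document"}

    # One node per entity, each pointing back to the document for provenance.
    nodes = {}
    for entity in entities:
        nodes[entity["name"]] = {
            "name": entity["name"],
            "type": entity["type"],
            "source": document_node["id"],
        }

    # One edge per relationship; this sketch skips edges whose endpoints are unknown.
    edges = [
        {"from": subject, "relation": relation, "to": obj}
        for subject, relation, obj in relationships
        if subject in nodes and obj in nodes
    ]

    # One template per distinct node type.
    templates = sorted({node["type"] for node in nodes.values()})

    return {
        "document": document_node,
        "nodes": list(nodes.values()),
        "edges": edges,
        "templates": templates,
    }
```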

Status Flow

Every source exposes a progress object with a 4-phase summary designed for the UI:

| Phase | Searchable | Meaning |
|---|---|---|
| waiting_to_index | No | Pending or failed; nothing useful exists yet |
| indexing | No | Chunking and embedding in progress |
| extracting | Yes | Indexed and searchable; extraction is optional and may be running |
| ready | Yes | Fully committed to the knowledge graph |

The is_searchable flag in progress lets the UI immediately determine whether the source can be queried, without parsing individual status strings.

Internal status values (for debugging and API consumers)
| Status | Maps to | Meaning |
|---|---|---|
| pending | waiting_to_index | Uploaded, waiting to start |
| indexing | indexing | Chunking and embedding in progress |
| indexed | extracting | Searchable via RAG, ready for extraction |
| extracting | extracting | LLM entity extraction in progress |
| mcp_extracting | extracting | Model Context Protocol (MCP)-driven entity extraction in progress |
| extracted | extracting | Extraction complete, ready to commit |
| committing | extracting | Importing entities into knowledge graph |
| committed | ready | Fully processed |
| error | waiting_to_index | Failed at some stage (check error_message) |

Tip: A source at indexed status is already useful: you can search it and chat about it. Extraction adds the knowledge graph layer on top.
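
For API consumers, the status table above can be expressed as a simple lookup. This is a sketch rather than the server's code: the function name and the "phase" key are assumptions, while the status values, phase names, and is_searchable come from the tables in this section.

```python
# Internal status -> UI phase, copied from the status table.
STATUS_TO_PHASE = {
    "pending": "waiting_to_index",
    "indexing": "indexing",
    "indexed": "extracting",
    "extracting": "extracting",
    "mcp_extracting": "extracting",
    "extracted": "extracting",
    "committing": "extracting",
    "committed": "ready",
    "error": "waiting_to_index",
}

# Phases marked "Yes" in the Searchable column.
SEARCHABLE_PHASES = {"extracting", "ready"}


def summarize_progress(status: str) -> dict:
    """Return a progress-like summary for an internal status value."""
    phase = STATUS_TO_PHASE[status]  # raises KeyError for unknown statuses
    return {"phase": phase, "is_searchable": phase in SEARCHABLE_PHASES}
```

In practice, prefer the is_searchable flag returned by the API; a client-side table like this one can drift if new internal statuses are added.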

Technical deep-dive

For detailed architecture of each pipeline stage, see the Extraction Pipeline Architecture.

Tags

Tags help organize sources into categories. Each tag has a name, optional color, and optional description.

  • Create, edit, and delete tags from the Sources page
  • Assign tags to sources by clicking the tag icon
  • Filter the source list by tag to find related documents
  • Tags support inline editing directly in the UI

Screenshot: Sources list with status and entity counts

Tags are scoped to the current database — each database has its own set of tags.

Content Filtering

Content filtering removes non-essential content — table of contents, changelogs, legal boilerplate, bibliography sections, and similar noise — before entity extraction. This improves extraction quality by focusing the LLM on meaningful content and reducing token waste.

Key points:

  • Enabled by default on all uploads (content_filtering=true)
  • Filtered content is only removed from extraction — it remains fully searchable via RAG
  • Each extraction domain defines which content categories to exclude
  • 15 built-in categories cover common non-essential patterns (TOC, legal, boilerplate, etc.)
  • Domains can also define custom regex patterns for domain-specific filtering
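
The key property of content filtering, that excluded text is withheld from extraction but stays indexed, can be sketched as follows. The patterns and names are hypothetical, and filtering whole chunks is an assumption; the built-in categories and their real rules are described in the Extraction Pipeline Architecture.

```python
import re

# Hypothetical exclusion patterns standing in for a domain's configured rules.
EXCLUDE_PATTERNS = [
    re.compile(r"^\s*table of contents\b", re.IGNORECASE),
    re.compile(r"^\s*(changelog|bibliography|references)\b", re.IGNORECASE),
]


def select_for_extraction(chunks, content_filtering=True):
    """Return the chunks to send to the LLM.

    The input list is never modified, mirroring the fact that filtered
    content remains indexed and searchable via RAG.
    """
    if not content_filtering:
        return list(chunks)
    return [
        chunk for chunk in chunks
        if not any(pattern.search(chunk) for pattern in EXCLUDE_PATTERNS)
    ]
```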

To disable content filtering for a specific upload (e.g., when the "boilerplate" content is actually important):

Uncheck Content Filtering in the upload dialog.

Technical deep-dive

For details on the filtering pipeline, categories, and custom patterns, see the Content Filtering section in the Extraction Pipeline Architecture.

Content Normalization

Uploaded content is normalized to fix common issues:

  • Encoding fixes (UTF-8 normalization)
  • Whitespace normalization (consistent line breaks, trimmed excess)
  • OCR artifact cleanup

Normalization is off by default for .csv, .tsv, .json, .jsonl, .ndjson, and .xml files. You can override this default per upload.
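
The interaction between the file-type default and an explicit enable_normalization value can be sketched like this. The function name is hypothetical, and the sketch assumes that every file type not in the list defaults to normalization on.

```python
from pathlib import Path
from typing import Optional

# Extensions for which normalization is off by default, per the list above.
DEFAULT_OFF_EXTENSIONS = {".csv", ".tsv", ".json", ".jsonl", ".ndjson", ".xml"}


def resolve_normalization(filename: str,
                          enable_normalization: Optional[bool] = None) -> bool:
    """Return the effective normalization setting for one upload.

    An explicit True/False wins; None falls back to the file-type default.
    """
    if enable_normalization is not None:
        return enable_normalization
    return Path(filename).suffix.lower() not in DEFAULT_OFF_EXTENSIONS
```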

To override the default for a specific file:

# Force normalization on for a CSV
curl -X POST http://localhost:8080/api/v1/sources \
  -F "file=@data.csv" \
  -F "enable_normalization=true"

# Force normalization off for a PDF
curl -X POST http://localhost:8080/api/v1/sources \
  -F "file=@report.pdf" \
  -F "enable_normalization=false"

Duplicate Detection

When uploading with skip_duplicates=true, the system computes a SHA-256 hash of the file content and skips files that already exist in the database. This prevents accidentally importing the same document twice.
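
Hash-based duplicate detection can be sketched in a few lines. The function names and the in-memory set of known hashes are assumptions made for illustration; the system itself compares against hashes stored in the database.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file content."""
    return hashlib.sha256(data).hexdigest()


def filter_duplicates(files, known_hashes):
    """Split uploads into accepted and skipped file names.

    files: iterable of (name, content_bytes) pairs
    known_hashes: set of hex digests already stored; updated in place
    """
    accepted, skipped = [], []
    for name, data in files:
        digest = content_hash(data)
        if digest in known_hashes:
            skipped.append(name)  # identical content already imported
        else:
            known_hashes.add(digest)
            accepted.append(name)
    return accepted, skipped
```

Because the hash covers only the content, a renamed copy of an existing file is still detected as a duplicate, while a file with even a one-byte change is treated as new.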

Managing Sources

Enable/Disable

Toggle a source's enabled flag to include or exclude it from:

  • Knowledge graph visibility
  • AI chat context
  • Search results

Toggle the enable/disable switch on any source in the Sources list.

Screenshot: Sources list showing active status toggles

Disabling a source doesn't delete it — the data is preserved and can be re-enabled at any time.

Deletion

Deleting a source permanently removes:

  • The source file record
  • All document chunks and embeddings
  • Associated citations
  • Extraction job data

Click the delete button on a source and confirm the deletion.

Warning: Deletion is permanent and cannot be undone. Graph nodes and edges created from the source remain in the knowledge graph after deletion.

Abort Processing

If a source is stuck or you need to stop processing:

  • Cancel extraction — Stops entity extraction, preserves completed chunks, reverts to indexed
  • Abort all processing — Cancels any in-progress stage and marks the source with an error

Monitoring Extraction

For detailed extraction monitoring, you can view:

  • Task list — Individual chunk extraction tasks with status, timing, and entity counts
  • Stats — Aggregate statistics (average tokens, duration, entities per chunk)
  • Charts — Visual progress data for the extraction pipeline

Open a source's detail view to see extraction progress, task breakdown, and statistics.

Screenshot: Source detail overview with extraction statistics