Loading
Document loading is the first step of the extraction pipeline. It converts raw files from any supported format into a uniform list[dict] structure with content (text) and metadata keys. This normalized output feeds directly into the normalization and chunking stages.
When Loading Runs
Loading happens inside the handle_index_document() handler on the Operations queue. The handler calls LoaderRegistry.load_document(filepath), which:
- Looks up the file extension in the registry
- Instantiates the appropriate loader
- Calls
loader.load_document()to extract text - Returns the raw document(s) for downstream normalization
The registry returns raw documents. Chunking is handled separately by ChunkingService after normalization, using a hierarchical chunking strategy suited to the document's content.
Loader Selection Flow
LoaderRegistry
The LoaderRegistry extends BaseRegistry[BaseLoader] and auto-discovers loaders at initialization by scanning two directories:
- Built-in loaders -- the
loaders/directory inchaoscypher_core - User plugins --
data/plugins/loaders/(any file matching*_loader.py)
Discovery works by finding classes that have a supported_extensions property, instantiating them, and registering each extension as a key. User plugins override built-in loaders with the same extension.
Singleton Caching
The registry is expensive to create (~10-35ms for module scanning and class inspection), so it is cached per EngineSettings instance via get_loader_registry(settings). Worker startup creates one registry that is reused for all document imports.
from chaoscypher_core import Loaders
# Recommended: one-liner for file loading
text = Loaders.load_text("/path/to/file.pdf")
For advanced loader plugin development, the registry is also accessible:
from chaoscypher_core.services.sources.loaders import get_loader_registry
registry = get_loader_registry(engine_settings)
documents = registry.load_document("/path/to/file.pdf")
Built-in Loaders
PdfLoader
| Property | Value |
|---|---|
| Extensions | .pdf |
| Library | pypdf (BSD-3, ADR-0003) |
| OCR Support | No |
| Output | Plain text |
Extracts text page-by-page using PdfReader, combines pages with double-newline separators. Includes metadata: page count, character count, extraction speed, title/author from PDF metadata.
pypdf preserves prose accurately but loses heading structure, which lowers the source's structure_score. The Interface explains this inline with an info tooltip on the affected source-quality indicator.
TextLoader
| Property | Value |
|---|---|
| Extensions | .txt, .md, .log |
| Library | Built-in open() |
| OCR Support | No |
| Output | Raw file content |
Reads UTF-8 text files with errors="replace" for encoding tolerance. The simplest loader -- returns the full file content as a single document.
CSVLoader
| Property | Value |
|---|---|
| Extensions | .csv |
| Library | stdlib csv |
| OCR Support | No |
| Output | One document per row |
Reads through csv.Sniffer to detect the dialect (delimiter, quoting style) so files using semicolons or tabs decode correctly without configuration. Each row becomes a separate document — structured data benefits from row-level granularity rather than treating the entire file as a text blob. Routes through detect_encoding() so cp1252 / Latin-1 exports keep their characters.
The upload API accepts enable_normalization=false, which is recommended for CSV and JSON files to preserve their exact structure.
JSONLoader
| Property | Value |
|---|---|
| Extensions | .json, .jsonl, .ndjson |
| Library | stdlib json |
| OCR Support | No |
| Output | One document for .json; one document per line for .jsonl / .ndjson |
Branches on extension. .json files are parsed as a single document via json.loads. .jsonl and .ndjson files are parsed line-by-line with per-line error isolation — one bad line records a loader_warnings_count increment but the rest of the file continues. Routes through detect_encoding() for non-UTF-8 exports. Raises ValidationError when every JSONL line fails to parse, otherwise attaches a loader_warnings metadata key to the first surviving document so the indexing handler surfaces partial failures.
HTMLLoader
| Property | Value |
|---|---|
| Extensions | .html, .htm, .xhtml |
| Library | beautifulsoup4 |
| OCR Support | No |
| Output | Visible body text + <title> metadata |
Strips chrome elements (script, style, nav, aside, footer, header, noscript) to extract clean prose. More aggressive than the archive's Sphinx handler — standalone HTML uploads are typically blog posts or article pages where the user's intent is the main content. Captures <title> in metadata and decomposes it from the soup so the title text doesn't appear twice.
RSTLoader
| Property | Value |
|---|---|
| Extensions | .rst |
| Library | docutils |
| OCR Support | No |
| Output | Plain text rendered from reStructuredText |
Renders reStructuredText source to plain text via docutils, handling directives like code-block, note, and image. Routes through detect_encoding() for legacy Windows-encoded .rst exports.
DOCXLoader
| Property | Value |
|---|---|
| Extensions | .docx |
| Library | python-docx |
| OCR Support | No |
| Output | Headings, paragraphs, list items, and tables flattened to text |
Iterates over the document's body in order, preserving heading hierarchy as Markdown-style # headers. Tables are rendered as tab-separated rows. Empty paragraphs are dropped to avoid noise.
XLSXLoader
| Property | Value |
|---|---|
| Extensions | .xlsx, .xlsm |
| Library | openpyxl |
| OCR Support | No |
| Output | One document per worksheet |
Reads each worksheet in read_only=True, data_only=True mode (so cell values come through as their final computed form, not formula strings). Rows are joined with tabs and sheets are joined with double newlines. Sheet name is preserved in document metadata.
PPTXLoader
| Property | Value |
|---|---|
| Extensions | .pptx |
| Library | python-pptx |
| OCR Support | No |
| Output | One document per slide with shape text concatenated |
Each slide becomes its own document. Shape text is concatenated in slide order. Slide notes are included when present. Slide index is preserved in document metadata.
EPUBLoader
| Property | Value |
|---|---|
| Extensions | .epub |
| Library | (stdlib zipfile + xml.etree) |
| OCR Support | No |
| Output | One document per chapter |
Reads the EPUB container directly as a ZIP archive and parses each XHTML chapter. Hand-rolled rather than using ebooklib to avoid taking on an AGPL dependency that would conflict with the project's permissive-dependency policy (see ADR-0002). Chapter title and order are preserved in metadata.
ImageLoader
| Property | Value |
|---|---|
| Extensions | .jpg, .jpeg, .png, .gif, .webp, .tiff, .tif, .bmp |
| Library | pytesseract + Pillow |
| OCR Support | Yes |
| Output | OCR-extracted text |
Performs Tesseract OCR on image files. Includes image metadata (dimensions, format, mode) alongside the extracted text. Requires the tesseract-ocr system package.
Vision Processing: When enable_vision=true is set during upload, the VisionService generates LLM-powered textual descriptions of images, augmenting text extraction for better RAG retrieval.
AudioLoader
| Property | Value |
|---|---|
| Extensions | .mp3, .wav, .m4a, .flac, .ogg, .wma, .aac |
| Library | ffmpeg + faster-whisper |
| OCR Support | No |
| Output | Transcribed text |
Converts audio to 16kHz mono WAV via ffmpeg, then transcribes using the Whisper base model (CPU, no GPU required). Includes metadata: duration, detected language, segment count. The Whisper model is lazily loaded and cached as a class variable.
VideoLoader
| Property | Value |
|---|---|
| Extensions | .mp4, .mkv, .avi, .mov, .webm, .wmv, .flv |
| Library | ffmpeg + faster-whisper |
| OCR Support | No |
| Output | Transcribed audio track |
Extracts the audio track from video files via ffmpeg, then transcribes identically to AudioLoader. Shares the same cached Whisper model instance.
ArchiveLoader
| Property | Value |
|---|---|
| Extensions | .zip, .tar.gz, .tgz |
| Library | zipfile / tarfile + format handlers |
| OCR Support | No |
| Output | Concatenated documents from archive contents |
Handles documentation archives with intelligent format detection. See Archive Handling below.
Archive Handling
Archives receive special treatment because they may contain structured documentation rather than arbitrary files. The ArchiveLoader orchestrates a multi-step process:
Secure Extraction
ArchiveExtractor extracts archives to a temporary directory with security validation:
- Path traversal prevention -- rejects members with
..in paths - Absolute path rejection -- rejects members with absolute paths
- Symlink validation -- prevents symlink-based attacks
- Size limits -- configurable maximum extraction size
- File count limits -- configurable maximum number of files
The temporary directory is always cleaned up in a finally block after processing.
Format Detection
DocumentationDetector uses heuristic scoring to identify the archive format. Detection runs in priority order:
| Format | Indicators | Confidence Signals |
|---|---|---|
| OpenAPI | openapi.json/yaml, swagger.json/yaml at root or nested | Validated by checking for openapi or swagger keys in the file |
| Sphinx HTML | _static/ directory, genindex.html, searchindex.js, .doctrees/, Sphinx CSS files | Each indicator adds 0.1-0.3 confidence; threshold is 0.5 |
| Markdown | 10+ .md/.mdx files, docs/ directory, mkdocs.yml, docusaurus.config.js | File count and config files contribute to confidence score |
| Generic | No specific patterns matched | Fallback at 0.1 confidence |
Detection also identifies the root path -- the subdirectory where documentation actually starts (e.g., docs/_build/html/ for Sphinx). This prevents handlers from processing non-documentation files.
Format Handlers
Each handler implements the ArchiveHandler protocol:
class ArchiveHandler(Protocol):
@property
def name(self) -> str: ...
def can_handle(self, extracted_dir: Path) -> tuple[bool, float]: ...
def process(self, extracted_dir: Path, settings: Any) -> list[dict[str, Any]]: ...
- SphinxHTMLHandler -- Parses Sphinx HTML documentation, extracting content from article elements and preserving navigation hierarchy
- MarkdownHandler -- Reads markdown files preserving directory structure as hierarchy metadata
- OpenAPIHandler -- Parses OpenAPI/Swagger specifications into readable documentation
- GenericHandler -- Falls back to processing each file individually via the
LoaderRegistry
All handlers add archive_file, detection_format, and detection_confidence to each document's metadata.
Error Handling
| Scenario | Behavior |
|---|---|
| Unsupported extension | LoaderRegistry.load_document() raises ValueError with supported extensions list |
| File not found | FileNotFoundError raised |
| Loader dependency missing | Loader-specific handling — wrapped in ValidationError with an actionable install hint |
| Library raises during parse | Wrapped in ValidationError so the indexing handler records a clean error_message instead of a third-party stack trace |
| Scanned PDF (zero text post-extract) | PdfLoader raises a specific ValidationError ("scanned PDF — enable vision to extract content") rather than returning an empty document |
| Archive extraction failure | ArchiveExtractionError raised; temp directory cleaned up |
| Archive security violation | ArchiveSecurityError raised (path traversal, absolute paths) |
| Empty document | Returns empty list []; logged as warning |
All errors propagate up to handle_index_document(), which catches them and calls adapter.fail_indexing(file_id, error_message) to set the source status to error with error_stage="indexing".
application/octet-stream is not in the default upload allowlistThe default batching.allowed_content_types list does not include application/octet-stream — including it would defeat the allowlist (the browser sends octet-stream for any binary it doesn't recognize). Operators who genuinely need to accept arbitrary binaries can add it via settings.yaml.
Custom Loaders
To add support for a new file format, create a *_loader.py file in data/plugins/loaders/. The conventions below match what the built-in W7 loaders (HTML / RST / DOCX / XLSX / PPTX / EPUB) follow:
- Wrap library errors in
ValidationErrorso the indexing handler can record an actionable error message instead of leaking a third-party stack trace. - Use
detect_encoding()for any text format where the user might supply a non-UTF-8 file (CSV, JSON, HTML, RST, plain text). Pair it withset_loader_encoding()so the encoding the loader actually used surfaces on the source'sloader_encoding_usedquality counter. - Raise specific errors when content is empty post-extraction. A scanned PDF that produces zero text is a different failure from "the file is corrupt" — give the user a hint they can act on.
from pathlib import Path
from chaoscypher_core.exceptions import ValidationError
from chaoscypher_core.plugins import PluginMetadata
from chaoscypher_core.utils.encoding import detect_encoding
class ExcelLoader:
@property
def metadata(self) -> PluginMetadata:
return PluginMetadata(
plugin_id="excel",
name="Excel Loader",
description="Loads Excel spreadsheets",
category="loader",
)
@property
def supported_extensions(self) -> list[str]:
return [".xlsx", ".xls"]
def __init__(self, settings=None):
self.settings = settings
def load_document(self, filepath: str) -> list[dict]:
try:
from openpyxl import load_workbook
except ImportError as exc:
raise ValidationError(
"openpyxl is required for Excel loading."
) from exc
try:
workbook = load_workbook(filepath, read_only=True, data_only=True)
except Exception as exc:
# Wrap library errors so the indexing handler reports a clean
# error_message instead of an opaque library trace.
raise ValidationError(f"Could not open Excel file: {exc}") from exc
documents = [...]
if not documents:
raise ValidationError(
"Excel file produced zero text — every cell was empty."
)
return documents
def supports_ocr(self) -> bool:
return False
The file will be automatically discovered and registered on the next worker restart. User plugins override built-in loaders that handle the same extensions.
For a worked example mirroring the shipped xlsx_loader.py, see Building Document Loaders.
Code Locations
| Component | Path |
|---|---|
| BaseLoader Protocol | packages/core/src/chaoscypher_core/services/sources/loaders/base.py |
| LoaderRegistry | packages/core/src/chaoscypher_core/services/sources/loaders/registry.py |
| Registry Factory | packages/core/src/chaoscypher_core/services/sources/loaders/factory.py |
| Encoding helper | packages/core/src/chaoscypher_core/utils/encoding.py |
| PdfLoader | packages/core/src/chaoscypher_core/services/sources/loaders/pdf_loader.py |
| TextLoader | packages/core/src/chaoscypher_core/services/sources/loaders/text_loader.py |
| CSVLoader | packages/core/src/chaoscypher_core/services/sources/loaders/csv_loader.py |
| JSONLoader | packages/core/src/chaoscypher_core/services/sources/loaders/json_loader.py |
| HTMLLoader | packages/core/src/chaoscypher_core/services/sources/loaders/html_loader.py |
| RSTLoader | packages/core/src/chaoscypher_core/services/sources/loaders/rst_loader.py |
| DOCXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/docx_loader.py |
| XLSXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/xlsx_loader.py |
| PPTXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/pptx_loader.py |
| EPUBLoader | packages/core/src/chaoscypher_core/services/sources/loaders/epub_loader.py |
| ImageLoader | packages/core/src/chaoscypher_core/services/sources/loaders/image_loader.py |
| AudioLoader | packages/core/src/chaoscypher_core/services/sources/loaders/audio_loader.py |
| VideoLoader | packages/core/src/chaoscypher_core/services/sources/loaders/video_loader.py |
| ArchiveLoader | packages/core/src/chaoscypher_core/services/sources/loaders/archive_loader.py |
| Archive Detector | packages/core/src/chaoscypher_core/services/sources/loaders/archive/detector.py |
| Archive Extractor | packages/core/src/chaoscypher_core/services/sources/loaders/archive/extractor.py |
| Archive Handlers | packages/core/src/chaoscypher_core/services/sources/loaders/archive/handlers/ |