Loading

Document loading is the first step of the extraction pipeline. It converts raw files from any supported format into a uniform list[dict] structure with content (text) and metadata keys. This normalized output feeds directly into the normalization and chunking stages.

When Loading Runs

Loading happens inside the handle_index_document() handler on the Operations queue. The handler calls LoaderRegistry.load_document(filepath), which:

  1. Looks up the file extension in the registry
  2. Instantiates the appropriate loader
  3. Calls loader.load_document() to extract text
  4. Returns the raw document(s) for downstream normalization
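
The four steps above can be sketched roughly as follows. This is a toy illustration, not the real LoaderRegistry: the names ToyLoaderRegistry and PlainTextSketchLoader are hypothetical, and the order of the file-exists and extension checks is an assumption. Note that Path.suffix only returns the last suffix, so compound extensions such as .tar.gz would need extra handling that is left out here.

```python
from pathlib import Path
from typing import Any, Callable


class PlainTextSketchLoader:
    """Hypothetical loader used only to exercise the registry sketch."""

    def load_document(self, filepath: str) -> list[dict]:
        text = Path(filepath).read_text(encoding="utf-8", errors="replace")
        return [{"content": text, "metadata": {"source": filepath}}]


class ToyLoaderRegistry:
    """Toy stand-in for LoaderRegistry; not the real implementation."""

    def __init__(self) -> None:
        # Maps a lowercase extension (".txt") to a loader factory.
        self._loaders: dict[str, Callable[[], Any]] = {}

    def register(self, extension: str, factory: Callable[[], Any]) -> None:
        self._loaders[extension.lower()] = factory

    def load_document(self, filepath: str) -> list[dict]:
        path = Path(filepath)
        if not path.exists():
            raise FileNotFoundError(filepath)
        extension = path.suffix.lower()  # step 1: look up the extension
        factory = self._loaders.get(extension)
        if factory is None:
            supported = ", ".join(sorted(self._loaders))
            raise ValueError(
                f"Unsupported extension {extension!r}; supported: {supported}"
            )
        loader = factory()  # step 2: instantiate the loader
        # Steps 3 and 4: extract text and hand back the raw documents.
        return loader.load_document(str(path))
```

Registering PlainTextSketchLoader under ".txt" and calling load_document() on a .txt path returns a one-element list whose dict carries the content and metadata keys.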
Loading returns raw content

The registry returns raw documents. Chunking is handled separately by ChunkingService after normalization, using a hierarchical chunking strategy suited to the document's content.

LoaderRegistry

The LoaderRegistry extends BaseRegistry[BaseLoader] and auto-discovers loaders at initialization by scanning two directories:

  1. Built-in loaders -- the loaders/ directory in chaoscypher_core
  2. User plugins -- data/plugins/loaders/ (any file matching *_loader.py)

Discovery works by finding classes that have a supported_extensions property, instantiating them, and registering each extension as a key. User plugins override built-in loaders with the same extension.

Singleton Caching

The registry is expensive to create (~10-35ms for module scanning and class inspection), so it is cached per EngineSettings instance via get_loader_registry(settings). Worker startup creates one registry that is reused for all document imports.
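
One way to cache per settings instance is sketched below. It is an assumption about the shape of get_loader_registry(), not its actual code; CachedRegistryStub and get_registry_sketch are hypothetical names.

```python
class CachedRegistryStub:
    """Placeholder for an expensive-to-build registry."""

    builds = 0  # counts constructions so the caching is observable

    def __init__(self, settings: object) -> None:
        type(self).builds += 1
        self.settings = settings


# id(settings) -> (settings, registry). Holding the settings object in the
# value keeps it alive, so its id cannot be reused by another instance.
_registry_cache: dict[int, tuple[object, CachedRegistryStub]] = {}


def get_registry_sketch(settings: object) -> CachedRegistryStub:
    entry = _registry_cache.get(id(settings))
    if entry is None:
        entry = (settings, CachedRegistryStub(settings))
        _registry_cache[id(settings)] = entry
    return entry[1]
```

Calling get_registry_sketch() twice with the same settings object returns the same registry; a different settings object triggers a fresh build.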

from chaoscypher_core import Loaders

# Recommended: one-liner for file loading
text = Loaders.load_text("/path/to/file.pdf")

For advanced loader plugin development, the registry is also accessible:

from chaoscypher_core.services.sources.loaders import get_loader_registry

registry = get_loader_registry(engine_settings)
documents = registry.load_document("/path/to/file.pdf")

Built-in Loaders

PdfLoader

| Property | Value |
| --- | --- |
| Extensions | .pdf |
| Library | pypdf (BSD-3, ADR-0003) |
| OCR Support | No |
| Output | Plain text |

Extracts text page-by-page using PdfReader, combines pages with double-newline separators. Includes metadata: page count, character count, extraction speed, title/author from PDF metadata.

pypdf preserves prose accurately but loses heading structure, which lowers the source's structure_score. The Interface explains this inline with an info tooltip on the affected source-quality indicator.

TextLoader

| Property | Value |
| --- | --- |
| Extensions | .txt, .md, .log |
| Library | Built-in open() |
| OCR Support | No |
| Output | Raw file content |

Reads UTF-8 text files with errors="replace" for encoding tolerance. The simplest loader -- returns the full file content as a single document.

CSVLoader

| Property | Value |
| --- | --- |
| Extensions | .csv |
| Library | stdlib csv |
| OCR Support | No |
| Output | One document per row |

Reads through csv.Sniffer to detect the dialect (delimiter, quoting style) so files using semicolons or tabs decode correctly without configuration. Each row becomes a separate document — structured data benefits from row-level granularity rather than treating the entire file as a text blob. Routes through detect_encoding() so cp1252 / Latin-1 exports keep their characters.
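
A rough sketch of dialect sniffing with one document per row. The candidate delimiter set, the "key: value" rendering of each row, and the metadata keys are assumptions; the encoding is passed in here, whereas the real loader obtains it from detect_encoding().

```python
import csv


def load_csv_rows(filepath: str, encoding: str = "utf-8") -> list[dict]:
    documents: list[dict] = []
    with open(filepath, newline="", encoding=encoding) as handle:
        sample = handle.read(4096)
        handle.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel  # fall back to comma-separated defaults
        reader = csv.DictReader(handle, dialect=dialect)
        for row_number, row in enumerate(reader, start=1):
            # Render each row as "column: value" lines (an illustrative choice).
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            documents.append(
                {
                    "content": content,
                    "metadata": {"row": row_number, "delimiter": dialect.delimiter},
                }
            )
    return documents
```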

Normalization is typically skipped for CSV

The upload API accepts enable_normalization=false, which is recommended for CSV and JSON files to preserve their exact structure.

JSONLoader

| Property | Value |
| --- | --- |
| Extensions | .json, .jsonl, .ndjson |
| Library | stdlib json |
| OCR Support | No |
| Output | One document for .json; one document per line for .jsonl / .ndjson |

Branches on extension. .json files are parsed as a single document via json.loads. .jsonl and .ndjson files are parsed line-by-line with per-line error isolation — one bad line records a loader_warnings_count increment but the rest of the file continues. Routes through detect_encoding() for non-UTF-8 exports. Raises ValidationError when every JSONL line fails to parse, otherwise attaches a loader_warnings metadata key to the first surviving document so the indexing handler surfaces partial failures.
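
Per-line error isolation for .jsonl / .ndjson can be sketched like this. The local ValidationError class stands in for the project's exception, and the shape of the loader_warnings value (a list of messages) is an assumption.

```python
import json


class ValidationError(Exception):
    """Stand-in for chaoscypher_core.exceptions.ValidationError."""


def load_jsonl(filepath: str, encoding: str = "utf-8") -> list[dict]:
    documents: list[dict] = []
    warnings: list[str] = []
    with open(filepath, encoding=encoding) as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are neither documents nor failures
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # One bad line is recorded; the rest of the file continues.
                warnings.append(f"line {line_number}: {exc.msg}")
                continue
            documents.append(
                {
                    "content": json.dumps(record, ensure_ascii=False),
                    "metadata": {"line": line_number},
                }
            )
    if warnings and not documents:
        raise ValidationError(f"All {len(warnings)} JSONL lines failed to parse")
    if warnings:
        # Attach partial-failure details to the first surviving document.
        documents[0]["metadata"]["loader_warnings"] = warnings
    return documents
```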

HTMLLoader

| Property | Value |
| --- | --- |
| Extensions | .html, .htm, .xhtml |
| Library | beautifulsoup4 |
| OCR Support | No |
| Output | Visible body text + <title> metadata |

Strips chrome elements (script, style, nav, aside, footer, header, noscript) to extract clean prose. More aggressive than the archive's Sphinx handler — standalone HTML uploads are typically blog posts or article pages where the user's intent is the main content. Captures <title> in metadata and decomposes it from the soup so the title text doesn't appear twice.

RSTLoader

| Property | Value |
| --- | --- |
| Extensions | .rst |
| Library | docutils |
| OCR Support | No |
| Output | Plain text rendered from reStructuredText |

Renders reStructuredText source to plain text via docutils, handling directives like code-block, note, and image. Routes through detect_encoding() for legacy Windows-encoded .rst exports.

DOCXLoader

| Property | Value |
| --- | --- |
| Extensions | .docx |
| Library | python-docx |
| OCR Support | No |
| Output | Headings, paragraphs, list items, and tables flattened to text |

Iterates over the document's body in order, preserving heading hierarchy as Markdown-style # headers. Tables are rendered as tab-separated rows. Empty paragraphs are dropped to avoid noise.

XLSXLoader

| Property | Value |
| --- | --- |
| Extensions | .xlsx, .xlsm |
| Library | openpyxl |
| OCR Support | No |
| Output | One document per worksheet |

Reads each worksheet in read_only=True, data_only=True mode (so cell values come through as their final computed form, not formula strings). Rows are joined with tabs and sheets are joined with double newlines. Sheet name is preserved in document metadata.

PPTXLoader

| Property | Value |
| --- | --- |
| Extensions | .pptx |
| Library | python-pptx |
| OCR Support | No |
| Output | One document per slide with shape text concatenated |

Each slide becomes its own document. Shape text is concatenated in slide order. Slide notes are included when present. Slide index is preserved in document metadata.

EPUBLoader

| Property | Value |
| --- | --- |
| Extensions | .epub |
| Library | (stdlib zipfile + xml.etree) |
| OCR Support | No |
| Output | One document per chapter |

Reads the EPUB container directly as a ZIP archive and parses each XHTML chapter. Hand-rolled rather than using ebooklib to avoid taking on an AGPL dependency that would conflict with the project's permissive-dependency policy (see ADR-0002). Chapter title and order are preserved in metadata.
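
For illustration, a stdlib-only reader along these lines might look like the sketch below. It is not the shipped epub_loader.py: the function name and metadata keys are assumptions, and it expects chapters to be well-formed XHTML (ElementTree rejects undeclared HTML entities such as &nbsp;).

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
XHTML = "{http://www.w3.org/1999/xhtml}"


def load_epub_chapters(filepath: str) -> list[dict]:
    documents: list[dict] = []
    with zipfile.ZipFile(filepath) as archive:
        # container.xml points at the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml does not declare a rootfile")
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))
        # The manifest maps item ids to files; the spine gives reading order.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        spine = package.findall("opf:spine/opf:itemref", OPF_NS)
        for order, itemref in enumerate(spine, start=1):
            href = manifest.get(itemref.attrib["idref"])
            if href is None:
                continue
            chapter = ET.fromstring(archive.read(posixpath.join(base, href)))
            title = chapter.find(f".//{XHTML}title")
            body = chapter.find(f"{XHTML}body")
            source = body if body is not None else chapter
            # Collapse runs of whitespace left over from the markup.
            text = " ".join(" ".join(source.itertext()).split())
            documents.append(
                {
                    "content": text,
                    "metadata": {
                        "chapter_title": title.text if title is not None else None,
                        "chapter_order": order,
                    },
                }
            )
    return documents
```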

ImageLoader

| Property | Value |
| --- | --- |
| Extensions | .jpg, .jpeg, .png, .gif, .webp, .tiff, .tif, .bmp |
| Library | pytesseract + Pillow |
| OCR Support | Yes |
| Output | OCR-extracted text |

Performs Tesseract OCR on image files. Includes image metadata (dimensions, format, mode) alongside the extracted text. Requires the tesseract-ocr system package.

Vision Processing: When enable_vision=true is set during upload, the VisionService generates LLM-powered textual descriptions of images, augmenting text extraction for better RAG retrieval.

AudioLoader

| Property | Value |
| --- | --- |
| Extensions | .mp3, .wav, .m4a, .flac, .ogg, .wma, .aac |
| Library | ffmpeg + faster-whisper |
| OCR Support | No |
| Output | Transcribed text |

Converts audio to 16kHz mono WAV via ffmpeg, then transcribes using the Whisper base model (CPU, no GPU required). Includes metadata: duration, detected language, segment count. The Whisper model is lazily loaded and cached as a class variable.

VideoLoader

| Property | Value |
| --- | --- |
| Extensions | .mp4, .mkv, .avi, .mov, .webm, .wmv, .flv |
| Library | ffmpeg + faster-whisper |
| OCR Support | No |
| Output | Transcribed audio track |

Extracts the audio track from video files via ffmpeg, then transcribes identically to AudioLoader. Shares the same cached Whisper model instance.
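
Both the audio and the video loader start from the same ffmpeg step. The sketch below shows one plausible command line for it; the helper names are hypothetical and the exact flags the loaders pass are not documented here, so treat the argument list as an assumption.

```python
import subprocess


def build_ffmpeg_command(
    source: str, target_wav: str, *, drop_video: bool = False
) -> list[str]:
    """Build an ffmpeg argv that resamples to 16 kHz mono WAV."""
    command = ["ffmpeg", "-y", "-i", source]
    if drop_video:
        command.append("-vn")  # ignore the video stream, keep audio only
    # -ar sets the sample rate, -ac the channel count.
    command += ["-ar", "16000", "-ac", "1", target_wav]
    return command


def convert_to_wav(source: str, target_wav: str, *, drop_video: bool = False) -> None:
    # check=True raises CalledProcessError if ffmpeg exits non-zero.
    subprocess.run(
        build_ffmpeg_command(source, target_wav, drop_video=drop_video),
        check=True,
        capture_output=True,
    )
```

Building the argv as a list rather than a shell string avoids quoting problems with file names that contain spaces.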

ArchiveLoader

| Property | Value |
| --- | --- |
| Extensions | .zip, .tar.gz, .tgz |
| Library | zipfile / tarfile + format handlers |
| OCR Support | No |
| Output | Concatenated documents from archive contents |

Handles documentation archives with intelligent format detection. See Archive Handling below.

Archive Handling

Archives receive special treatment because they may contain structured documentation rather than arbitrary files. The ArchiveLoader orchestrates a multi-step process:

Secure Extraction

ArchiveExtractor extracts archives to a temporary directory with security validation:

  • Path traversal prevention -- rejects members with .. in paths
  • Absolute path rejection -- rejects members with absolute paths
  • Symlink validation -- prevents symlink-based attacks
  • Size limits -- configurable maximum extraction size
  • File count limits -- configurable maximum number of files

The temporary directory is always cleaned up in a finally block after processing.
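
The checks listed above could be applied to a ZIP archive before extraction roughly as follows. This is a sketch, not ArchiveExtractor itself: the function name and limit parameters are hypothetical, the local ArchiveSecurityError is a stand-in, and tar archives would need the equivalent checks on TarInfo members.

```python
import stat
import zipfile
from pathlib import PurePosixPath


class ArchiveSecurityError(Exception):
    """Stand-in for the project's ArchiveSecurityError."""


def validate_zip_members(
    archive: zipfile.ZipFile, *, max_total_bytes: int, max_files: int
) -> None:
    members = archive.infolist()
    if len(members) > max_files:
        raise ArchiveSecurityError(f"Archive has more than {max_files} files")
    total = 0
    for info in members:
        name = info.filename.replace("\\", "/")
        path = PurePosixPath(name)
        # Absolute paths, including Windows drive-letter forms like "C:/x".
        if path.is_absolute() or (len(name) > 1 and name[1] == ":"):
            raise ArchiveSecurityError(f"Absolute path in archive: {name}")
        if ".." in path.parts:
            raise ArchiveSecurityError(f"Path traversal in archive: {name}")
        # The upper 16 bits of external_attr hold the Unix file mode.
        if stat.S_ISLNK(info.external_attr >> 16):
            raise ArchiveSecurityError(f"Symlink in archive: {name}")
        total += info.file_size
        if total > max_total_bytes:
            raise ArchiveSecurityError("Archive exceeds the extraction size limit")
```

Validating every member before extracting anything means a rejected archive never writes a file to the temporary directory.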

Format Detection

DocumentationDetector uses heuristic scoring to identify the archive format. Detection runs in priority order:

| Format | Indicators | Confidence Signals |
| --- | --- | --- |
| OpenAPI | openapi.json/yaml, swagger.json/yaml at root or nested | Validated by checking for openapi or swagger keys in the file |
| Sphinx HTML | _static/ directory, genindex.html, searchindex.js, .doctrees/, Sphinx CSS files | Each indicator adds 0.1-0.3 confidence; threshold is 0.5 |
| Markdown | 10+ .md/.mdx files, docs/ directory, mkdocs.yml, docusaurus.config.js | File count and config files contribute to confidence score |
| Generic | No specific patterns matched | Fallback at 0.1 confidence |

Detection also identifies the root path -- the subdirectory where documentation actually starts (e.g., docs/_build/html/ for Sphinx). This prevents handlers from processing non-documentation files.
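
To make the additive scoring concrete, here is a small sketch covering only the Sphinx check and the generic fallback. The per-indicator weights and the format labels are made up for illustration; the source only states that each indicator adds 0.1-0.3 and that the threshold is 0.5.

```python
from pathlib import Path

# Hypothetical weights; the real detector's values are not documented here.
SPHINX_INDICATORS = {
    "_static": 0.3,
    "searchindex.js": 0.3,
    "genindex.html": 0.2,
    ".doctrees": 0.2,
}
SPHINX_THRESHOLD = 0.5
GENERIC_CONFIDENCE = 0.1


def score_sphinx(root: Path) -> float:
    score = sum(
        weight for name, weight in SPHINX_INDICATORS.items() if (root / name).exists()
    )
    return min(score, 1.0)


def detect_format(root: Path) -> tuple[str, float]:
    score = score_sphinx(root)
    if score >= SPHINX_THRESHOLD:
        return ("sphinx_html", score)
    # Nothing matched strongly enough: fall back to the generic handler.
    return ("generic", GENERIC_CONFIDENCE)
```

A full detector would run the OpenAPI and Markdown checks in priority order as well, and would also report the detected root path.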

Format Handlers

Each handler implements the ArchiveHandler protocol:

from pathlib import Path
from typing import Any, Protocol

class ArchiveHandler(Protocol):
    @property
    def name(self) -> str: ...
    def can_handle(self, extracted_dir: Path) -> tuple[bool, float]: ...
    def process(self, extracted_dir: Path, settings: Any) -> list[dict[str, Any]]: ...

  • SphinxHTMLHandler -- Parses Sphinx HTML documentation, extracting content from article elements and preserving navigation hierarchy
  • MarkdownHandler -- Reads markdown files preserving directory structure as hierarchy metadata
  • OpenAPIHandler -- Parses OpenAPI/Swagger specifications into readable documentation
  • GenericHandler -- Falls back to processing each file individually via the LoaderRegistry

All handlers add archive_file, detection_format, and detection_confidence to each document's metadata.
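
As an illustration of the protocol, a stripped-down Markdown-style handler might look like this. It is not the shipped MarkdownHandler: the class name, the confidence heuristic (file count divided by ten), and the hierarchy metadata key are assumptions.

```python
from pathlib import Path
from typing import Any


class MarkdownSketchHandler:
    """Hypothetical handler satisfying the ArchiveHandler protocol."""

    @property
    def name(self) -> str:
        return "markdown"

    def can_handle(self, extracted_dir: Path) -> tuple[bool, float]:
        count = sum(1 for _ in extracted_dir.rglob("*.md"))
        # Assumed heuristic: confidence grows with file count, capped at 1.0.
        return (count > 0, min(1.0, count / 10))

    def process(self, extracted_dir: Path, settings: Any) -> list[dict[str, Any]]:
        _, confidence = self.can_handle(extracted_dir)
        documents: list[dict[str, Any]] = []
        for file in sorted(extracted_dir.rglob("*.md")):
            relative = file.relative_to(extracted_dir)
            documents.append(
                {
                    "content": file.read_text(encoding="utf-8", errors="replace"),
                    "metadata": {
                        "archive_file": relative.as_posix(),
                        # Directory components double as hierarchy metadata.
                        "hierarchy": list(relative.parent.parts),
                        "detection_format": self.name,
                        "detection_confidence": confidence,
                    },
                }
            )
        return documents
```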

Error Handling

| Scenario | Behavior |
| --- | --- |
| Unsupported extension | LoaderRegistry.load_document() raises ValueError with supported extensions list |
| File not found | FileNotFoundError raised |
| Loader dependency missing | Loader-specific handling: wrapped in ValidationError with an actionable install hint |
| Library raises during parse | Wrapped in ValidationError so the indexing handler records a clean error_message instead of a third-party stack trace |
| Scanned PDF (zero text post-extract) | PdfLoader raises a specific ValidationError ("scanned PDF — enable vision to extract content") rather than returning an empty document |
| Archive extraction failure | ArchiveExtractionError raised; temp directory cleaned up |
| Archive security violation | ArchiveSecurityError raised (path traversal, absolute paths) |
| Empty document | Returns empty list []; logged as warning |

All errors propagate up to handle_index_document(), which catches them and calls adapter.fail_indexing(file_id, error_message) to set the source status to error with error_stage="indexing".
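
The catch-and-record pattern can be sketched as follows. The function signature and the RecordingAdapter test double are hypothetical; only the adapter.fail_indexing(file_id, error_message) call comes from the description above.

```python
from typing import Callable, Optional


class RecordingAdapter:
    """Test double that records fail_indexing calls instead of persisting them."""

    def __init__(self) -> None:
        self.failures: list[tuple[str, str]] = []

    def fail_indexing(self, file_id: str, error_message: str) -> None:
        self.failures.append((file_id, error_message))


def handle_index_document_sketch(
    file_id: str,
    filepath: str,
    load: Callable[[str], list[dict]],
    adapter: RecordingAdapter,
) -> Optional[list[dict]]:
    try:
        return load(filepath)
    except Exception as exc:
        # Any loader error becomes a recorded failure rather than a crash.
        adapter.fail_indexing(file_id, str(exc))
        return None
```

Because loaders wrap third-party exceptions in ValidationError, str(exc) here is already the clean, user-facing message.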

application/octet-stream is not in the default upload allowlist

The default batching.allowed_content_types list does not include application/octet-stream — including it would defeat the allowlist (the browser sends octet-stream for any binary it doesn't recognize). Operators who genuinely need to accept arbitrary binaries can add it via settings.yaml.

Custom Loaders

To add support for a new file format, create a *_loader.py file in data/plugins/loaders/. The conventions below match what the built-in W7 loaders (HTML / RST / DOCX / XLSX / PPTX / EPUB) follow:

  • Wrap library errors in ValidationError so the indexing handler can record an actionable error message instead of leaking a third-party stack trace.
  • Use detect_encoding() for any text format where the user might supply a non-UTF-8 file (CSV, JSON, HTML, RST, plain text). Pair it with set_loader_encoding() so the encoding the loader actually used surfaces on the source's loader_encoding_used quality counter.
  • Raise specific errors when content is empty post-extraction. A scanned PDF that produces zero text is a different failure from "the file is corrupt", so give the user a hint they can act on.

from chaoscypher_core.exceptions import ValidationError
from chaoscypher_core.plugins import PluginMetadata


class ExcelLoader:
    @property
    def metadata(self) -> PluginMetadata:
        return PluginMetadata(
            plugin_id="excel",
            name="Excel Loader",
            description="Loads Excel spreadsheets",
            category="loader",
        )

    @property
    def supported_extensions(self) -> list[str]:
        # openpyxl reads .xlsx/.xlsm only; legacy .xls needs a different library.
        return [".xlsx", ".xlsm"]

    def __init__(self, settings=None):
        self.settings = settings

    def load_document(self, filepath: str) -> list[dict]:
        # Import lazily so a missing dependency surfaces as a clean error.
        try:
            from openpyxl import load_workbook
        except ImportError as exc:
            raise ValidationError(
                "openpyxl is required for Excel loading."
            ) from exc

        try:
            workbook = load_workbook(filepath, read_only=True, data_only=True)
        except Exception as exc:
            # Wrap library errors so the indexing handler reports a clean
            # error_message instead of an opaque library trace.
            raise ValidationError(f"Could not open Excel file: {exc}") from exc

        # Placeholder: build one {"content", "metadata"} dict per worksheet.
        documents = [...]
        if not documents:
            raise ValidationError(
                "Excel file produced zero text — every cell was empty."
            )
        return documents

    def supports_ocr(self) -> bool:
        return False

The file will be automatically discovered and registered on the next worker restart. User plugins override built-in loaders that handle the same extensions.

For a worked example mirroring the shipped xlsx_loader.py, see Building Document Loaders.

Code Locations

| Component | Path |
| --- | --- |
| BaseLoader Protocol | packages/core/src/chaoscypher_core/services/sources/loaders/base.py |
| LoaderRegistry | packages/core/src/chaoscypher_core/services/sources/loaders/registry.py |
| Registry Factory | packages/core/src/chaoscypher_core/services/sources/loaders/factory.py |
| Encoding helper | packages/core/src/chaoscypher_core/utils/encoding.py |
| PdfLoader | packages/core/src/chaoscypher_core/services/sources/loaders/pdf_loader.py |
| TextLoader | packages/core/src/chaoscypher_core/services/sources/loaders/text_loader.py |
| CSVLoader | packages/core/src/chaoscypher_core/services/sources/loaders/csv_loader.py |
| JSONLoader | packages/core/src/chaoscypher_core/services/sources/loaders/json_loader.py |
| HTMLLoader | packages/core/src/chaoscypher_core/services/sources/loaders/html_loader.py |
| RSTLoader | packages/core/src/chaoscypher_core/services/sources/loaders/rst_loader.py |
| DOCXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/docx_loader.py |
| XLSXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/xlsx_loader.py |
| PPTXLoader | packages/core/src/chaoscypher_core/services/sources/loaders/pptx_loader.py |
| EPUBLoader | packages/core/src/chaoscypher_core/services/sources/loaders/epub_loader.py |
| ImageLoader | packages/core/src/chaoscypher_core/services/sources/loaders/image_loader.py |
| AudioLoader | packages/core/src/chaoscypher_core/services/sources/loaders/audio_loader.py |
| VideoLoader | packages/core/src/chaoscypher_core/services/sources/loaders/video_loader.py |
| ArchiveLoader | packages/core/src/chaoscypher_core/services/sources/loaders/archive_loader.py |
| Archive Detector | packages/core/src/chaoscypher_core/services/sources/loaders/archive/detector.py |
| Archive Extractor | packages/core/src/chaoscypher_core/services/sources/loaders/archive/extractor.py |
| Archive Handlers | packages/core/src/chaoscypher_core/services/sources/loaders/archive/handlers/ |