
Normalization

Content normalization is the second step of the extraction pipeline, running immediately after document loading and before chunking. It transforms raw extracted text into clean, consistent Markdown output by fixing encoding errors, removing OCR artifacts, stripping HTML boilerplate, and normalizing formatting.

Why Normalization Exists

Raw document text is messy. PDFs produce mojibake and scattered whitespace. OCR outputs gibberish lines and duplicate paragraphs. Web scrapes include navigation, ads, and script tags. Without normalization, these artifacts propagate into chunks and embeddings, degrading RAG search quality and confusing the LLM during entity extraction.

Normalization addresses this by running a configurable pipeline of cleaners and a transformer to produce uniform Markdown output, regardless of the original source format.

When Normalization Runs

Normalization runs inside handle_index_document() on the Operations queue, between loading and chunking. The _extract_text() helper function orchestrates it.

Each document in the loader output is normalized independently. The ContentType is inferred from the file extension, which determines how cleaners and the transformer behave.
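As a rough illustration of the per-document flow described above, the sketch below infers a content type from the file extension and normalizes each loader document on its own. The `ContentType` members, the extension table, and the `normalize_documents` helper are assumptions made for this example and are not taken from the real `_extract_text()` implementation.

```python
from enum import Enum
from pathlib import Path


class ContentType(Enum):
    # Hypothetical stand-in for the real ContentType model.
    TEXT = "text"
    HTML = "html"
    MARKDOWN = "markdown"
    JSON = "json"
    CSV = "csv"


# Assumed extension table; the real mapping may differ.
_EXTENSION_TYPES = {
    ".html": ContentType.HTML,
    ".htm": ContentType.HTML,
    ".md": ContentType.MARKDOWN,
    ".json": ContentType.JSON,
    ".csv": ContentType.CSV,
}


def infer_content_type(filename: str) -> ContentType:
    """Map a file extension to a content type, defaulting to plain text."""
    return _EXTENSION_TYPES.get(Path(filename).suffix.lower(), ContentType.TEXT)


def normalize_documents(documents, filename, normalize):
    """Normalize each loader document independently with the inferred type."""
    content_type = infer_content_type(filename)
    return [normalize(doc, content_type) for doc in documents]
```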

When Normalization is Skipped

Normalization is skipped in two cases:

  1. User opt-out -- The upload API accepts enable_normalization=false. This is logged as import_document_normalization_skipped with reason="disabled_by_user".
  2. Structured data -- CSV and JSON files should typically be uploaded with normalization disabled to preserve their exact structure. The UI recommends this for structured formats.

When to disable normalization

Disable normalization for code files, CSV data, JSON data, or any content where exact formatting matters. Normalization is designed for natural language documents (PDFs, web pages, scanned images).

Pipeline Architecture

ContentNormalizerService orchestrates a sequential pipeline of cleaners followed by a single transformer.

Each cleaner receives the output of the previous one and returns the cleaned content together with the list of operations applied (a CleanerResult, which also unpacks as a (cleaned_content, operations_applied) tuple). The operations list tracks exactly what was done, enabling quality assessment and debugging.
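The chaining described above can be sketched in a few lines. This is a simplified model and not the service's actual code: cleaners are plain callables returning a `(content, ops)` tuple, and `strip_cleaner`, `collapse_spaces_cleaner`, and `run_pipeline` are hypothetical names.

```python
import re
from typing import Callable

# A cleaner here is any callable returning (cleaned_content, operations_applied).
Cleaner = Callable[[str], tuple[str, list[str]]]


def strip_cleaner(content: str) -> tuple[str, list[str]]:
    cleaned = content.strip()
    return cleaned, (["strip"] if cleaned != content else [])


def collapse_spaces_cleaner(content: str) -> tuple[str, list[str]]:
    cleaned = re.sub(r" {2,}", " ", content)
    return cleaned, (["collapse_spaces"] if cleaned != content else [])


def run_pipeline(content: str, cleaners: list[Cleaner], transformer: Callable[[str], str]):
    """Feed each cleaner the previous cleaner's output, then transform once."""
    operations: list[str] = []
    for cleaner in cleaners:
        content, ops = cleaner(content)
        operations.extend(ops)
    return transformer(content), operations
```

Because every cleaner reports its own operations, the combined list doubles as an audit trail for the whole run.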

Content Type Detection

If the caller does not specify a ContentType, the service auto-detects it by examining the content:

| Detection | Pattern |
| --- | --- |
| HTML | <!doctype html>, <html>, <head>, <body> tags |
| JSON | Content starts with {/[ and ends with }/] |
| CSV | 2+ lines with consistent comma counts |
| Markdown | Headers (# ...), code fences, tables, list markers |
| Code | def/class/function/const/let/import keywords |
| Text | Default fallback |

The detected type is stored in the working metadata and passed to cleaners and the transformer for type-aware behavior.
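One way to read the detection table is as an ordered series of checks where the first match wins. The sketch below follows that reading; the exact regular expressions, the check order, and the string return values are assumptions, since the real service's patterns are not shown here.

```python
import re

_HTML = re.compile(r"<!doctype html|<html[\s>]|<head[\s>]|<body[\s>]")
_MARKDOWN = re.compile(r"^(#{1,6} |\|.+\||[-*+] |\d+\. )", re.MULTILINE)
_CODE = re.compile(r"^\s*(def|class|function|const|let|import)\b", re.MULTILINE)
_CODE_FENCE = "`" * 3


def detect_content_type(content: str) -> str:
    """Return a content type label using the first heuristic that matches."""
    stripped = content.strip()
    if _HTML.search(stripped.lower()):
        return "html"
    if stripped and stripped[0] in "{[" and stripped[-1] in "}]":
        return "json"
    lines = [line for line in stripped.splitlines() if line.strip()]
    if len(lines) >= 2:
        comma_counts = {line.count(",") for line in lines}
        # Every line has the same, non-zero number of commas.
        if len(comma_counts) == 1 and comma_counts.pop() > 0:
            return "csv"
    if _CODE_FENCE in stripped or _MARKDOWN.search(stripped):
        return "markdown"
    if _CODE.search(stripped):
        return "code"
    return "text"
```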

Cleaners

Cleaners implement the CleanerProtocol:

from typing import Protocol, runtime_checkable


@runtime_checkable
class CleanerProtocol(Protocol):
    @property
    def name(self) -> str: ...

    def clean(self, content: str, metadata: dict | None = None) -> CleanerResult: ...

The protocol is runtime_checkable, so custom cleaners can be validated with isinstance().
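A minimal, self-contained illustration of that isinstance() check follows. The protocol is redeclared locally with a simplified return type, and `UppercaseCleaner` and `NotACleaner` are hypothetical classes invented for the example.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class CleanerProtocol(Protocol):
    # Local stand-in for the real protocol; the return type is simplified.
    @property
    def name(self) -> str: ...

    def clean(self, content: str, metadata: Optional[dict] = None): ...


class UppercaseCleaner:
    @property
    def name(self) -> str:
        return "uppercase"

    def clean(self, content: str, metadata: Optional[dict] = None):
        return content.upper(), ["uppercase"]


class NotACleaner:
    # Has a name but no clean() method, so it does not satisfy the protocol.
    name = "missing_clean_method"


is_valid = isinstance(UppercaseCleaner(), CleanerProtocol)
is_invalid = isinstance(NotACleaner(), CleanerProtocol)
```

Note that a runtime protocol check only confirms that the members exist. It does not verify signatures or return types, so a type checker is still useful for custom cleaners.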

clean() returns a CleanerResult dataclass with the cleaned content, the ops list, and three per-removal counts that drive the source-row quality counters:

| Field | Type | Populated by |
| --- | --- | --- |
| content | str | every cleaner -- the cleaned text |
| ops | list[str] | every cleaner -- operation identifiers (e.g. "encoding_fix", "gibberish_removal:14") |
| lines_removed | int | OCR cleaner (gibberish + page artifact passes); 0 from text / web cleaners |
| paragraphs_deduplicated | int | OCR cleaner's duplicate-paragraph pass; 0 from text / web cleaners |
| chars_removed | int | text cleaner (net before/after delta); 0 elsewhere |

The dataclass also defines __iter__ yielding (content, ops) so pre-W11 callers that wrote content, ops = cleaner.clean(...) keep working. New code should access the dataclass fields by name.
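The shape described above could look roughly like the following. This is a sketch reconstructed from the field table, not the real class: the default values (in particular the empty default for `ops`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class CleanerResult:
    # Field names follow the table above; defaults are assumed for illustration.
    content: str
    ops: list[str] = field(default_factory=list)
    lines_removed: int = 0
    paragraphs_deduplicated: int = 0
    chars_removed: int = 0

    def __iter__(self) -> Iterator:
        # Yield only (content, ops) so legacy tuple unpacking keeps working.
        yield self.content
        yield self.ops


result = CleanerResult("clean text", ["encoding_fix"], chars_removed=3)
content, ops = result  # legacy tuple-style unpacking
```

Yielding exactly two items means the counters are invisible to old callers, while new callers read `result.chars_removed` and friends directly.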

WebCleaner

Purpose: Extracts main content from HTML, removing boilerplate (navigation, ads, footers, scripts, styles).

When it activates: Only processes content when the metadata content_type is HTML or WEB, or when the content looks like HTML (detected via tag patterns in the first 1000 characters).

Extraction methods (in priority order):

  1. Trafilatura (preferred) -- High-quality web content extraction with configurable output format (markdown or text). Favors precision over recall, includes tables and links, deduplicates content. Requires the trafilatura package.
  2. Basic regex fallback -- Strips <script>, <style>, <nav>, <footer>, <header>, <aside> elements, converts block elements to newlines, removes remaining tags, decodes HTML entities. Used when trafilatura is unavailable.

Operations recorded: trafilatura_extraction or basic_html_extraction.
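The basic regex fallback can be approximated with the standard library alone. The sketch below follows the steps listed above (strip boilerplate elements, turn block elements into newlines, drop remaining tags, decode entities); the specific patterns, the list of block tags, and the function name are assumptions, and the trafilatura path is not shown.

```python
import html
import re

_BOILERPLATE = re.compile(
    r"<(script|style|nav|footer|header|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_BLOCK_TAGS = re.compile(
    r"</?(p|div|br|h[1-6]|li|tr|section|article)\b[^>]*>", re.IGNORECASE
)
_ANY_TAG = re.compile(r"<[^>]+>")


def basic_html_extraction(markup: str) -> str:
    """Regex-only extraction of visible text from an HTML string."""
    text = _BOILERPLATE.sub("", markup)   # drop boilerplate elements entirely
    text = _BLOCK_TAGS.sub("\n", text)    # block-level tags become line breaks
    text = _ANY_TAG.sub("", text)         # remove whatever tags are left
    text = html.unescape(text)            # decode entities such as &amp;
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Regex extraction is fragile on malformed or deeply nested markup, which is consistent with it being the fallback and not the preferred path.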

TextCleaner

Purpose: Fixes fundamental text issues -- encoding errors, unicode inconsistencies, control characters, and whitespace irregularities.

Operations (applied in order, each controlled by a setting):

| Step | Setting | What It Does |
| --- | --- | --- |
| Encoding fix | enable_encoding_fix | Uses ftfy to repair mojibake (e.g., cafÃ© becomes café). Fixes character width, line breaks, surrogates, and terminal escapes. |
| Unicode normalization | enable_unicode_normalize | Normalizes to NFC form via unicodedata.normalize("NFC", ...). Ensures characters with multiple encodings use a single canonical form. |
| Control character removal | enable_control_char_removal | Strips ASCII control chars (0x00-0x1F except tab/newline/CR), DEL (0x7F), and C1 control chars (0x80-0x9F). |
| Whitespace normalization | enable_whitespace_normalize | Converts non-breaking spaces, em/en spaces, and other unicode whitespace to standard space. Normalizes line endings to \n. Collapses multiple spaces. Collapses 3+ newlines to 2 (preserving paragraph breaks). Strips trailing whitespace per line. |
| BOM removal | Always on | Removes UTF-8/UTF-16 Byte Order Marks from the start of content. |

Operations recorded: encoding_fix, unicode_normalize, control_char_removal, whitespace_normalize, bom_removal.

Dependency: ftfy

The encoding fix step requires the ftfy package. If unavailable, it logs a warning and skips the step gracefully -- the remaining cleaners still run.
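The standard-library steps of the TextCleaner can be sketched as follows. The ftfy encoding fix is deliberately left out so the example has no dependencies. The character classes, the helper names, and the choice to record an operation only when it changed the text are assumptions made for this sketch.

```python
import re
import unicodedata

# 0x00-0x1F except tab (0x09), newline (0x0A), CR (0x0D); plus DEL and C1 controls.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\x80-\x9f]")
_UNICODE_SPACES = re.compile(r"[\u00a0\u2000-\u200a\u202f\u205f\u3000]")


def _normalize_whitespace(text: str) -> str:
    text = _UNICODE_SPACES.sub(" ", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r" {2,}", " ", text)
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text)


def clean_text(content: str) -> tuple[str, list[str]]:
    """Apply the non-ftfy steps in order and report which ones changed the text."""
    steps = [
        ("unicode_normalize", lambda s: unicodedata.normalize("NFC", s)),
        ("control_char_removal", lambda s: _CONTROL_CHARS.sub("", s)),
        ("whitespace_normalize", _normalize_whitespace),
        ("bom_removal", lambda s: s.lstrip("\ufeff")),
    ]
    ops = []
    for name, step in steps:
        updated = step(content)
        if updated != content:
            ops.append(name)
        content = updated
    return content, ops
```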

OCRCleaner

Purpose: Removes artifacts specific to OCR output -- gibberish lines, duplicate paragraphs from multi-column misdetection, and page artifacts (headers, footers, standalone page numbers).

When it activates: Only runs when enable_ocr_cleaning is True in settings (enabled by default) and the source's extraction_method metadata names an OCR-derived pipeline (PDF text extraction, Tesseract OCR, vision-LLM-derived text). Plain .txt / .md, HTML loaders, Office loaders, and similar non-OCR sources skip this cleaner entirely so short identifiers like git, npm, K8s, or awk survive normalization.

The scoping is implemented by an applies_to(metadata) predicate on the cleaner — the registry only invokes cleaners whose predicate returns truthy for the source's metadata. Cleaners without an applies_to predicate (the text and web cleaners) always fire.
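A small sketch of that predicate-based selection is shown below. The extraction-method labels, the two cleaner classes, and the `select_cleaners` function are hypothetical; only the idea of an optional `applies_to(metadata)` predicate comes from the text above.

```python
# Hypothetical labels for OCR-derived extraction methods.
OCR_METHODS = {"pdf_text", "tesseract_ocr", "vision_llm"}


class OcrLikeCleaner:
    name = "ocr"

    def applies_to(self, metadata: dict) -> bool:
        return metadata.get("extraction_method") in OCR_METHODS


class AlwaysOnCleaner:
    # No applies_to predicate, so this cleaner always runs.
    name = "text"


def select_cleaners(cleaners, metadata: dict):
    """Keep cleaners that have no predicate or whose predicate is truthy."""
    selected = []
    for cleaner in cleaners:
        predicate = getattr(cleaner, "applies_to", None)
        if predicate is None or predicate(metadata):
            selected.append(cleaner)
    return selected
```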

Operations (applied in order):

1. Gibberish Line Removal

Uses a hybrid validation approach for short lines (those below the min_line_length setting).

Key design decisions:

  • Batch spell checking -- Words needing validation are collected in a first pass and checked in a single batch via pyspellchecker, avoiding per-line overhead.
  • Structural words allowlist -- Common short words that legitimately appear alone on lines (BY, TO, THE, FOR, etc.) are always kept regardless of length.
  • Roman numerals -- Pattern-matched up to XXXIX and always kept.
  • Known OCR artifacts -- A hardcoded set of single characters and short sequences (i, l, 1, |, ii, ll, hi, ie, etc.) that are almost always OCR noise.
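The design decisions above could combine along these lines. This sketch uses a plain set as a stand-in for the pyspellchecker dictionary, and the word lists, the check order, and the function names are assumptions; the real cleaner's rules may be richer.

```python
import re

STRUCTURAL_WORDS = {"BY", "TO", "THE", "FOR"}  # illustrative subset
KNOWN_ARTIFACTS = {"i", "l", "1", "|", "ii", "ll", "hi", "ie"}
# Uppercase Roman numerals from I to XXXIX.
ROMAN_NUMERAL = re.compile(r"(?=[XVI])X{0,3}(IX|IV|V?I{0,3})")


def classify_short_line(token: str):
    """Return True (keep), False (drop), or None (needs a dictionary lookup)."""
    if token.upper() in STRUCTURAL_WORDS:
        return True
    if ROMAN_NUMERAL.fullmatch(token):
        return True
    if token in KNOWN_ARTIFACTS:
        return False
    return None


def filter_short_lines(lines, min_line_length, dictionary):
    short = [ln.strip() for ln in lines if 0 < len(ln.strip()) < min_line_length]
    verdicts = {token: classify_short_line(token) for token in short}
    # One batch lookup for every undecided token instead of a call per line.
    undecided = {token for token, verdict in verdicts.items() if verdict is None}
    known = {token for token in undecided if token.lower() in dictionary}
    kept = []
    for line in lines:
        token = line.strip()
        if token not in verdicts or verdicts[token] or token in known:
            kept.append(line)
    return kept
```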

For longer lines, traditional heuristics apply:

  • Alpha ratio -- Lines below min_alpha_ratio are removed (too many non-alphabetic characters).
  • Gibberish patterns -- Excessive consonant clusters (5+ consonants without vowels), random capitalization alternation.
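For the longer-line heuristics, a minimal sketch might look like this. The default thresholds, the decision to ignore whitespace when computing the alpha ratio, and treating "y" as a vowel are all assumptions made for the example.

```python
import re

# Five or more consecutive consonants ("y" is treated as a vowel here).
_CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)
# A lowercase letter immediately followed by an uppercase one.
_CASE_FLIP = re.compile(r"[a-z][A-Z]")


def looks_like_gibberish(line: str, min_alpha_ratio: float = 0.5, case_flip_limit: int = 3) -> bool:
    visible = [ch for ch in line if not ch.isspace()]
    if not visible:
        return False
    ratio = sum(ch.isalpha() for ch in visible) / len(visible)
    if ratio < min_alpha_ratio:
        return True  # too many non-alphabetic characters
    if _CONSONANT_RUN.search(line):
        return True  # excessive consonant cluster
    return len(_CASE_FLIP.findall(line)) >= case_flip_limit
```

Heuristics like these can produce false positives: a legitimate word such as "strengths" contains five consecutive consonants, which is one reason the thresholds are configurable.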

2. Duplicate Paragraph Removal

Uses a three-phase approach for O(n) performance:

  1. Exact MD5 hash -- Catches identical duplicates instantly
  2. Normalized hash -- Catches case/whitespace variations (lowercased, whitespace-collapsed)
  3. Simhash fuzzy matching -- Catches OCR character errors. Uses Hamming distance < 10 bits as a pre-filter, then confirms with SequenceMatcher against a duplicate_similarity_threshold. Only checks the 50 most recent paragraphs (duplicates from column misdetection are typically adjacent).

Falls back to pure SequenceMatcher comparison if the simhash package is unavailable.
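The phases above can be sketched without the simhash dependency, which corresponds to the SequenceMatcher fallback path. The default threshold of 0.9, splitting paragraphs on blank lines, and the function name are assumptions for this example.

```python
import hashlib
from collections import deque
from difflib import SequenceMatcher


def _md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def remove_duplicate_paragraphs(text: str, similarity_threshold: float = 0.9, window: int = 50):
    seen_exact, seen_normalized = set(), set()
    recent = deque(maxlen=window)  # only the most recent paragraphs are fuzzy-checked
    kept, removed = [], 0
    for paragraph in text.split("\n\n"):
        normalized = " ".join(paragraph.lower().split())
        if not normalized:
            continue
        exact_key, normalized_key = _md5(paragraph), _md5(normalized)
        is_duplicate = (
            exact_key in seen_exact              # phase 1: identical text
            or normalized_key in seen_normalized  # phase 2: case/whitespace variants
            or any(                               # phase 3: fuzzy match
                SequenceMatcher(None, normalized, earlier).ratio() >= similarity_threshold
                for earlier in recent
            )
        )
        if is_duplicate:
            removed += 1
            continue
        seen_exact.add(exact_key)
        seen_normalized.add(normalized_key)
        recent.append(normalized)
        kept.append(paragraph)
    return "\n\n".join(kept), removed
```

Bounding the fuzzy comparison to a fixed window keeps the cost per paragraph constant, which is what makes the overall pass linear in the number of paragraphs.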

3. Page Artifact Removal

Detects and removes:

  • Standalone page numbers -- Patterns like 42, -1-, Page 1, p. 1
  • Repeated short lines -- Lines under 30 characters appearing more than twice (likely headers/footers repeated on each page)

Operations recorded: gibberish_removal:N, duplicate_removal:N, artifact_removal:N (where N is the count of items removed).
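A compact sketch of the page artifact pass follows. The page-number pattern covers the four example shapes listed above and nothing more. The pattern, parameter names, and return shape are assumptions.

```python
import re
from collections import Counter

# Matches standalone lines such as "42", "-1-", "Page 1", and "p. 1".
_PAGE_NUMBER = re.compile(r"^(?:-\s*\d+\s*-|(?:page|p\.)\s*\d+|\d+)$", re.IGNORECASE)


def remove_page_artifacts(text: str, max_repeat_length: int = 30, max_repeats: int = 2):
    lines = text.split("\n")
    counts = Counter(line.strip() for line in lines if line.strip())
    kept, removed = [], 0
    for line in lines:
        token = line.strip()
        is_page_number = bool(_PAGE_NUMBER.match(token))
        # Short lines seen more than max_repeats times are likely headers/footers.
        is_repeated = (
            bool(token) and len(token) < max_repeat_length and counts[token] > max_repeats
        )
        if is_page_number or is_repeated:
            removed += 1
        else:
            kept.append(line)
    return "\n".join(kept), removed
```

The returned count maps naturally onto an operation identifier of the form artifact_removal:N.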

Optional dependencies

The cleaners rely on three optional dependencies that enhance quality but are not required:

  • pyspellchecker -- Dictionary validation for short lines (falls back to structural/allowlist checks only)
  • simhash -- Fast fuzzy duplicate detection (falls back to SequenceMatcher for recent paragraphs)
  • ftfy -- Used by TextCleaner, not OCRCleaner directly

Transformer

After all cleaners run, the MarkdownNormalizer transformer converts the cleaned content into consistent Markdown formatting.

MarkdownNormalizer

Purpose: Ensures uniform Markdown structure regardless of how the content was originally formatted.

When it activates: Only runs when enable_markdown_normalize is True in settings.

Normalization steps:

| Step | What It Does |
| --- | --- |
| Header normalization | Ensures space after # symbols, removes trailing # markers |
| List marker standardization | Converts * and + to - for unordered lists; normalizes 1) to 1. |
| Horizontal rule normalization | Converts various styles (***, ___, - - -) to --- |
| Emphasis normalization | Converts __bold__ to **bold**, _italic_ to *italic* |
| Spacing normalization | Adds blank lines before headers/lists/code blocks; limits consecutive blank lines to 2 |
| JSON wrapping | Wraps raw JSON content in ```json code fences |
| Code wrapping | Wraps detected code in fenced code blocks with language detection (Python, JavaScript, HTML, SQL, JSON) |

The transformer is content-type-aware: JSON and code-specific transformations only apply when the source type matches.
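Several of the line-level steps in the table can be illustrated with regular expressions. The sketch below covers headers, list markers, horizontal rules, and emphasis only; spacing normalization and JSON/code wrapping are omitted. The patterns and function names are assumptions, not the MarkdownNormalizer's actual rules.

```python
import re

FENCE = "`" * 3
_HRULE = re.compile(r"^ {0,3}([-*_])(?:\s*\1){2,}\s*$")
_HEADER = re.compile(r"^(#{1,6})\s*(.*?)\s*#*\s*$")
_BULLET = re.compile(r"^(\s*)[*+]\s+")
_ORDERED = re.compile(r"^(\s*)(\d+)\)\s+")
_BOLD = re.compile(r"__(.+?)__")
# Underscore emphasis that is not part of an identifier like snake_case.
_ITALIC = re.compile(r"(?<!\w)_(?!_)(.+?)(?<!_)_(?!\w)")


def normalize_markdown_line(line: str) -> str:
    if _HRULE.match(line):
        return "---"
    header = _HEADER.match(line)
    if header and header.group(2):
        return f"{header.group(1)} {header.group(2)}"
    line = _BULLET.sub(r"\1- ", line)
    line = _ORDERED.sub(r"\1\2. ", line)
    line = _BOLD.sub(r"**\1**", line)
    return _ITALIC.sub(r"*\1*", line)


def normalize_markdown(text: str) -> str:
    """Normalize line by line, leaving fenced code blocks untouched."""
    out, in_fence = [], False
    for line in text.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line)
        else:
            out.append(normalize_markdown_line(line))
    return "\n".join(out)
```

Horizontal rules are checked before list markers so that a line such as `- - -` is not mistaken for a list item.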

Quality Metrics

After normalization, the service calculates QualityMetrics to assess the output:

| Metric | Weight | Description |
| --- | --- | --- |
| text_ratio | 30% | Ratio of alphabetic characters to total characters. Higher = cleaner text. |
| language_confidence | 30% | Heuristic score based on average word length (3-10 chars is ideal) and presence of common English words. |
| duplicate_ratio | 20% | How much content was removed (inverted: 1.0 - ratio). Higher removal = more duplication was present. |
| structure_score | 20% | Presence of Markdown structure: headers (+0.15), lists (+0.1), paragraphs (+0.1), code blocks (+0.1), tables (+0.05). Base: 0.5. |

The overall score is text_ratio * 0.3 + language_confidence * 0.3 + (1 - duplicate_ratio) * 0.2 + structure_score * 0.2.
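As a worked example of the formula, the sketch below computes a structure score from boolean flags and then combines the four metrics with the stated weights. The function names and the boolean-flag interface are assumptions; only the weights and bonuses come from the table.

```python
def structure_score(has_headers=False, has_lists=False, has_paragraphs=False,
                    has_code_blocks=False, has_tables=False) -> float:
    """Base 0.5 plus the per-feature bonuses from the table, capped at 1.0."""
    score = 0.5
    score += 0.15 if has_headers else 0.0
    score += 0.10 if has_lists else 0.0
    score += 0.10 if has_paragraphs else 0.0
    score += 0.10 if has_code_blocks else 0.0
    score += 0.05 if has_tables else 0.0
    return min(score, 1.0)


def overall_quality(text_ratio: float, language_confidence: float,
                    duplicate_ratio: float, structure: float) -> float:
    """Weighted sum: 30% text, 30% language, 20% inverted duplicates, 20% structure."""
    return (
        text_ratio * 0.3
        + language_confidence * 0.3
        + (1 - duplicate_ratio) * 0.2
        + structure * 0.2
    )
```

For a document with headers and lists (structure 0.75), a text ratio of 0.8, language confidence of 0.9, and a duplicate ratio of 0.1, the terms are 0.24 + 0.27 + 0.18 + 0.15, giving an overall score of 0.84.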

Quality metrics are logged after normalization and available in the NormalizedContent output. The indexing handler logs the average quality score across all documents for monitoring.

Output Structure

The normalization service returns a NormalizedContent dataclass:

from dataclasses import dataclass


@dataclass
class NormalizedContent:
    content: str  # Cleaned and normalized text
    original_content: str  # Original text before normalization
    content_type: ContentType  # Detected or specified content type
    quality_metrics: QualityMetrics
    metadata: dict  # Enriched metadata (includes content_type)

    @property
    def char_count(self) -> int: ...

    @property
    def word_count(self) -> int: ...

The indexing handler joins all normalized documents with "\n\n" to produce the full text for chunking.

Configuration

Normalization behavior is controlled by NormalizerSettings (defined in chaoscypher_core.settings).

Settings now reach the cleaners

Pre-W5, the operator's NormalizerSettings were silently ignored — the indexing handler instantiated cleaners with default settings regardless of what was configured. As of May 2026 the handler threads the resolved NormalizerSettings through IndexingService to every cleaner instance, so flipping enable_ocr_cleaning: false in settings.yaml actually disables the OCR cleaner instead of pretending to.

| Setting | Default | Effect |
| --- | --- | --- |
| enable_encoding_fix | True | Enable ftfy encoding repair |
| enable_unicode_normalize | True | Enable NFC unicode normalization |
| enable_control_char_removal | True | Strip non-printable characters |
| enable_whitespace_normalize | True | Normalize whitespace and line endings |
| enable_ocr_cleaning | True | Enable OCR artifact removal |
| enable_duplicate_removal | True | Enable paragraph deduplication |
| enable_markdown_normalize | True | Enable markdown formatting normalization |
| min_line_length | varies | Threshold for short line gibberish detection |
| min_alpha_ratio | varies | Minimum alphabetic character ratio for longer lines |
| duplicate_similarity_threshold | varies | SequenceMatcher threshold for fuzzy dedup |
| target_format | "markdown" | Output format for web extraction |

Custom Cleaners

Custom cleaners can be injected into ContentNormalizerService via the cleaners constructor parameter:

from chaoscypher_core.services.sources.normalizer.cleaners import CleanerResult

# ContentNormalizerService, WebCleaner, TextCleaner, and settings are assumed
# to be imported or constructed elsewhere in the calling module.


class MyCustomCleaner:
    @property
    def name(self) -> str:
        return "my_custom_cleaner"

    def clean(self, content: str, metadata: dict | None = None) -> CleanerResult:
        cleaned = content.replace("REDACTED", "[REMOVED]")
        return CleanerResult(
            content=cleaned,
            ops=["custom_redaction"],
            # "[REMOVED]" is one character longer than "REDACTED", so clamp at 0
            # to keep the counter from going negative.
            chars_removed=max(0, len(content) - len(cleaned)),
        )


# Use custom cleaner pipeline
service = ContentNormalizerService(
    cleaners=[WebCleaner(settings), TextCleaner(settings), MyCustomCleaner()]
)

The custom cleaner must satisfy CleanerProtocol (runtime-checkable via isinstance()). Counts you populate flow through to the source's quality counters; leaving them at 0 is fine when the cleaner doesn't naturally track that signal.

Code Locations

| Component | Path |
| --- | --- |
| ContentNormalizerService | packages/core/src/chaoscypher_core/services/sources/normalizer/service.py |
| Normalizer __init__ (barrel) | packages/core/src/chaoscypher_core/services/sources/normalizer/__init__.py |
| Models (ContentType, QualityMetrics, NormalizedContent) | packages/core/src/chaoscypher_core/services/sources/normalizer/models.py |
| CleanerProtocol | packages/core/src/chaoscypher_core/services/sources/normalizer/cleaners/base.py |
| TextCleaner | packages/core/src/chaoscypher_core/services/sources/normalizer/cleaners/text_cleaner.py |
| OCRCleaner | packages/core/src/chaoscypher_core/services/sources/normalizer/cleaners/ocr_cleaner.py |
| WebCleaner | packages/core/src/chaoscypher_core/services/sources/normalizer/cleaners/web_cleaner.py |
| TransformerProtocol | packages/core/src/chaoscypher_core/services/sources/normalizer/transformers/base.py |
| MarkdownNormalizer | packages/core/src/chaoscypher_core/services/sources/normalizer/transformers/markdown_transformer.py |
| Indexing handler (calls normalizer) | packages/core/src/chaoscypher_core/operations/importing/indexing_handler.py |