Filtering modes

When you upload a source, ChaosCypher's AI extracts entities and relationships from the text. The raw output contains some noise: the AI may repeat itself, hallucinate, or label things with generic types. The filtering mode slider controls how aggressively the pipeline cleans that raw output.

There are six modes, sliding from "keep everything" to "only confident facts." Pick the one that matches the kind of source you're importing.

At a glance

| Mode | Vibe | What's on |
|---|---|---|
| 0 unfiltered | Raw | Nothing (just index validation) |
| 1 minimal | Permissive | Relationship caps |
| 2 lenient | Narrative-friendly | + types, plausibility, structural, orphans |
| 3 balanced | Default | Standard thresholds |
| 4 strict | Structured docs | Strict edge types, tighter caps |
| 5 maximum | Curated | Highest plausibility, tightest caps, smallest aliases |

Three knobs move monotonically as you slide the mode up:

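| Knob | 0 unfiltered | 1 minimal | 2 lenient | 3 balanced | 4 strict | 5 maximum |
|---|---|---|---|---|---|---|
| loop_max_entity_count | 200 | 100 | 75 | 50 | 35 | 25 |
| semantic_dedup_threshold | 0.99 | 0.97 | 0.93 | 0.90 | 0.87 | 0.85 |
| minimum_alias_length | 1 | 1 | 2 | 2 | 3 | 3 |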

loop_max_entity_count aborts a chunk whose LLM stream emits more entity lines than the cap (catches degenerate loops earlier in stricter modes). semantic_dedup_threshold is the cosine-similarity bar for merging two entities; lower means more aggressive merging. minimum_alias_length drops short aliases like "AI" or "ML" in stricter modes to keep the alias index focused on full names.
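
To make the three knobs concrete, here is a minimal sketch in Python. The knob values are copied from the table above. Everything else is an assumption made for illustration, not ChaosCypher's actual code: the `ModeKnobs` container, the helper names, and the choice of `>=` for the merge comparison.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeKnobs:
    """Hypothetical container for the three per-mode knobs."""
    loop_max_entity_count: int
    semantic_dedup_threshold: float
    minimum_alias_length: int


# Values taken from the knob table; the dict itself is illustrative.
KNOBS = {
    "unfiltered": ModeKnobs(200, 0.99, 1),
    "minimal": ModeKnobs(100, 0.97, 1),
    "lenient": ModeKnobs(75, 0.93, 2),
    "balanced": ModeKnobs(50, 0.90, 2),
    "strict": ModeKnobs(35, 0.87, 3),
    "maximum": ModeKnobs(25, 0.85, 3),
}


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_merge(vec_a, vec_b, knobs):
    """Merge two entities when their similarity reaches the mode's bar (assumed >=)."""
    return cosine_similarity(vec_a, vec_b) >= knobs.semantic_dedup_threshold


def keep_aliases(aliases, knobs):
    """Drop aliases shorter than the mode's minimum length."""
    return [a for a in aliases if len(a) >= knobs.minimum_alias_length]


def exceeds_loop_cap(entity_line_count, knobs):
    """True when a chunk emitted more entity lines than the cap allows."""
    return entity_line_count > knobs.loop_max_entity_count
```

Under these assumptions, a pair of entities with cosine similarity of about 0.88 would merge in strict and maximum but stay separate in balanced, and the alias "AI" would survive in balanced but be dropped in strict.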

The six modes

0 — Unfiltered

Keep everything the AI produced, including obvious noise. Useful when you're debugging extraction quality or want to see exactly what the AI saw before any filters touched it.

What's on:

  • Index validation only (relationships pointing at non-existent entities are still dropped — those would crash the graph).

What's off: every other filter.
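
As a rough illustration of index validation, the sketch below assumes relationships refer to entities by list position through hypothetical `source` and `target` keys. The real data shapes may differ.

```python
def is_valid_index(value, size):
    """True for an int that points inside a list of the given size."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value < size


def drop_dangling_relationships(entities, relationships):
    """Keep only relationships whose endpoints both point at an existing entity."""
    size = len(entities)
    return [
        rel for rel in relationships
        if is_valid_index(rel.get("source"), size)
        and is_valid_index(rel.get("target"), size)
    ]
```

Because this check runs even in unfiltered mode, a relationship such as `{"source": 0, "target": 5}` against a two-entity list would be removed in every mode.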

1 — Minimal

Drop only the obviously broken stuff. Tolerant of repetition and unusual entity types.

What's on: relationship caps (loose), invalid-reference checks.

What's off: type constraints, plausibility, structural filtering, orphan filtering.

Use for: raw inspection, fiction with experimental structure, collections where you'd rather audit yourself than trust the system.

2 — Lenient

Forgiving with prose that uses pronouns and loose references. Designed for narrative text where the AI has to do a lot of pronoun resolution.

What changes vs. minimal:

  • Type constraints turned on (relationships must align with their entity types, but rules are forgiving).
  • Plausibility filter turned on, but at low thresholds (0.20 named, 0.10 non-named).
  • Structural filtering turned on (drops "Chapter 1" / "Section A" style noise).
  • Orphan filtering turned on (drops entities with no relationships).

Use for: novels, biographies, narrative non-fiction, journals.
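
The structural and orphan filters described above can be sketched as follows. This is an assumption-laden illustration: the regular expression is a hypothetical stand-in for whatever rules the real structural filter uses, entities are represented by name only, and running the structural pass before the orphan pass is a guess about ordering.

```python
import re

# Hypothetical pattern for headings such as "Chapter 1" or "Section A".
STRUCTURAL_NAME = re.compile(r"^(chapter|section|part|appendix)\s+([0-9]+|[a-z])$", re.IGNORECASE)


def is_structural(name):
    """True when an entity name looks like a document heading, not a real entity."""
    return bool(STRUCTURAL_NAME.match(name.strip()))


def filter_structural_and_orphans(entities, relationships):
    """Drop heading-like entities, then entities left with no relationships.

    entities: list of names; relationships: list of (source_name, target_name).
    """
    kept = [name for name in entities if not is_structural(name)]
    kept_set = set(kept)
    # Relationships touching a removed entity go away with it.
    rels = [(s, t) for s, t in relationships if s in kept_set and t in kept_set]
    connected = {name for pair in rels for name in pair}
    return [name for name in kept if name in connected], rels
```

With this sketch, an entity that was connected only to "Chapter 1" becomes an orphan once the heading is removed, so it is dropped as well.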

3 — Balanced (default)

Sensible filtering for general-purpose documents. Medium evidence and type rules; standard plausibility thresholds.

Use for: blog posts, articles, mixed corpora, "I don't know what kind of document this is, just work."

4 — Strict

Tightens type constraints and visual-content plausibility. Drops relationships that don't match the active domain's edge templates.

What changes vs. balanced:

  • Strict edge-type constraints: relationships whose source/target types don't fit the domain's allowed edge templates are dropped.
  • Visual content plausibility factor raised from 0.5 to 1.0.
  • Per-entity relationship caps tightened (max degree 20, max same-source-type 10, max ratio 6.0).
  • Loop detection triggers earlier (35 entities/chunk vs. 50).
  • Aliases must be 3+ characters.

Use for: structured technical docs, API references, scientific papers, domain-specific corpora where you have well-defined entity and edge templates.
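
A minimal sketch of what strict edge-type constraints and a degree cap could look like. The template triples, entity types, and field names are all hypothetical. Which relationships get dropped once a cap is hit is not documented here, so this version simply keeps the earliest ones. The same-source-type and ratio caps are left out because this page does not define them precisely.

```python
from collections import Counter

# Hypothetical domain templates: (source_type, relationship_type, target_type).
ALLOWED_EDGES = {
    ("Service", "CALLS", "Endpoint"),
    ("Endpoint", "RETURNS", "Schema"),
}


def filter_by_edge_templates(entities, relationships, allowed=ALLOWED_EDGES):
    """Keep relationships whose (source type, edge type, target type) is allowed."""
    kept = []
    for rel in relationships:
        triple = (
            entities[rel["source"]]["type"],
            rel["type"],
            entities[rel["target"]]["type"],
        )
        if triple in allowed:
            kept.append(rel)
    return kept


def cap_degree(relationships, max_degree=20):
    """Keep relationships in order until either endpoint reaches max_degree."""
    degree = Counter()
    kept = []
    for rel in relationships:
        source, target = rel["source"], rel["target"]
        if degree[source] < max_degree and degree[target] < max_degree:
            kept.append(rel)
            degree[source] += 1
            degree[target] += 1
    return kept
```

The default `max_degree=20` mirrors the strict-mode value quoted above.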

5 — Maximum

Highest precision. Drops anything below a high quality bar. Use when you want only confident, well-evidenced facts and you're willing to miss some real content to get there.

What changes vs. strict:

  • Plausibility threshold raised (0.40 named, 0.25 non-named).
  • Per-entity caps tightened further (max degree 15 vs. 20, max same-source-type 7 vs. 10, max ratio 4.0 vs. 6.0).
  • Loop detection triggers earliest (25 entities/chunk).
  • Semantic dedup threshold lowest (0.85), so more aggressive merging.

Use for: building a curated knowledge graph, executive briefings, fact-checking corpora, high-precision search.
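
To show how the two-tier plausibility thresholds play out, the sketch below uses only the values quoted on this page (lenient and maximum). The balanced and strict thresholds are not listed, so they are omitted. The function name, the score scale, and the `>=` comparison are assumptions.

```python
# Thresholds quoted in this page; other modes are intentionally left out.
PLAUSIBILITY_THRESHOLDS = {
    "lenient": {"named": 0.20, "non_named": 0.10},
    "maximum": {"named": 0.40, "non_named": 0.25},
}


def passes_plausibility(score, is_named, mode):
    """True when a score reaches the mode's bar for its named/non-named class."""
    thresholds = PLAUSIBILITY_THRESHOLDS[mode]
    bar = thresholds["named"] if is_named else thresholds["non_named"]
    return score >= bar
```

Under these assumptions, a named item scoring 0.30 would be kept in lenient and dropped in maximum, which is the trade-off of precision over recall that this mode makes.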

How filtering mode interacts with re-extract

Filtering mode is a property of the source, persisted at upload time. Re-extract (force=true) reuses the source's stored filtering mode by default. To re-extract under a different mode, pass filtering_mode=… to the re-extract endpoint or use the "Re-extract with options" button in the UI.
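
A hedged sketch of building the re-extract parameters. Only the `force` and `filtering_mode` parameter names come from the text above. Sending them as a query string and passing the mode by name (not by number) are assumptions, and the endpoint path is deliberately not shown because this page does not give it.

```python
from urllib.parse import urlencode

MODES = ("unfiltered", "minimal", "lenient", "balanced", "strict", "maximum")


def reextract_query(override_mode=None):
    """Build a query string for a re-extract request.

    With no override, only force=true is sent and the source's stored
    filtering mode applies.
    """
    params = {"force": "true"}
    if override_mode is not None:
        if override_mode not in MODES:
            raise ValueError(f"unknown filtering mode: {override_mode!r}")
        params["filtering_mode"] = override_mode
    return urlencode(params)
```

For example, `reextract_query()` yields `force=true`, while `reextract_query("strict")` yields `force=true&filtering_mode=strict`.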

What about my counters?

Every filter that drops content increments a counter on the source's data quality panel (see Quality Analysis). After a re-extract, those counters reset to zero and accumulate fresh stats so you can see exactly what the new mode did differently.