
Extraction Domains

Extraction domains configure how the AI extracts entities and relationships from your documents. Each domain is a .jsonld file that defines entity types, relationship types, detection rules, quality scoring, and LLM guidance -- no Python code required.

How Domains Work

When a document is processed, Chaos Cypher:

  1. Detects the best domain by scoring keywords, file extensions, and patterns against the document content
  2. Injects domain-specific guidance into the LLM extraction prompt
  3. Constrains extraction output using the domain's templates and rules
  4. Validates results using quality scoring, deduplication, and type compatibility

If no domain matches with sufficient confidence, the generic domain is used as a fallback.

Domain Detection

Detection works by sampling document content (first 3 chunks + middle 2 chunks, up to 8000 characters) and scoring each domain:

| Factor | What It Checks | Example |
|---|---|---|
| Keywords | Domain-specific terms in text | "plaintiff", "verdict" for legal |
| File extensions | Source file type | .py for technical |
| Doc type | Metadata-based type hints | "research_paper" for scientific |
| Regex patterns | Structural patterns | Birth/death date ranges for biographical |

The domain with the highest confidence score above its threshold wins. If none match, generic is used.
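
The sampling and scoring steps can be sketched roughly as follows. This is a minimal illustration, not Chaos Cypher's actual implementation: the `DomainRule` class, the scoring formula, the way the "middle" chunks are picked, and the `EXTENSION_BONUS` value are all assumptions. Only the field names `base_score`, `per_keyword_boost`, and `min_threshold` come from the domain file format. Doc-type hints and regex patterns are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class DomainRule:
    """Hypothetical in-memory view of one domain's detection settings."""
    name: str
    keywords: dict            # term -> weight
    extensions: tuple = ()    # e.g. (".py",)
    base_score: float = 0.25
    per_keyword_boost: float = 0.04
    min_threshold: float = 0.4


EXTENSION_BONUS = 0.2  # illustrative value, not taken from the docs


def sample_text(chunks, limit=8000):
    """Join the first 3 chunks and 2 chunks from the middle, capped at `limit` chars."""
    head, rest = chunks[:3], chunks[3:]
    start = max(len(rest) // 2 - 1, 0)
    return "\n".join(head + rest[start:start + 2])[:limit]


def score(rule, text, filename=""):
    """Assumed formula: base score plus a boost per weighted keyword occurrence."""
    lowered = text.lower()
    hits = sum(lowered.count(term) * weight for term, weight in rule.keywords.items())
    ext_match = filename.lower().endswith(rule.extensions)
    if hits == 0 and not ext_match:
        return 0.0  # no evidence for this domain at all
    total = rule.base_score + rule.per_keyword_boost * hits
    if ext_match:
        total += EXTENSION_BONUS
    return min(total, 1.0)


def detect(rules, chunks, filename=""):
    """Return the best-scoring domain at or above its own threshold, else 'generic'."""
    text = sample_text(chunks)
    best_name, best_score = "generic", 0.0
    for rule in rules:
        s = score(rule, text, filename)
        if s >= rule.min_threshold and s > best_score:
            best_name, best_score = rule.name, s
    return best_name
```

Under these assumed numbers, a domain needs four weighted keyword occurrences (0.25 + 4 x 0.04 = 0.41) to clear a 0.4 threshold on keywords alone, which shows how `min_threshold` guards against matching on a single stray term.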

Built-In Domains

| Domain | Description | Strict Types | Density |
|---|---|---|---|
| generic | General-purpose fallback for any content | No | 1.0 |
| technical | Code, APIs, software documentation | Yes | 1.2 |
| scientific | Research papers, studies, experiments | Yes | 1.3 |
| biographical | Biographies, memoirs, life stories | Yes | 1.0 |
| literary | Fiction, poetry, drama | Yes | 1.0 |
| medical | Medical and healthcare content | Yes | 1.2 |
| legal | Legal documents, contracts, regulations | Yes | 1.25 |
| financial | Financial documents, reports, analysis | Yes | 1.2 |
| historical | Historical texts and analysis | Yes | 1.1 |
| political | Political content, speeches, policy | Yes | 1.1 |
| educational | Educational content, courses, curricula | Yes | 1.1 |
| news | News articles, journalism | Yes | 0.95 |
| theological | Religious texts, theology | Yes | 1.15 |
| philosophical | Philosophy, ethics, epistemology | Yes | 1.15 |
| cybersecurity | Security documentation, threats, CVEs | Yes | 1.2 |
| investigation | Criminal/civil investigations, case files, evidence | Yes | 1.3 |

Strict Types means only entity types defined in the domain's templates are allowed -- the LLM cannot invent new types. The generic domain allows any type.

Density controls how many entities and relationships the LLM is expected to extract per chunk. Higher-density domains (scientific: 1.3, investigation: 1.3) produce more detailed graphs. Lower-density domains (news: 0.95) produce leaner, more focused graphs.
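
The Strict Types rule described above can be pictured as a simple filter over the extracted entities. The sketch below is illustrative only: the entity dictionary shape and the choice to drop disallowed entities (rather than, say, remapping them to a nearby type) are assumptions, not documented behavior.

```python
def filter_entity_types(entities, allowed_types, strict):
    """Keep only entities whose type is defined in the domain's node templates.

    `entities` is a list of dicts with a "type" key (hypothetical shape).
    When `strict` is False, as in the generic domain, everything passes through.
    """
    if not strict:
        return list(entities)
    allowed = set(allowed_types)
    return [entity for entity in entities if entity["type"] in allowed]


extracted = [
    {"name": "Aspirin", "type": "medication"},
    {"name": "Dr. Lee", "type": "clinician"},
    {"name": "Page 4", "type": "page_marker"},  # a type the LLM made up
]
kept = filter_entity_types(extracted, ["medication", "clinician"], strict=True)
```

Here `kept` contains the first two entities only, because `page_marker` is not one of the allowed template ids.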

Extraction Limits

Each domain defines hard caps that prevent runaway LLM generation and control graph density:

| Setting | What It Controls | Example Range |
|---|---|---|
| max_entity_degree | Max relationships per entity (in + out combined) | 15 (news) -- 40 (literary) |
| max_same_source_type | Max relationships with same (source, relationship type) pair | 6 (news) -- 12 (literary) |
| max_relationship_ratio | Max relationships as a multiplier of entity count | 5.0 (news) -- 8.0 (most domains) |
| loop_max_entity_count | Max entities per chunk before aborting LLM streaming | 25 (news) -- 50 (literary) |

These limits are enforced in three passes during extraction finalization:

  1. Same-pair cap -- Keeps highest-confidence relationships when a (source entity, relationship type) pair exceeds the limit
  2. Degree cap -- Prevents any single entity from having too many connections, with orphan protection for entities that would otherwise have zero edges
  3. Total cap -- Limits total relationship count to max_relationship_ratio x entity_count, again with orphan protection

The generic domain uses global defaults (degree: 25, same-type: 12, ratio: 8.0, loop: 50). Specialized domains tune these based on content characteristics -- news articles get tighter limits to avoid over-connecting, while literary works allow denser graphs.
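
The three finalization passes can be sketched as below. This is one plausible reading of the description above, not the actual implementation: the `Rel` class, the helper names, and the rule of dropping lowest-confidence relationships first in passes 2 and 3 are assumptions. The defaults mirror the generic domain's values. The sketch also assumes there are no self-loops.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    source: str
    target: str
    type: str
    confidence: float


def _prune(rels, should_drop):
    """Visit relationships from lowest confidence up and drop those flagged by
    `should_drop`, unless removal would leave either endpoint with zero edges."""
    degree = Counter()
    for r in rels:
        degree[r.source] += 1
        degree[r.target] += 1
    kept, dropped = len(rels), set()
    for i, r in sorted(enumerate(rels), key=lambda pair: pair[1].confidence):
        if not should_drop(r, degree, kept):
            continue
        if degree[r.source] <= 1 or degree[r.target] <= 1:
            continue  # orphan protection
        degree[r.source] -= 1
        degree[r.target] -= 1
        kept -= 1
        dropped.add(i)
    return [r for i, r in enumerate(rels) if i not in dropped]


def apply_limits(rels, entity_count, max_entity_degree=25,
                 max_same_source_type=12, max_relationship_ratio=8.0):
    # Pass 1: same-pair cap, keeping the highest-confidence relationships
    # for each (source entity, relationship type) pair.
    by_pair = defaultdict(list)
    for i, r in enumerate(rels):
        by_pair[(r.source, r.type)].append(i)
    keep = set()
    for indexes in by_pair.values():
        indexes.sort(key=lambda i: rels[i].confidence, reverse=True)
        keep.update(indexes[:max_same_source_type])
    rels = [r for i, r in enumerate(rels) if i in keep]

    # Pass 2: degree cap on every entity (in + out combined).
    rels = _prune(rels, lambda r, degree, kept: degree[r.source] > max_entity_degree
                  or degree[r.target] > max_entity_degree)

    # Pass 3: total cap of max_relationship_ratio x entity_count.
    limit = int(max_relationship_ratio * entity_count)
    return _prune(rels, lambda r, degree, kept: kept > limit)
```

Note that orphan protection can leave a hub slightly over its degree cap: when the only remaining candidates for removal are an entity's sole connection, this sketch keeps them rather than disconnecting that entity.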

Entity Exclusion Rules

Each domain specifies what the LLM should not extract to reduce noise:

| Domain | Excluded Items |
|---|---|
| biographical | Bare date ranges ("1920--1985"), generic familial roles ("the father"), source citations ("[1]") |
| educational | Structural markers ("Chapter 1"), boilerplate ("In this chapter you will learn"), generic refs ("the student") |
| financial | Raw numbers alone ("$5M"), ticker symbols without context ("AAPL"), boilerplate disclaimers |
| legal | Paragraph numbers ("Section 3.1"), procedural boilerplate ("hereby"), citation formatting ("Id.", "supra") |
| technical | Import statements, version numbers, code boilerplate, generic comments |
| investigation | Report headers/footers, generic role refs ("the officer"), form instructions |

Content Exclusions

Domains define which content categories to strip before extraction. For example, the technical domain excludes toc, changelog, legal, boilerplate, api_tables, and web_artifacts because these rarely contain extractable entities.

Exclusion configuration in domain .jsonld files:

"content_exclusions": {
  "categories": ["toc", "changelog", "legal", "boilerplate"],
  "custom_patterns": [
    {
      "regex": "^\\s*v?\\d+\\.\\d+",
      "mode": "count",
      "threshold": 3,
      "description": "Version number lists"
    }
  ]
}

  • categories -- References built-in category names (15 available: toc, changelog, legal, bibliography, acknowledgments, boilerplate, metadata, code_blocks, data_tables, math, api_tables, procedural, advertising, web_artifacts, bulk_lists)
  • custom_patterns -- Domain-specific regex patterns with mode (count to exclude whole chunks, line_ratio to strip matching lines) and threshold
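
To make the two modes concrete, here is a rough sketch of how a single custom pattern might be applied to a chunk. The exact threshold semantics are not spelled out above, so this assumes that `count` excludes the chunk once at least `threshold` lines match, and that `line_ratio` strips matching lines once at least a `threshold` fraction of lines match. The function name is hypothetical.

```python
import re
from typing import Optional


def apply_custom_pattern(chunk: str, pattern: dict) -> Optional[str]:
    """Return the chunk (possibly with lines stripped), or None to exclude it."""
    regex = re.compile(pattern["regex"])
    lines = chunk.splitlines()
    matches = [bool(regex.search(line)) for line in lines]

    if pattern["mode"] == "count":
        # Assumed: threshold is a minimum number of matching lines.
        return None if sum(matches) >= pattern["threshold"] else chunk

    if pattern["mode"] == "line_ratio":
        # Assumed: threshold is a minimum fraction of matching lines.
        ratio = sum(matches) / len(lines) if lines else 0.0
        if ratio >= pattern["threshold"]:
            return "\n".join(line for line, hit in zip(lines, matches) if not hit)
        return chunk

    raise ValueError("unknown mode: %r" % pattern["mode"])


version_list = {"regex": r"^\s*v?\d+\.\d+", "mode": "count", "threshold": 3}
changelog = "v1.0.0 initial release\nv1.1.0 bug fixes\n2.0 rewrite\nSome prose"
result = apply_custom_pattern(changelog, version_list)
```

In this example `result` is `None`, since three lines start with a version number. The regex is written as a Python raw string here, which is why it has single backslashes where the JSON file needs doubled ones.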

Custom Domains

Place a .jsonld file in data/plugins/domains/ and it will be auto-discovered on startup:

data/
  plugins/
    domains/
      my_domain.jsonld

Minimal domain:

{
  "@context": { "@vocab": "https://chaoscypher.io/schema/domain#" },
  "@type": "ExtractionDomain",
  "name": "my_domain",
  "version": "1.0.0",
  "description": "Custom domain for my content type",
  "extraction_density": 1.0,
  "strict_entity_types": true,

  "detection": {
    "keywords": {
      "primary": { "terms": ["keyword1", "keyword2"], "weight": 1.0 }
    },
    "confidence": {
      "base_score": 0.25,
      "per_keyword_boost": 0.04,
      "min_threshold": 0.4
    }
  },

  "entity_guidance": "Extract entities relevant to my domain...",
  "relationship_guidance": "Focus on these relationship types...",

  "templates": {
    "node_templates": [
      {
        "id": "my_entity",
        "name": "My Entity",
        "description": "Description of this entity type",
        "quality_score": 20
      }
    ],
    "edge_templates": [
      {
        "id": "my_relationship",
        "name": "relates_to",
        "description": "How entities relate",
        "quality_score": 15
      }
    ]
  },

  "extraction_limits": {
    "max_entity_degree": 20,
    "max_same_source_type": 8,
    "max_relationship_ratio": 6.0,
    "loop_max_entity_count": 35
  }
}

For the full domain configuration schema and advanced patterns like type compatibility groups, property absorption, and evidence validation modes, see the Building Extraction Domains guide.
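
Before dropping a new file into data/plugins/domains/, a quick local sanity check can catch obvious mistakes. The sketch below is not an official validator: the list of required keys is inferred from the minimal example, and the assumption that `min_threshold` lies between 0 and 1 is a guess based on it being a confidence score.

```python
import json

# Inferred from the minimal example; the real schema may require more or fewer keys.
REQUIRED_KEYS = ("@type", "name", "version", "detection", "templates")


def check_domain(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in doc]

    if "@type" in doc and doc["@type"] != "ExtractionDomain":
        problems.append('@type should be "ExtractionDomain"')

    threshold = doc.get("detection", {}).get("confidence", {}).get("min_threshold")
    if threshold is not None and not 0 <= threshold <= 1:
        problems.append("min_threshold should be between 0 and 1 (assumed range)")

    node_templates = doc.get("templates", {}).get("node_templates")
    if doc.get("strict_entity_types") and not node_templates:
        problems.append("strict_entity_types is true but no node_templates are defined")

    for key, value in doc.get("extraction_limits", {}).items():
        if not isinstance(value, (int, float)) or value <= 0:
            problems.append("extraction_limits.%s should be a positive number" % key)

    return problems


def check_domain_file(path: str) -> list:
    """Parse a .jsonld file and run the same checks."""
    with open(path, encoding="utf-8") as handle:
        return check_domain(json.load(handle))
```

A call such as `check_domain_file("data/plugins/domains/my_domain.jsonld")` would return an empty list for the minimal domain shown above. The strict-types check is worth keeping in any variant, since a strict domain with no node templates would leave the LLM with no allowed entity types.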

Reclassifying a Source

If auto-detection chose the wrong domain, or you want to re-run extraction with a different domain after reviewing the initial results, use the reclassify action on the source detail page or call the API directly.

Reclassification:

  1. Resets any prior graph artifacts for committed sources. The reset is atomic, so the graph stays consistent even if the process is interrupted.
  2. Queues a new extraction pass using the specified domain.

# Reclassify src_abc123 under the "medical" domain
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/reclassify \
  -H "Content-Type: application/json" \
  -d '{"domain": "medical"}'
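
For scripted use, the same request can be built with Python's standard library. This sketch mirrors the curl call above and nothing more: the helper names are hypothetical, and because the response format is not described here, the raw response body is returned unparsed.

```python
import json
import urllib.request


def build_reclassify_request(source_id, domain, base_url="http://localhost:8080"):
    """Build the POST request for the reclassify endpoint without sending it."""
    url = "%s/api/v1/sources/%s/reclassify" % (base_url, source_id)
    body = json.dumps({"domain": domain}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def reclassify(source_id, domain, base_url="http://localhost:8080"):
    """Send the request and return the raw response bytes."""
    request = build_reclassify_request(source_id, domain, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `reclassify("src_abc123", "medical")` is then equivalent to the curl command. `urlopen` raises `urllib.error.HTTPError` on a 4xx or 5xx response, which is presumably what a caller would see for a source that is not in an eligible state.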

Eligible source states: indexed (extraction was never run or was cancelled) or committed (full re-extraction).

Reclassifying is preferred over passing a domain at upload time, because you can often only choose a domain sensibly after inspecting the document content.