Relationship Mapping
Relationships are the edges of the knowledge graph -- they connect entities to each other with typed, directional links. Unlike entities, which are extracted primarily from noun phrases and descriptions, relationships are extracted from the verbs, prepositions, and contextual connections that the LLM identifies between entities within each chunk.
This page covers how relationships flow from initial extraction through to validated, deduplicated edge data ready for the commit phase.
Extraction: Relationships Alongside Entities
Relationships are extracted in the same LLM call as entities, not in a separate pass. Each chunk produces a list of entities and a list of relationships that reference those entities by chunk-local integer indices.
    {
      "entities": [
        {"name": "Einstein", "type": "Person", ...},
        {"name": "Relativity", "type": "Theory", ...}
      ],
      "relationships": [
        {
          "source": 0,
          "target": 1,
          "type": "developed",
          "confidence": 0.95,
          "justification": "Einstein developed the theory of relativity",
          "sent_ref": "s3",
          "chunk_index": 2
        }
      ]
    }
Each relationship carries metadata from extraction:
| Field | Description |
|---|---|
| source / target | Integer indices into the chunk's entity list |
| type | Relationship type label (e.g., developed, located_in, parent_of) |
| confidence | LLM-assigned confidence score (0.0 to 1.0) |
| justification | LLM-generated explanation of why this relationship exists |
| sent_ref | Sentence reference(s) in the source chunk that evidence this relationship |
| chunk_index | Which chunk this relationship was extracted from |
| properties | Optional additional properties |
Index Aggregation Across Chunks
When per-chunk results are aggregated into a global list, relationship indices must be remapped from chunk-local to global scope. The aggregate_chunk_results() function handles this by tracking an entity offset as it concatenates chunk entity lists: each chunk's relationship indices are shifted by the number of entities contributed by the chunks before it.
For example, if chunk 0 contributes 3 entities, chunk 1's indices are shifted by 3, so source=1 becomes source=4 and target=3 becomes target=6.
Implementation: orchestration.py -- aggregate_chunk_results()
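The offset bookkeeping described above can be sketched as follows. This is a minimal illustration, not the actual body of aggregate_chunk_results(): the name aggregate_sketch and the input shape (a list of per-chunk dicts with "entities" and "relationships" keys) are assumptions based on the extraction format shown earlier.

```python
from typing import Any


def aggregate_sketch(chunks: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Concatenate per-chunk results, shifting relationship indices to global scope."""
    entities: list[dict[str, Any]] = []
    relationships: list[dict[str, Any]] = []
    for chunk in chunks:
        # Number of entities contributed by all earlier chunks.
        offset = len(entities)
        entities.extend(chunk.get("entities", []))
        for rel in chunk.get("relationships", []):
            shifted = dict(rel)  # copy so the per-chunk result is not mutated
            shifted["source"] = rel["source"] + offset
            shifted["target"] = rel["target"] + offset
            relationships.append(shifted)
    return {"entities": entities, "relationships": relationships}


# Chunk 0 has 3 entities, so chunk 1's indices are shifted by 3.
chunk0 = {
    "entities": [{"name": "A"}, {"name": "B"}, {"name": "C"}],
    "relationships": [{"source": 0, "target": 2, "type": "related_to"}],
}
chunk1 = {
    "entities": [{"name": "D"}, {"name": "E"}, {"name": "F"}, {"name": "G"}],
    "relationships": [{"source": 1, "target": 3, "type": "related_to"}],
}
merged = aggregate_sketch([chunk0, chunk1])
```

Here the relationship from chunk 1 ends up with source=4 and target=6, matching the example above, while chunk 0's relationship keeps its original indices.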
Reference Resolution
After aggregation, entity indices must survive the deduplication pipeline. Every deduplication stage produces an index_mapping dict that maps old entity indices to new ones (or None for removed entities). Relationships are remapped through each stage via EntityProcessor.remap_relationship_indices().
Remapping Rules
| Condition | Action |
|---|---|
| Both indices map to valid new indices | Remap and keep |
| Either index maps to None (entity removed) | Drop relationship |
| Both indices map to the same new index (entities merged) | Drop as self-loop |
| Non-integer source/target values | Drop with warning |
Self-loop detection is critical: when two entities merge (e.g., "Einstein" and "Albert Einstein" collapse into one), any relationship between them becomes a self-referential edge and is removed.
Implementation: EntityProcessor.remap_relationship_indices() in deduplication/service.py
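The remapping rules in the table can be sketched as a single pass over the relationship list. This is an illustration under stated assumptions rather than the real EntityProcessor method: remap_sketch is a hypothetical name, relationships are assumed to be plain dicts, and an index missing from index_mapping is treated the same as one mapped to None.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def remap_sketch(
    relationships: list[dict[str, Any]],
    index_mapping: dict[int, Optional[int]],
) -> list[dict[str, Any]]:
    """Apply one deduplication stage's index mapping to a relationship list."""
    kept: list[dict[str, Any]] = []
    for rel in relationships:
        src, tgt = rel.get("source"), rel.get("target")
        if not isinstance(src, int) or not isinstance(tgt, int):
            logger.warning("Dropping relationship with non-integer endpoints: %r", rel)
            continue
        # Assumption: an index absent from the mapping counts as removed.
        new_src = index_mapping.get(src)
        new_tgt = index_mapping.get(tgt)
        if new_src is None or new_tgt is None:
            continue  # an endpoint entity was removed
        if new_src == new_tgt:
            continue  # endpoints merged into one entity: self-loop
        kept.append({**rel, "source": new_src, "target": new_tgt})
    return kept


# Entities 0 and 1 merge into 0, entity 2 becomes 1, entity 3 is removed.
mapping = {0: 0, 1: 0, 2: 1, 3: None}
rels = [
    {"source": 0, "target": 2, "type": "developed"},   # kept, remapped to 0 -> 1
    {"source": 0, "target": 1, "type": "same_as"},     # both map to 0: self-loop
    {"source": 2, "target": 3, "type": "cites"},       # target removed
    {"source": "Einstein", "target": 2, "type": "developed"},  # non-integer source
]
remapped = remap_sketch(rels, mapping)
```

Only the first relationship survives, with its endpoints rewritten to the post-merge indices. Running the same function once per deduplication stage, each time with that stage's mapping, keeps the indices aligned with the shrinking entity list.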
Relationship Validation
Before relationships enter the deduplication pipeline, they go through validation in the extraction layer:
- Bounds checking: Source and target indices must be valid integers within the entity list range.
- Self-loop rejection: Relationships where source equals target are dropped immediately.
- Name-based resolution (NLP workflow): When relationships use entity names (from/to fields) instead of integer indices, the validator resolves names to indices via a case-insensitive name/alias index built from the entity list.
Implementation: validate_relationships() in extraction/utils/entity_cleaner.py
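The three checks above might look roughly like the sketch below. It is not the actual validate_relationships() code: validate_sketch is a hypothetical name, and the "aliases" key on entities is an assumption made so the alias lookup has something to read.

```python
from typing import Any, Optional


def validate_sketch(
    relationships: list[dict[str, Any]],
    entities: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Bounds-check, resolve names to indices, and reject self-loops."""
    # Case-insensitive name/alias -> index lookup (first occurrence wins).
    name_index: dict[str, int] = {}
    for i, ent in enumerate(entities):
        for label in [ent.get("name"), *ent.get("aliases", [])]:  # "aliases" is assumed
            if isinstance(label, str):
                name_index.setdefault(label.strip().lower(), i)

    def resolve(value: Any) -> Optional[int]:
        if isinstance(value, int) and not isinstance(value, bool):
            return value if 0 <= value < len(entities) else None
        if isinstance(value, str):
            return name_index.get(value.strip().lower())
        return None

    valid: list[dict[str, Any]] = []
    for rel in relationships:
        src = resolve(rel["source"] if "source" in rel else rel.get("from"))
        tgt = resolve(rel["target"] if "target" in rel else rel.get("to"))
        if src is None or tgt is None:
            continue  # out of bounds or unresolvable name
        if src == tgt:
            continue  # self-loop
        cleaned = {k: v for k, v in rel.items() if k not in ("from", "to")}
        cleaned["source"], cleaned["target"] = src, tgt
        valid.append(cleaned)
    return valid


entities = [
    {"name": "Einstein", "aliases": ["Albert Einstein"]},
    {"name": "Relativity"},
]
candidates = [
    {"source": 0, "target": 1, "type": "developed"},
    {"source": 0, "target": 0, "type": "knows"},       # self-loop
    {"source": 0, "target": 5, "type": "developed"},   # out of bounds
    {"from": "albert einstein", "to": "RELATIVITY", "type": "developed"},
]
validated = validate_sketch(candidates, entities)
```

Two relationships survive, both pointing from index 0 to index 1. The name-based one is rewritten to integer indices so later stages only ever see the index form.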
Relationship Deduplication
After entity deduplication is complete, relationships go through their own deduplication pass:
Exact Triple Dedup
Duplicate (source, target, type) triples are collapsed. When duplicates exist, the relationship with the highest confidence is kept. This commonly occurs when the same relationship is extracted from overlapping chunks.
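A minimal sketch of this step, assuming relationships are dicts with the fields listed earlier and that a missing confidence counts as 0.0 (dedup_exact_sketch is a hypothetical name, not the project's function):

```python
from typing import Any


def dedup_exact_sketch(relationships: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep one relationship per (source, target, type), preferring higher confidence."""
    best: dict[tuple[int, int, str], dict[str, Any]] = {}
    for rel in relationships:
        key = (rel["source"], rel["target"], rel["type"])
        current = best.get(key)
        # Assumption: a missing confidence is treated as 0.0.
        if current is None or rel.get("confidence", 0.0) > current.get("confidence", 0.0):
            best[key] = rel
    return list(best.values())


overlapping = [
    {"source": 0, "target": 1, "type": "developed", "confidence": 0.80, "chunk_index": 2},
    {"source": 0, "target": 1, "type": "developed", "confidence": 0.95, "chunk_index": 3},
    {"source": 1, "target": 0, "type": "developed", "confidence": 0.60, "chunk_index": 3},
]
unique = dedup_exact_sketch(overlapping)
```

The two copies of (0, 1, developed) collapse into the one from chunk 3 with confidence 0.95, while the reversed triple is left alone because direction is part of the key.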
Symmetric Relationship Collapse
For domain-defined symmetric relationship types (e.g., spouse_of, interacts_with, allies_with), the pair (A, B) and (B, A) are semantically identical. The deduplication pass normalizes these by sorting the node pair and keeping only the highest-confidence direction.
(Einstein, Bohr, "collaborates_with", conf=0.9)
(Bohr, Einstein, "collaborates_with", conf=0.85)
--> Keep: (Einstein, Bohr, "collaborates_with", conf=0.9)
Symmetric types are provided by the domain configuration via get_symmetric_relationships().
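The normalization can be sketched by sorting the endpoint pair only when building the grouping key, so the surviving relationship keeps the direction it was extracted with. The function name and the symmetric_types parameter (standing in for whatever get_symmetric_relationships() returns) are assumptions for illustration.

```python
from typing import Any


def collapse_symmetric_sketch(
    relationships: list[dict[str, Any]],
    symmetric_types: set[str],
) -> list[dict[str, Any]]:
    """Treat (A, B) and (B, A) as the same edge for symmetric types."""
    best: dict[tuple[int, int, str], dict[str, Any]] = {}
    for rel in relationships:
        src, tgt, rel_type = rel["source"], rel["target"], rel["type"]
        if rel_type in symmetric_types:
            src, tgt = sorted((src, tgt))  # direction-independent key
        key = (src, tgt, rel_type)
        current = best.get(key)
        if current is None or rel.get("confidence", 0.0) > current.get("confidence", 0.0):
            best[key] = rel
    return list(best.values())


# Index 0 = Einstein, index 1 = Bohr.
pairs = [
    {"source": 1, "target": 0, "type": "collaborates_with", "confidence": 0.85},
    {"source": 0, "target": 1, "type": "collaborates_with", "confidence": 0.90},
    {"source": 1, "target": 0, "type": "mentored", "confidence": 0.70},
]
collapsed = collapse_symmetric_sketch(pairs, {"collaborates_with"})
```

The two collaborates_with edges collapse into the 0.90 one, and the directional "mentored" edge (a made-up type for this example) is untouched because it is not in the symmetric set.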
Inverse-Pair Collapse
For domain-defined inverse relationship pairs (e.g., parent_of/child_of, employs/employed_by), the extraction may produce both directions. Since the commit phase auto-generates inverse edges, having both in the extraction results would create duplicates. The dedup pass removes the inverse direction if the canonical direction already exists.
Inverse pairs are provided by the domain configuration via get_inverse_relationships().
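One way to picture this step is shown below. The sketch assumes the inverse pairs arrive as a dict mapping each canonical type to its inverse; the real return shape of get_inverse_relationships(), and which side counts as canonical, are not specified here, so treat both as assumptions.

```python
from typing import Any


def collapse_inverse_sketch(
    relationships: list[dict[str, Any]],
    inverse_pairs: dict[str, str],
) -> list[dict[str, Any]]:
    """Drop an inverse edge when its canonical counterpart is already present."""
    # Assumed shape: {"parent_of": "child_of"} means parent_of is canonical.
    canonical_of = {inverse: canonical for canonical, inverse in inverse_pairs.items()}
    present = {(r["source"], r["target"], r["type"]) for r in relationships}
    kept: list[dict[str, Any]] = []
    for rel in relationships:
        canonical = canonical_of.get(rel["type"])
        # (B, A, child_of) is redundant if (A, B, parent_of) exists.
        if canonical is not None and (rel["target"], rel["source"], canonical) in present:
            continue
        kept.append(rel)
    return kept


family = [
    {"source": 0, "target": 1, "type": "parent_of"},
    {"source": 1, "target": 0, "type": "child_of"},   # inverse of the edge above
    {"source": 2, "target": 0, "type": "child_of"},   # no canonical counterpart
]
result = collapse_inverse_sketch(family, {"parent_of": "child_of"})
```

The second edge is dropped because the commit phase would regenerate it from the first. The third stays, since removing it would lose information that no canonical edge carries.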
Implementation: deduplicate_relationships() in extraction/utils/entity_cleaner.py
Relationship Flow Through the Pipeline
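The complete journey of a relationship from extraction to commit-ready state covers the stages described above: extraction alongside entities, validation, index aggregation across chunks, remapping through entity deduplication, and relationship deduplication.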
At each entity deduplication stage, relationships are remapped to reflect the new entity indices. By the end, the relationship list contains only unique, validated edges with globally correct indices into the final deduplicated entity list.
Domain-Specific Guidance
The extraction LLM receives domain-specific relationship guidance that shapes what types of relationships it looks for. This guidance comes from the active domain configuration and includes:
- Relationship type definitions: What edge types exist and what they mean (e.g., parent_of, located_in, authored_by).
- Edge templates: Pre-defined templates with descriptions that help the LLM assign consistent types.
- Relationship examples: Concrete examples of entity-relationship-entity triples from the domain.
- Inverse relationship definitions: Which edge types are inverses of each other.
- Symmetric relationship declarations: Which edge types are bidirectional.
This guidance is formatted into the LLM prompt by format_extraction_templates() and format_domain_edge_templates() from the orchestration layer.
Implementation: orchestration.py -- format_extraction_templates(), detect_extraction_domain()
Edge Properties
Each relationship carries properties that are preserved through the pipeline and committed as edge properties on the final graph:
| Property | Source | Purpose |
|---|---|---|
| confidence | LLM extraction | Quality signal for graph queries and UI display |
| justification | LLM extraction | Human-readable explanation of the relationship |
| sent_ref | LLM extraction | Sentence reference for citation tracking |
| chunk_index | Pipeline tracking | Provenance -- which chunk evidenced this relationship |
| inverse_of | Commit phase | Set on auto-generated inverse edges to mark their origin |
The extraction-time properties flow through the pipeline unchanged (deduplication only modifies source/target indices), and inverse_of is added at commit time. All of them are written to graph edges during the commit phase by _build_edge_properties() in the relationship commit handler.