
Relationship Mapping

Relationships are the edges of the knowledge graph -- they connect entities to each other with typed, directional links. Unlike entities, which are extracted primarily from noun phrases and descriptions, relationships are extracted from the verbs, prepositions, and contextual connections that the LLM identifies between entities within each chunk.

This page covers how relationships flow from initial extraction through to validated, deduplicated edge data ready for the commit phase.

Extraction: Relationships Alongside Entities

Relationships are extracted in the same LLM call as entities, not in a separate pass. Each chunk produces a list of entities and a list of relationships that reference those entities by chunk-local integer indices.

{
  "entities": [
    {"name": "Einstein", "type": "Person", ...},
    {"name": "Relativity", "type": "Theory", ...}
  ],
  "relationships": [
    {
      "source": 0,
      "target": 1,
      "type": "developed",
      "confidence": 0.95,
      "justification": "Einstein developed the theory of relativity",
      "sent_ref": "s3",
      "chunk_index": 2
    }
  ]
}

Each relationship carries metadata from extraction:

| Field | Description |
| --- | --- |
| source / target | Integer indices into the chunk's entity list |
| type | Relationship type label (e.g., developed, located_in, parent_of) |
| confidence | LLM-assigned confidence score (0.0 - 1.0) |
| justification | LLM-generated explanation of why this relationship exists |
| sent_ref | Sentence reference(s) in the source chunk that evidence this relationship |
| chunk_index | Which chunk this relationship was extracted from |
| properties | Optional additional properties |

Index Aggregation Across Chunks

When per-chunk results are aggregated into a global list, relationship indices must be remapped from chunk-local to global scope. The aggregate_chunk_results() function handles this by tracking an entity offset as it concatenates chunk entity lists.

For example, if chunk 0 contributes 3 entities, chunk 1's indices are shifted by 3, so a chunk 1 relationship with source=1 and target=3 becomes source=4 and target=6.

Implementation: orchestration.py -- aggregate_chunk_results()
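
The offset bookkeeping can be sketched as follows. This is a minimal illustration, not the code in orchestration.py: the function name aggregate_chunk_results_sketch and the per-chunk dict shape (entities and relationships keys) are assumptions based on the extraction output shown earlier.

```python
from typing import Any


def aggregate_chunk_results_sketch(chunks: list[dict[str, Any]]) -> dict[str, list]:
    """Concatenate per-chunk results, shifting relationship indices to global scope."""
    entities: list[dict[str, Any]] = []
    relationships: list[dict[str, Any]] = []
    for chunk in chunks:
        # Offset = number of entities contributed by all earlier chunks.
        offset = len(entities)
        entities.extend(chunk.get("entities", []))
        for rel in chunk.get("relationships", []):
            shifted = dict(rel)  # copy so the per-chunk result is not mutated
            shifted["source"] = rel["source"] + offset
            shifted["target"] = rel["target"] + offset
            relationships.append(shifted)
    return {"entities": entities, "relationships": relationships}


# Chunk 0 has 3 entities, so chunk 1's indices shift by 3.
chunk0 = {"entities": [{"name": "A"}, {"name": "B"}, {"name": "C"}], "relationships": []}
chunk1 = {
    "entities": [{"name": "D"}, {"name": "E"}, {"name": "F"}, {"name": "G"}],
    "relationships": [{"source": 1, "target": 3, "type": "related_to"}],
}
merged = aggregate_chunk_results_sketch([chunk0, chunk1])
print(merged["relationships"][0])  # source 1 -> 4, target 3 -> 6
```

Reading the offset before extending the entity list is what keeps each chunk's indices aligned with its own entities in the concatenated list.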

Reference Resolution

After aggregation, entity indices must survive the deduplication pipeline. Every deduplication stage produces an index_mapping dict that maps old entity indices to new ones (or None for removed entities). Relationships are remapped through each stage via EntityProcessor.remap_relationship_indices().

Remapping Rules

| Condition | Action |
| --- | --- |
| Both indices map to valid new indices | Remap and keep |
| Either index maps to None (entity removed) | Drop relationship |
| Both indices map to the same new index (entities merged) | Drop as self-loop |
| Non-integer source/target values | Drop with warning |

Self-loop detection is critical: when two entities merge (e.g., "Einstein" and "Albert Einstein" collapse into one), any relationship between them becomes a self-referential edge and is removed.

Implementation: EntityProcessor.remap_relationship_indices() in deduplication/service.py
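
The remapping rules can be expressed in a few lines. The sketch below is a standalone function rather than the EntityProcessor method itself; its name, the use of the logging module for the warning, and the choice to treat an index missing from the mapping like a removed entity are all assumptions made for illustration.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def remap_relationship_indices_sketch(
    relationships: list[dict[str, Any]],
    index_mapping: dict[int, Optional[int]],
) -> list[dict[str, Any]]:
    """Apply one deduplication stage's index_mapping to a relationship list."""
    remapped = []
    for rel in relationships:
        src, tgt = rel.get("source"), rel.get("target")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not all(isinstance(i, int) and not isinstance(i, bool) for i in (src, tgt)):
            logger.warning("Dropping relationship with non-integer index: %r", rel)
            continue
        new_src = index_mapping.get(src)  # assumption: missing key == removed
        new_tgt = index_mapping.get(tgt)
        if new_src is None or new_tgt is None:
            continue  # an endpoint entity was removed
        if new_src == new_tgt:
            continue  # endpoints merged into one entity: self-loop
        remapped.append({**rel, "source": new_src, "target": new_tgt})
    return remapped


# Entities 0 and 1 merge into new index 0; entity 3 is removed.
mapping = {0: 0, 1: 0, 2: 1, 3: None}
rels = [
    {"source": 0, "target": 2, "type": "developed"},   # kept as (0, 1)
    {"source": 1, "target": 0, "type": "same_as"},     # self-loop after merge
    {"source": 2, "target": 3, "type": "cites"},       # target removed
    {"source": "Einstein", "target": 2, "type": "x"},  # non-integer index
]
remapped_rels = remap_relationship_indices_sketch(rels, mapping)
print(remapped_rels)
```

Because each stage returns a fresh list, the same function can be applied once per deduplication stage with that stage's mapping.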

Relationship Validation

Before relationships enter the deduplication pipeline, they go through validation in the extraction layer:

  1. Bounds checking: Source and target indices must be valid integers within the entity list range.
  2. Self-loop rejection: Relationships where source equals target are dropped immediately.
  3. Name-based resolution (NLP workflow): When relationships use entity names (from/to fields) instead of integer indices, the validator resolves names to indices via a case-insensitive name/alias index built from the entity list.

Implementation: validate_relationships() in extraction/utils/entity_cleaner.py
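
The three checks above can be sketched as a single pass. This is an illustration under stated assumptions, not the code in entity_cleaner.py: the function names are hypothetical, entities are assumed to carry an optional aliases list, and when two entities share a name the first one is assumed to win.

```python
from typing import Any, Optional


def build_name_index(entities: list[dict[str, Any]]) -> dict[str, int]:
    """Map lower-cased names and aliases to entity indices (first entity wins)."""
    index: dict[str, int] = {}
    for i, entity in enumerate(entities):
        for name in [entity.get("name"), *entity.get("aliases", [])]:
            if isinstance(name, str) and name.strip():
                index.setdefault(name.strip().lower(), i)
    return index


def resolve_endpoint(value: Any, name_index: dict[str, int], n_entities: int) -> Optional[int]:
    """Return a valid entity index for an int or name endpoint, else None."""
    if isinstance(value, bool):  # bool is an int subclass; reject it
        return None
    if isinstance(value, int):
        return value if 0 <= value < n_entities else None  # bounds check
    if isinstance(value, str):
        return name_index.get(value.strip().lower())  # case-insensitive lookup
    return None


def validate_relationships_sketch(
    entities: list[dict[str, Any]], relationships: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    name_index = build_name_index(entities)
    valid = []
    for rel in relationships:
        src = resolve_endpoint(rel.get("source", rel.get("from")), name_index, len(entities))
        tgt = resolve_endpoint(rel.get("target", rel.get("to")), name_index, len(entities))
        if src is None or tgt is None:
            continue  # out of bounds or unresolvable name
        if src == tgt:
            continue  # self-loop
        cleaned = {k: v for k, v in rel.items() if k not in ("from", "to")}
        cleaned.update(source=src, target=tgt)
        valid.append(cleaned)
    return valid


entities = [{"name": "Einstein", "aliases": ["Albert Einstein"]}, {"name": "Relativity"}]
candidates = [
    {"source": 0, "target": 1, "type": "developed"},                       # valid
    {"source": 0, "target": 5, "type": "developed"},                       # out of bounds
    {"source": 1, "target": 1, "type": "related_to"},                      # self-loop
    {"from": "albert einstein", "to": "RELATIVITY", "type": "developed"},  # resolved by name
]
validated = validate_relationships_sketch(entities, candidates)
print(validated)
```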

Relationship Deduplication

After entity deduplication is complete, relationships go through their own deduplication pass:

Exact Triple Dedup

Duplicate (source, target, type) triples are collapsed. When duplicates exist, the relationship with the highest confidence is kept. This commonly occurs when the same relationship is extracted from overlapping chunks.
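
A minimal sketch of this step, assuming relationships are plain dicts and that ties on confidence are resolved in favor of the first occurrence (the tie-breaking rule and the function name are assumptions):

```python
from typing import Any


def dedupe_exact_triples(relationships: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the highest-confidence relationship per (source, target, type) triple."""
    best: dict[tuple, dict[str, Any]] = {}
    for rel in relationships:
        key = (rel["source"], rel["target"], rel["type"])
        kept = best.get(key)
        # Strict '>' means the first occurrence wins on equal confidence.
        if kept is None or rel.get("confidence", 0.0) > kept.get("confidence", 0.0):
            best[key] = rel
    # dicts preserve insertion order, so output follows first appearance.
    return list(best.values())


# The same triple extracted from two overlapping chunks.
overlapping = [
    {"source": 0, "target": 1, "type": "developed", "confidence": 0.80, "chunk_index": 2},
    {"source": 0, "target": 1, "type": "developed", "confidence": 0.95, "chunk_index": 3},
    {"source": 0, "target": 2, "type": "born_in", "confidence": 0.70, "chunk_index": 3},
]
unique = dedupe_exact_triples(overlapping)
print(unique)
```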

Symmetric Relationship Collapse

For domain-defined symmetric relationship types (e.g., spouse_of, interacts_with, allies_with), the pair (A, B) and (B, A) are semantically identical. The deduplication pass normalizes these by sorting the node pair and keeping only the highest-confidence direction.

(Einstein, Bohr, "collaborates_with", conf=0.9)
(Bohr, Einstein, "collaborates_with", conf=0.85)
--> Keep: (Einstein, Bohr, "collaborates_with", conf=0.9)

Symmetric types are provided by the domain configuration via get_symmetric_relationships().
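
The normalization can be sketched by keying symmetric relationships on a sorted node pair. The function name and the representation of the symmetric types as a set of strings are assumptions; the return shape of get_symmetric_relationships() is not documented here. Entity indices 0 and 1 stand in for Einstein and Bohr.

```python
from typing import Any


def collapse_symmetric(
    relationships: list[dict[str, Any]], symmetric_types: set[str]
) -> list[dict[str, Any]]:
    """Treat (A, B) and (B, A) as one edge for symmetric types; keep the best one."""
    result: list[dict[str, Any]] = []
    slot: dict[tuple, int] = {}  # (low, high, type) -> position in result
    for rel in relationships:
        if rel["type"] not in symmetric_types:
            result.append(rel)  # directional types pass through untouched
            continue
        low, high = sorted((rel["source"], rel["target"]))
        key = (low, high, rel["type"])
        pos = slot.get(key)
        if pos is None:
            slot[key] = len(result)
            result.append(rel)
        elif rel.get("confidence", 0.0) > result[pos].get("confidence", 0.0):
            result[pos] = rel  # the higher-confidence direction replaces the other
    return result


pairs = [
    {"source": 0, "target": 1, "type": "collaborates_with", "confidence": 0.9},
    {"source": 1, "target": 0, "type": "collaborates_with", "confidence": 0.85},
    {"source": 1, "target": 0, "type": "mentored", "confidence": 0.7},
]
collapsed = collapse_symmetric(pairs, {"collaborates_with"})
print(collapsed)
```

The sorted pair is used only as a lookup key; the surviving relationship keeps its original source and target.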

Inverse-Pair Collapse

For domain-defined inverse relationship pairs (e.g., parent_of/child_of, employs/employed_by), the extraction may produce both directions. Since the commit phase auto-generates inverse edges, having both in the extraction results would create duplicates. The dedup pass removes the inverse direction if the canonical direction already exists.

Inverse pairs are provided by the domain configuration via get_inverse_relationships().
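
As an illustration, the sketch below assumes the inverse pairs arrive as a dict mapping each canonical type to its inverse (for example parent_of to child_of). That shape, the function name, and the choice of which side counts as canonical are assumptions; the actual return value of get_inverse_relationships() may differ.

```python
from typing import Any


def collapse_inverse_pairs(
    relationships: list[dict[str, Any]], inverse_pairs: dict[str, str]
) -> list[dict[str, Any]]:
    """Drop an inverse edge when its canonical counterpart is already present."""
    canonical_of = {inverse: canonical for canonical, inverse in inverse_pairs.items()}
    present = {(r["source"], r["target"], r["type"]) for r in relationships}
    kept = []
    for rel in relationships:
        canonical = canonical_of.get(rel["type"])
        # (B, A, child_of) is redundant when (A, B, parent_of) exists.
        if canonical is not None and (rel["target"], rel["source"], canonical) in present:
            continue
        kept.append(rel)
    return kept


inverse_pairs = {"parent_of": "child_of", "employs": "employed_by"}
extracted = [
    {"source": 0, "target": 1, "type": "parent_of"},
    {"source": 1, "target": 0, "type": "child_of"},  # inverse of the edge above
    {"source": 2, "target": 3, "type": "child_of"},  # no canonical counterpart
]
canonical_only = collapse_inverse_pairs(extracted, inverse_pairs)
print(canonical_only)
```

An inverse edge without a canonical counterpart is kept, since the text only describes removal when the canonical direction already exists.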

Implementation: deduplicate_relationships() in extraction/utils/entity_cleaner.py

Relationship Flow Through the Pipeline

The stages described above make up the complete journey of a relationship from extraction to commit-ready state.

At each entity deduplication stage, relationships are remapped to reflect the new entity indices. By the end, the relationship list contains only unique, validated edges with globally correct indices into the final deduplicated entity list.

Domain-Specific Guidance

The extraction LLM receives domain-specific relationship guidance that shapes what types of relationships it looks for. This guidance comes from the active domain configuration and includes:

  • Relationship type definitions: What edge types exist and what they mean (e.g., parent_of, located_in, authored_by).
  • Edge templates: Pre-defined templates with descriptions that help the LLM assign consistent types.
  • Relationship examples: Concrete examples of entity-relationship-entity triples from the domain.
  • Inverse relationship definitions: Which edge types are inverses of each other.
  • Symmetric relationship declarations: Which edge types are bidirectional.

This guidance is formatted into the LLM prompt by format_extraction_templates() and format_domain_edge_templates() from the orchestration layer.

Implementation: orchestration.py -- format_extraction_templates(), detect_extraction_domain()

Edge Properties

Each relationship carries properties that are preserved through the pipeline and committed as edge properties on the final graph:

| Property | Source | Purpose |
| --- | --- | --- |
| confidence | LLM extraction | Quality signal for graph queries and UI display |
| justification | LLM extraction | Human-readable explanation of the relationship |
| sent_ref | LLM extraction | Sentence reference for citation tracking |
| chunk_index | Pipeline tracking | Provenance -- which chunk evidenced this relationship |
| inverse_of | Commit phase | Set on auto-generated inverse edges to mark their origin |

These properties flow through the entire pipeline unchanged (deduplication only modifies source/target indices) and are written to graph edges during the commit phase by _build_edge_properties() in the relationship commit handler.
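
A hedged sketch of how these properties might be assembled for an edge follows. It is not the actual _build_edge_properties() handler: the function name, the key list, and the assumption that inverse_of stores the type of the original edge are all illustrative.

```python
from typing import Any, Optional

# Properties copied verbatim from the relationship (per the table above).
EDGE_PROPERTY_KEYS = ("confidence", "justification", "sent_ref", "chunk_index")


def build_edge_properties_sketch(
    rel: dict[str, Any], inverse_of: Optional[str] = None
) -> dict[str, Any]:
    """Select the properties to write on a graph edge for one relationship."""
    props = {key: rel[key] for key in EDGE_PROPERTY_KEYS if key in rel}
    if inverse_of is not None:
        # Assumption: the auto-generated inverse edge records the original type.
        props["inverse_of"] = inverse_of
    return props


rel = {
    "source": 0,
    "target": 1,
    "type": "parent_of",
    "confidence": 0.9,
    "justification": "A is described as the parent of B",
    "sent_ref": "s1",
    "chunk_index": 0,
}
forward_props = build_edge_properties_sketch(rel)
inverse_props = build_edge_properties_sketch(rel, inverse_of=rel["type"])
print(forward_props)
print(inverse_props)
```

Index fields such as source and target are left out of the property dict because they identify the edge's endpoints rather than describe it.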