Quality metrics

The quality_metrics block on every SourceResponse records what the pipeline silently dropped, deduplicated, or merged on the way to the knowledge graph. There are 15 counters plus three companion fields. They back the Data Quality tab in the UI and the data the CLI exposes via chaoscypher source get SOURCE_ID.

This page is the field reference. For the user-facing explanation of when to consult counters vs. the quality grade, see the Data Quality tab page in the user guide.

Where counters live

GET /api/v1/sources/{source_id}

Returns a SourceResponse. The quality_metrics field has the following shape:

{
  "id": "src_abc123",
  "...": "... other source fields ...",
  "quality_metrics": {
    "loader_encoding_used": "utf-8",
    "loader_warnings_count": 0,
    "loader_files_skipped": 0,
    "cleaner_lines_removed": 142,
    "cleaner_paragraphs_deduplicated": 7,
    "cleaner_chars_removed": 8420,
    "chunks_filtered_count": 3,
    "llm_chunks_truncated": 0,
    "llm_chunks_aborted_by_loop": 1,
    "parser_lines_dropped": 4,
    "dedup_entities_merged": 12,
    "structural_entities_filtered": 2,
    "orphan_entities_filtered": 0,
    "relationships_dropped_invalid": 1,
    "relationships_dropped_capped": 5,
    "citations_skipped_no_chunk_index": 0,
    "vector_indexed_at": "2026-05-08T11:23:14.412Z",
    "vector_indexing_status": "indexed"
  }
}

vector_indexing_status is also surfaced at the top level of SourceResponse for convenience, so clients that only need the badge state don't have to drill into quality_metrics.
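As a minimal sketch of reading this block from a client, the Python below fetches a source and lists the counters that are above zero. It assumes the server is reachable at http://localhost:8080 (the host used in the curl example on this page) and that no authentication header is needed; both are assumptions, and the helper names fetch_source and nonzero_counters are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: same host and port as the curl example on this page.
BASE_URL = "http://localhost:8080"


def fetch_source(source_id, base_url=BASE_URL):
    """GET /api/v1/sources/{source_id} and return the decoded SourceResponse."""
    with urlopen(f"{base_url}/api/v1/sources/{source_id}") as resp:
        return json.load(resp)


def nonzero_counters(source):
    """Return only the integer counters in quality_metrics that are above zero.

    String and null companion fields (encoding, timestamps, status) are skipped.
    """
    metrics = source.get("quality_metrics") or {}
    return {
        name: value
        for name, value in metrics.items()
        if isinstance(value, int) and not isinstance(value, bool) and value > 0
    }
```

Calling nonzero_counters(fetch_source("src_abc123")) would give a quick view of which stages dropped or merged something, without listing every zero.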

Field reference

Loader stage

| Field | Type | Description |
| --- | --- | --- |
| loader_encoding_used | string? | Encoding the loader actually used to decode this file. Values include utf-8, utf-8-bom, cp1252, latin-1-fallback, utf-8-replace, or any encoding label charset-normalizer returns. null until the loader runs. |
| loader_warnings_count | int | Non-fatal loader hiccups the user should know about (a single bad JSONL line, an undecodable archive member, an empty worksheet). The file still loaded. |
| loader_files_skipped | int | Archive entries skipped due to unsupported extension, oversized content, or security validation (path traversal, absolute paths, symlinks). |

Normalization stage

| Field | Type | Description |
| --- | --- | --- |
| cleaner_lines_removed | int | Lines the OCR cleaner dropped: gibberish, standalone page numbers, repeated headers / footers. |
| cleaner_paragraphs_deduplicated | int | Duplicate paragraphs collapsed by the cleaner, typically multi-column OCR producing the same text twice. |
| cleaner_chars_removed | int | Net character delta from the text cleaner (encoding fixes, control-character stripping, whitespace collapse). |

Chunking stage

| Field | Type | Description |
| --- | --- | --- |
| chunks_filtered_count | int | Chunks the content-filter stripped to under 100 characters before extraction. The filtered chunks remain searchable via RAG; only the LLM-bound copies are removed. |

LLM extraction stage

| Field | Type | Description |
| --- | --- | --- |
| llm_chunks_truncated | int | Chunks where the LLM hit its token cap and stopped mid-response. Some content was extracted; some was lost. Drives the length finish-reason. |
| llm_chunks_aborted_by_loop | int | Chunks where the streaming loop detector aborted the LLM (degenerate repetition, out-of-bounds indices, runaway entity counts). |
| parser_lines_dropped | int | Malformed E\| / R\| / P\| lines the parser couldn't make sense of (missing fields, out-of-bounds indices, invalid numerics). |

Post-extraction

| Field | Type | Description |
| --- | --- | --- |
| dedup_entities_merged | int | Entities collapsed into another entity by exact-name or semantic deduplication. Highest-confidence record wins; aliases / properties merge. |
| structural_entities_filtered | int | Entities representing document structure (chapters, sections, page headers) removed by the structural filter. Configured per-domain via is_structural. |
| relationships_dropped_invalid | int | Relationships pointing at non-existent entity indices. Catches LLM index-skew before it hits the graph. |
| relationships_dropped_capped | int | Relationships dropped by the per-entity degree cap, the same-source-type cap, or the total-ratio cap from the active filtering mode. |

Commit stage

| Field | Type | Description |
| --- | --- | --- |
| orphan_entities_filtered | int | Entities that survived deduplication but had zero relationships, dropped at commit when orphan filtering is enabled by the active filtering mode. |
| citations_skipped_no_chunk_index | int | Entity / relationship citations skipped because the underlying record had no chunk index to point at. |
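To see at a glance which pipeline stage touched a source, the 15 counters can be grouped by the stage headings used in this reference. The sketch below is illustrative only: the stage keys (loader, normalization, and so on) are labels chosen here to mirror the section headings, not identifiers from the API, and the helper names are hypothetical.

```python
# Counter names are taken from the field reference; the stage keys are
# illustrative labels that mirror this page's headings, not API values.
STAGE_COUNTERS = {
    "loader": ["loader_warnings_count", "loader_files_skipped"],
    "normalization": [
        "cleaner_lines_removed",
        "cleaner_paragraphs_deduplicated",
        "cleaner_chars_removed",
    ],
    "chunking": ["chunks_filtered_count"],
    "llm_extraction": [
        "llm_chunks_truncated",
        "llm_chunks_aborted_by_loop",
        "parser_lines_dropped",
    ],
    "post_extraction": [
        "dedup_entities_merged",
        "structural_entities_filtered",
        "relationships_dropped_invalid",
        "relationships_dropped_capped",
    ],
    "commit": ["orphan_entities_filtered", "citations_skipped_no_chunk_index"],
}


def counters_by_stage(metrics):
    """Regroup a flat quality_metrics dict into {stage: {counter: value}}."""
    return {
        stage: {name: metrics.get(name, 0) for name in names}
        for stage, names in STAGE_COUNTERS.items()
    }


def active_stages(metrics):
    """List the stages, in pipeline order, that have at least one non-zero counter."""
    return [
        stage
        for stage, counters in counters_by_stage(metrics).items()
        if any(counters.values())
    ]
```

The counters are kept separate within each stage rather than summed, because they use different units (lines, characters, chunks, entities).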

Vector search status

| Field | Type | Description |
| --- | --- | --- |
| vector_indexed_at | datetime? | When the post-commit vector indexing call succeeded; null until then. |
| vector_indexing_status | string | pending, indexed, degraded, or failed. See Search Status. |
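A client that needs to know when indexing has settled could poll the top-level vector_indexing_status. The sketch below is one hedged way to do that: it assumes that any value other than pending means the status will not change again on its own, which this page does not state, and the function name, timeout, and interval are hypothetical. The get_source argument is any zero-argument callable that returns the current SourceResponse as a dict.

```python
import time


def wait_for_vector_indexing(get_source, timeout_s=60.0, interval_s=2.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll until vector_indexing_status leaves "pending" or the timeout passes.

    Returns the last status seen. Assumption: "indexed", "degraded", and
    "failed" are all treated as settled states.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_source().get("vector_indexing_status", "pending")
        if status != "pending" or clock() >= deadline:
            return status
        sleep(interval_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting.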

Reset behavior

Every counter and all three companion fields (loader_encoding_used and the two vector_* fields) reset to their post-upload defaults when you trigger force_re_extract:

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/re_extract

Reset values:

  • All 15 counters → 0
  • loader_encoding_used → null
  • vector_indexed_at → null
  • vector_indexing_status → "pending"

This is intentional: re-extract is the moment to compare a new run against the old one. Take a snapshot of quality_metrics before calling re-extract if you want to diff against the new values.
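The snapshot-and-diff step can be sketched in a few lines of Python. The helper names below are hypothetical, and the code works on plain dicts, so it makes no assumptions about how the SourceResponse was fetched.

```python
import copy


def snapshot_metrics(source):
    """Return an independent copy of quality_metrics, taken before re-extract."""
    return copy.deepcopy(source.get("quality_metrics") or {})


def diff_metrics(before, after):
    """Return {field: (old, new)} for every field whose value changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Take the snapshot first, trigger the re-extract, wait for the new run to finish, then pass the fresh quality_metrics as after. Fields that are identical in both runs are left out of the result.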

The cached quality grade (cached_quality_grade, cached_avg_entity_quality, etc.) is recomputed and persisted separately — those fields are not in the reset set.