Quality metrics
The quality_metrics block on every SourceResponse
records what the pipeline silently dropped, deduplicated, or merged on
the way to the knowledge graph. There are 15 counters plus three
companion fields (loader_encoding_used, vector_indexed_at, and
vector_indexing_status). Together they feed the Data Quality tab in the
UI and the output of the CLI command chaoscypher source get SOURCE_ID.
This page is the field reference. For the user-facing explanation of when to consult counters vs. the quality grade, see the Data Quality tab page in the user guide.
Where counters live
GET /api/v1/sources/{source_id}
Returns a SourceResponse. The
quality_metrics field has the following shape:
{
  "id": "src_abc123",
  "...": "... other source fields ...",
  "quality_metrics": {
    "loader_encoding_used": "utf-8",
    "loader_warnings_count": 0,
    "loader_files_skipped": 0,
    "cleaner_lines_removed": 142,
    "cleaner_paragraphs_deduplicated": 7,
    "cleaner_chars_removed": 8420,
    "chunks_filtered_count": 3,
    "llm_chunks_truncated": 0,
    "llm_chunks_aborted_by_loop": 1,
    "parser_lines_dropped": 4,
    "dedup_entities_merged": 12,
    "structural_entities_filtered": 2,
    "orphan_entities_filtered": 0,
    "relationships_dropped_invalid": 1,
    "relationships_dropped_capped": 5,
    "citations_skipped_no_chunk_index": 0,
    "vector_indexed_at": "2026-05-08T11:23:14.412Z",
    "vector_indexing_status": "indexed"
  }
}
vector_indexing_status is also surfaced at the top level of
SourceResponse for convenience, so clients that only need the badge
state don't have to drill into quality_metrics.
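As a minimal sketch of consuming this shape from a client, the Python below splits a decoded SourceResponse into its integer counters and its three companion fields. It assumes the response body has already been fetched and parsed; the helper name split_quality_metrics and the abbreviated sample payload are illustrative and not part of the API.

```python
import json

# The three non-counter companion fields named on this page.
COMPANION_FIELDS = (
    "loader_encoding_used",
    "vector_indexed_at",
    "vector_indexing_status",
)


def split_quality_metrics(source: dict) -> tuple[dict, dict]:
    """Return (counters, companions) from a decoded SourceResponse dict."""
    metrics = source.get("quality_metrics") or {}
    companions = {name: metrics.get(name) for name in COMPANION_FIELDS}
    counters = {k: v for k, v in metrics.items() if k not in COMPANION_FIELDS}
    return counters, companions


# Abbreviated sample payload; values are taken from the example above.
sample = json.loads("""
{
  "id": "src_abc123",
  "quality_metrics": {
    "loader_encoding_used": "utf-8",
    "cleaner_lines_removed": 142,
    "llm_chunks_aborted_by_loop": 1,
    "orphan_entities_filtered": 0,
    "vector_indexed_at": "2026-05-08T11:23:14.412Z",
    "vector_indexing_status": "indexed"
  }
}
""")

counters, companions = split_quality_metrics(sample)

# Keep only the counters that actually moved during this run.
nonzero = {name: count for name, count in counters.items() if count}
print(nonzero)  # {'cleaner_lines_removed': 142, 'llm_chunks_aborted_by_loop': 1}
print(companions["vector_indexing_status"])  # indexed
```

Filtering to non-zero counters is usually the quickest way to see which stages changed the source's content.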
Field reference
Loader stage
| Field | Type | Description |
|---|---|---|
| loader_encoding_used | string? | Encoding the loader actually used to decode this file. Values include utf-8, utf-8-bom, cp1252, latin-1-fallback, utf-8-replace, or any encoding label charset-normalizer returns. null until the loader runs. |
| loader_warnings_count | int | Non-fatal loader hiccups the user should know about (a single bad JSONL line, an undecodable archive member, an empty worksheet). The file still loaded. |
| loader_files_skipped | int | Archive entries skipped due to unsupported extension, oversized content, or security validation (path traversal, absolute paths, symlinks). |
Normalization stage
| Field | Type | Description |
|---|---|---|
| cleaner_lines_removed | int | Lines the OCR cleaner dropped: gibberish, standalone page numbers, repeated headers / footers. |
| cleaner_paragraphs_deduplicated | int | Duplicate paragraphs collapsed by the cleaner, typically multi-column OCR producing the same text twice. |
| cleaner_chars_removed | int | Net character delta from the text cleaner (encoding fixes, control-character stripping, whitespace collapse). |
Chunking stage
| Field | Type | Description |
|---|---|---|
| chunks_filtered_count | int | Chunks the content-filter stripped to under 100 characters before extraction. The filtered chunks remain searchable via RAG; only the LLM-bound copies are removed. |
LLM extraction stage
| Field | Type | Description |
|---|---|---|
| llm_chunks_truncated | int | Chunks where the LLM hit its token cap and stopped mid-response. Some content was extracted; some was lost. Drives the length finish-reason. |
| llm_chunks_aborted_by_loop | int | Chunks where the streaming loop detector aborted the LLM (degenerate repetition, out-of-bounds indices, runaway entity counts). |
| parser_lines_dropped | int | Malformed E\| / R\| / P\| lines the parser couldn't make sense of (missing fields, out-of-bounds indices, invalid numerics). |
Post-extraction
| Field | Type | Description |
|---|---|---|
| dedup_entities_merged | int | Entities collapsed into another entity by exact-name or semantic deduplication. Highest-confidence record wins; aliases / properties merge. |
| structural_entities_filtered | int | Entities representing document structure (chapters, sections, page headers) removed by the structural filter. Configured per-domain via is_structural. |
| relationships_dropped_invalid | int | Relationships pointing at non-existent entity indices. Catches LLM index-skew before it hits the graph. |
| relationships_dropped_capped | int | Relationships dropped by the per-entity degree cap, the same-source-type cap, or the total-ratio cap from the active filtering mode. |
Commit stage
| Field | Type | Description |
|---|---|---|
| orphan_entities_filtered | int | Entities that survived deduplication but had zero relationships, dropped at commit when orphan filtering is enabled by the active filtering mode. |
| citations_skipped_no_chunk_index | int | Entity / relationship citations skipped because the underlying record had no chunk index to point at. |
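The stage groupings in the tables above can be captured as a small lookup, which makes it easy to report which pipeline stages touched a source. This is a sketch only: the stage keys and the stages_with_activity helper are illustrative names, not identifiers from the API. Counters are reported individually rather than summed, since they measure different units (lines, characters, chunks, entities).

```python
# Counter names grouped under the stage headings used on this page.
STAGE_COUNTERS = {
    "loader": ["loader_warnings_count", "loader_files_skipped"],
    "normalization": [
        "cleaner_lines_removed",
        "cleaner_paragraphs_deduplicated",
        "cleaner_chars_removed",
    ],
    "chunking": ["chunks_filtered_count"],
    "llm_extraction": [
        "llm_chunks_truncated",
        "llm_chunks_aborted_by_loop",
        "parser_lines_dropped",
    ],
    "post_extraction": [
        "dedup_entities_merged",
        "structural_entities_filtered",
        "relationships_dropped_invalid",
        "relationships_dropped_capped",
    ],
    "commit": ["orphan_entities_filtered", "citations_skipped_no_chunk_index"],
}


def stages_with_activity(metrics: dict) -> dict:
    """Map each stage to its non-zero counters; quiet stages are omitted."""
    result = {}
    for stage, names in STAGE_COUNTERS.items():
        active = {n: metrics[n] for n in names if metrics.get(n, 0)}
        if active:
            result[stage] = active
    return result


# Abbreviated quality_metrics dict for illustration.
example = {
    "loader_warnings_count": 0,
    "cleaner_lines_removed": 142,
    "chunks_filtered_count": 3,
}
print(stages_with_activity(example))
```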
Vector search status
| Field | Type | Description |
|---|---|---|
| vector_indexed_at | datetime? | When the post-commit vector indexing call succeeded; null until then. |
| vector_indexing_status | string | pending, indexed, degraded, or failed. See Search Status. |
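A client that needs to wait for indexing can poll the status field. The sketch below assumes that pending is the only non-final state, which this page does not state explicitly, and that a missing field can be read as pending. The wait_for_indexing helper and its fetch_source parameter are hypothetical; fetch_source stands in for whatever code retrieves and decodes the SourceResponse.

```python
import time
from typing import Callable

# The four states listed in the table above.
VECTOR_STATES = {"pending", "indexed", "degraded", "failed"}


def wait_for_indexing(
    fetch_source: Callable[[], dict],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
) -> str:
    """Poll until vector_indexing_status leaves "pending" or time runs out.

    Returns the last status seen, which is still "pending" on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Assumption: a missing field is treated the same as "pending".
        status = fetch_source().get("vector_indexing_status", "pending")
        if status not in VECTOR_STATES:
            raise ValueError(f"unexpected vector_indexing_status: {status!r}")
        if status != "pending" or time.monotonic() >= deadline:
            return status
        time.sleep(interval_s)


# Stubbed responses standing in for two successive API calls.
responses = iter([
    {"vector_indexing_status": "pending"},
    {"vector_indexing_status": "indexed"},
])
print(wait_for_indexing(lambda: next(responses), interval_s=0.0))  # indexed
```

Passing the fetch step in as a callable keeps the polling logic independent of any particular HTTP client.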
Reset behavior
Every counter and all three companion fields reset to their
post-upload defaults when you trigger force_re_extract:
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/re_extract
Reset values:
- All 15 counters → 0
- loader_encoding_used → null
- vector_indexed_at → null
- vector_indexing_status → "pending"
This is intentional: re-extract is the moment to compare a new run
against the old one. Take a snapshot of quality_metrics before
calling re-extract if you want to diff against the new values.
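One way to do that comparison, sketched in Python: keep the quality_metrics dict from before the re-extract, fetch it again once the new run finishes, and report the counters whose values differ. The diff_counters helper is an illustrative name, and the sample values are made up for the example.

```python
def diff_counters(before: dict, after: dict) -> dict:
    """Return {counter: (old, new)} for integer counters whose value changed."""
    changed = {}
    for name in sorted(set(before) & set(after)):
        old, new = before[name], after[name]
        # Companion fields hold strings, timestamps, or null; compare ints only.
        if isinstance(old, int) and isinstance(new, int) and old != new:
            changed[name] = (old, new)
    return changed


# Snapshot taken before re-extract, and the values after the new run.
before = {
    "parser_lines_dropped": 4,
    "dedup_entities_merged": 12,
    "loader_encoding_used": "utf-8",
}
after = {
    "parser_lines_dropped": 0,
    "dedup_entities_merged": 12,
    "loader_encoding_used": "utf-8",
}
print(diff_counters(before, after))  # {'parser_lines_dropped': (4, 0)}
```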
The cached quality grade (cached_quality_grade,
cached_avg_entity_quality, etc.) is recomputed and persisted
separately — those fields are not in the reset set.
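The reset values listed above can also be checked programmatically, for example to confirm that a re-extract request took effect before new values arrive. This is a sketch under one simplifying assumption: every field in quality_metrics other than the three companion fields is a counter. The is_reset helper is an illustrative name.

```python
# Post-upload defaults for the three companion fields, as listed above.
COMPANION_RESET = {
    "loader_encoding_used": None,
    "vector_indexed_at": None,
    "vector_indexing_status": "pending",
}


def is_reset(metrics: dict) -> bool:
    """True if metrics matches the post-upload defaults described above."""
    for name, expected in COMPANION_RESET.items():
        if name not in metrics or metrics[name] != expected:
            return False
    # Assumption: every remaining field is a counter and must be zero.
    return all(v == 0 for k, v in metrics.items() if k not in COMPANION_RESET)


fresh = {
    "loader_encoding_used": None,
    "parser_lines_dropped": 0,
    "vector_indexed_at": None,
    "vector_indexing_status": "pending",
}
print(is_reset(fresh))  # True
print(is_reset({**fresh, "parser_lines_dropped": 4}))  # False
```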
See also
- Sources API: full source endpoint reference
- Data Quality tab (user guide): when to consult counters vs. the grade
- Search Status (user guide): the four vector_indexing_status states
- Filtering Modes: how the mode you pick changes which counters move