Quality counters
The quality counter system records, on every source row, what the extraction pipeline silently dropped, deduplicated, or merged. Fifteen typed counters cover every silent-drop site from loader to commit. This page is the internal architecture reference; for the user-facing explanation see the Data Quality tab in the user guide and the Quality Metrics API for the field-level reference.
Why counters
Every stage of the pipeline has at least one defensible reason to remove content — duplicate paragraphs, boilerplate chunks, malformed LLM lines, relationships pointing at non-existent entities. Pre-W2, those drops were invisible. The quality grade told you the graph was small without telling you whether 8 entities meant "the source was short" or "the cleaner ate 90% of the input."
Counters make every silent drop legible without changing pipeline behaviour: they are pure observability, never control flow.
The QualityCounter enum
Every counter column lives in one place — the QualityCounter
StrEnum in chaoscypher_core.services.quality.counters:
```python
from enum import StrEnum


class QualityCounter(StrEnum):
    LOADER_WARNINGS = "loader_warnings_count"
    LOADER_FILES_SKIPPED = "loader_files_skipped"
    CLEANER_LINES_REMOVED = "cleaner_lines_removed"
    CLEANER_PARAGRAPHS_DEDUPLICATED = "cleaner_paragraphs_deduplicated"
    CLEANER_CHARS_REMOVED = "cleaner_chars_removed"
    CHUNKS_FILTERED = "chunks_filtered_count"
    LLM_CHUNKS_TRUNCATED = "llm_chunks_truncated"
    LLM_CHUNKS_ABORTED_BY_LOOP = "llm_chunks_aborted_by_loop"
    PARSER_LINES_DROPPED = "parser_lines_dropped"
    DEDUP_ENTITIES_MERGED = "dedup_entities_merged"
    STRUCTURAL_ENTITIES_FILTERED = "structural_entities_filtered"
    ORPHAN_ENTITIES_FILTERED = "orphan_entities_filtered"
    RELATIONSHIPS_DROPPED_INVALID = "relationships_dropped_invalid"
    RELATIONSHIPS_DROPPED_CAPPED = "relationships_dropped_capped"
    CITATIONS_SKIPPED_NO_CHUNK_INDEX = "citations_skipped_no_chunk_index"
```
The string value matches the SQL column on sources. Keeping the enum
in lockstep with the columns means a typo in a stage's increment call
surfaces as a static type error rather than a silent miss.
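
One way to keep the enum and the schema in lockstep is a test that compares the enum values with the table's column names. The sketch below is illustrative only: it uses a three-member excerpt of the enum, and `missing_counter_columns` is a hypothetical helper, not something the codebase is known to contain.

```python
from enum import Enum

try:
    from enum import StrEnum  # Python 3.11+
except ImportError:  # fallback so the sketch also runs on older interpreters
    class StrEnum(str, Enum):
        pass


class QualityCounter(StrEnum):
    # Excerpt for the sketch; the real enum has fifteen members.
    LOADER_WARNINGS = "loader_warnings_count"
    CLEANER_LINES_REMOVED = "cleaner_lines_removed"
    CHUNKS_FILTERED = "chunks_filtered_count"


def missing_counter_columns(table_columns: set) -> set:
    """Return enum values with no matching column (hypothetical helper)."""
    return {counter.value for counter in QualityCounter} - set(table_columns)


# Column names as they might be read from the sources table (assumed data).
columns = {"id", "loader_warnings_count", "cleaner_lines_removed", "chunks_filtered_count"}
drift = missing_counter_columns(columns)
```

A misspelled member such as `QualityCounter.CLEANER_LINE_REMOVED` is rejected by a type checker and raises `AttributeError` at runtime, whereas a misspelled raw string would simply update nothing.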
Where counters fire
| Counter | Stage | Increment site (illustrative) |
|---|---|---|
| loader_warnings_count | Loading | json_loader.py per-line parse failure; archive handlers when an entry is unreadable |
| loader_files_skipped | Loading | archive_loader.py when an entry is rejected for size / extension / security |
| cleaner_lines_removed | Normalization | OCRCleaner._remove_gibberish_lines and _remove_page_artifacts |
| cleaner_paragraphs_deduplicated | Normalization | OCRCleaner._remove_duplicate_paragraphs |
| cleaner_chars_removed | Normalization | TextCleaner.clean (net delta from before/after lengths) |
| chunks_filtered_count | Chunking | ChunkingService._create_small_chunks, incremented per merge event when a sub-min_chunk_size chunk is coalesced into a neighbor (the column name predates the W5 follow-up; it now records merges, not drops) |
| llm_chunks_truncated | LLM extraction | _consume_extraction_stream when finish_reason normalizes to length |
| llm_chunks_aborted_by_loop | LLM extraction | _consume_extraction_stream when detector.aborted is true |
| parser_lines_dropped | LLM extraction | line_parser.parse_extraction_output via the stats kwarg |
| dedup_entities_merged | Post-extraction | EntityProcessor.deduplicate_entities_with_mapping |
| structural_entities_filtered | Post-extraction | apply_structural_and_normalization (Cortex / Neuron / CLI / MCP) |
| orphan_entities_filtered | Commit | commit/service.drop_orphan_entities |
| relationships_dropped_invalid | Post-extraction / Commit | Index-validation passes in extractor + commit |
| relationships_dropped_capped | Post-extraction | Per-entity / same-source-type / total-ratio cap enforcement |
| citations_skipped_no_chunk_index | Commit | _create_source_citations and _create_relationship_citations when chunk_index is missing |
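
To make the pattern in the table concrete, here is a minimal sketch of a stage that counts what it drops while leaving its output unchanged. `dedupe_paragraphs` is a hypothetical stand-in, not the OCRCleaner or TextCleaner code; it only illustrates a per-event count and the net before/after length delta.

```python
def dedupe_paragraphs(text: str) -> tuple:
    """Drop repeated paragraphs and report what was removed.

    Hypothetical stand-in for a cleaner stage, used only for illustration.
    Returns (cleaned_text, counts) where counts is keyed by counter column.
    """
    seen = set()
    kept = []
    duplicates = 0
    for paragraph in text.split("\n\n"):
        key = paragraph.strip()
        if key in seen:
            duplicates += 1  # a drop that would otherwise be silent
            continue
        seen.add(key)
        kept.append(paragraph)
    cleaned = "\n\n".join(kept)
    counts = {
        "cleaner_paragraphs_deduplicated": duplicates,
        # Net delta computed from the before and after lengths.
        "cleaner_chars_removed": len(text) - len(cleaned),
    }
    return cleaned, counts


cleaned, counts = dedupe_paragraphs("alpha\n\nbeta\n\nalpha")
```

The counts are returned alongside the cleaned text rather than influencing it, which mirrors the rule that counters are observability and never control flow.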
The increment helper
All increment sites go through a single typed entry point:
```python
from chaoscypher_core.services.quality.counters import (
    QualityCounter,
    increment_quality_counter,
)

# Call-site excerpt: this runs inside an async pipeline stage.
await increment_quality_counter(
    adapter=adapter,
    source_id=source_id,
    database_name=database_name,
    counter=QualityCounter.CLEANER_LINES_REMOVED,
    n=lines_dropped,
)
```
The helper is async def purely so call sites in async pipelines can
await it inline without an asyncio.to_thread() dance — the
underlying SQL is synchronous. Failures are logged at WARNING and
swallowed; counter visibility is observability, not control flow.
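
The log-and-swallow behaviour can be sketched as follows. This is an assumption-laden illustration, not the real helper: `increment_quality_counter_sketch`, `RecordingAdapter`, and `FailingAdapter` are hypothetical names, and the counter is passed as a plain column string to keep the block self-contained.

```python
import asyncio
import logging

logger = logging.getLogger("quality.counters.sketch")


class RecordingAdapter:
    """Hypothetical adapter that remembers every increment it receives."""

    def __init__(self):
        self.calls = []

    def increment_source_counter(self, *, source_id, database_name, column, n):
        self.calls.append((source_id, database_name, column, n))


class FailingAdapter:
    """Hypothetical adapter whose write always fails."""

    def increment_source_counter(self, *, source_id, database_name, column, n):
        raise RuntimeError("database unavailable")


async def increment_quality_counter_sketch(*, adapter, source_id, database_name, counter, n):
    """Best-effort increment: a failure is logged at WARNING and swallowed."""
    try:
        # The adapter call is synchronous; the coroutine wrapper only lets
        # async callers await it inline.
        adapter.increment_source_counter(
            source_id=source_id, database_name=database_name, column=counter, n=n
        )
    except Exception:
        logger.warning("counter %s not recorded for source %s", counter, source_id, exc_info=True)


async def main():
    good, bad = RecordingAdapter(), FailingAdapter()
    for adapter in (good, bad):
        await increment_quality_counter_sketch(
            adapter=adapter,
            source_id="src-1",
            database_name="default",
            counter="cleaner_lines_removed",
            n=4,
        )
    return good.calls


calls = asyncio.run(main())
```

The failing adapter produces a warning in the log but does not interrupt `main`, so a counter outage cannot fail an extraction run.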
The storage adapter implements two methods on
SourceLifecycleMixin:
| Method | Purpose |
|---|---|
| increment_source_counter(*, source_id, database_name, column, n) | Atomic COALESCE(col, 0) + :n UPDATE on a single allowlisted column. The allowlist is the QualityCounter enum. |
| update_source_columns(*, source_id, database_name, updates) | Bulk-set a dict of columns in one statement. Used by the encoding-set helper, the search-status transitions, and the reset path. |
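
The atomic increment can be illustrated with SQLite from the standard library. This is a sketch under stated assumptions: `increment_counter` is a hypothetical function, the allowlist is a two-column subset rather than the full enum, and the database_name argument is omitted because the example uses a single in-memory database.

```python
import sqlite3

# Assumption: two columns stand in for the fifteen QualityCounter values.
ALLOWED_COLUMNS = frozenset({"cleaner_lines_removed", "parser_lines_dropped"})


def increment_counter(conn, *, source_id, column, n):
    """Add n to one allowlisted counter column in a single UPDATE."""
    if column not in ALLOWED_COLUMNS:
        # Column names cannot be bound as parameters, so the allowlist is
        # what makes interpolating the name into the statement safe.
        raise ValueError(f"not a counter column: {column!r}")
    conn.execute(
        f"UPDATE sources SET {column} = COALESCE({column}, 0) + :n WHERE id = :id",
        {"n": n, "id": source_id},
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources ("
    "id TEXT PRIMARY KEY, cleaner_lines_removed INTEGER, parser_lines_dropped INTEGER)"
)
conn.execute("INSERT INTO sources (id) VALUES ('src-1')")  # counters start as NULL

increment_counter(conn, source_id="src-1", column="cleaner_lines_removed", n=3)
increment_counter(conn, source_id="src-1", column="cleaner_lines_removed", n=2)
total = conn.execute(
    "SELECT cleaner_lines_removed FROM sources WHERE id = 'src-1'"
).fetchone()[0]
```

Because the read and the write happen inside one UPDATE, two stages incrementing the same column do not need a read-modify-write round trip, and COALESCE lets the first increment land on a NULL column.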
Migration 0021
The counter columns plus the upload-settings columns and the
vector_indexed_at / vector_indexing_status fields all landed in
one migration: 0021_upload_settings_and_quality_counters.py. Adding
them together kept the schema in sync with the design intent of the
W1+W2 workstreams (what you set is what you get; nothing disappears
silently).
The migration adds, on the sources table:
- 5 upload-settings columns: auto_analyze, enable_normalization, enable_vision, content_filtering, filtering_mode
- 1 loader-encoding column: loader_encoding_used
- 15 counter columns (the StrEnum values above)
- 2 vector-search columns: vector_indexed_at, vector_indexing_status
Reset-on-re-extract
force_re_extract in
chaoscypher_core.services.sources.management.re_extraction calls
reset_quality_counters(adapter, source_id, database_name), which
issues one update_source_columns call that sets every counter back to
its post-upload default:
- All 15 counters → 0
- loader_encoding_used → null
- vector_indexed_at → null
- vector_indexing_status → "pending"
The reset is symmetric with the migration's column defaults — the row
ends up looking exactly like a freshly-uploaded source as far as the
counters are concerned. The cached quality grade (cached_quality_grade,
cached_avg_entity_quality, etc.) is not in the reset set; it gets
recomputed on the next finalize and overwrites the previous value.
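
The reset payload described above can be sketched as a single dict of column updates. `build_reset_updates` is a hypothetical name used for illustration; the column names and default values are the ones listed in this section.

```python
# The fifteen counter columns (the QualityCounter string values).
COUNTER_COLUMNS = (
    "loader_warnings_count",
    "loader_files_skipped",
    "cleaner_lines_removed",
    "cleaner_paragraphs_deduplicated",
    "cleaner_chars_removed",
    "chunks_filtered_count",
    "llm_chunks_truncated",
    "llm_chunks_aborted_by_loop",
    "parser_lines_dropped",
    "dedup_entities_merged",
    "structural_entities_filtered",
    "orphan_entities_filtered",
    "relationships_dropped_invalid",
    "relationships_dropped_capped",
    "citations_skipped_no_chunk_index",
)


def build_reset_updates() -> dict:
    """Build the column updates for one bulk write (hypothetical helper)."""
    updates = {column: 0 for column in COUNTER_COLUMNS}
    updates["loader_encoding_used"] = None  # stored as SQL NULL
    updates["vector_indexed_at"] = None
    updates["vector_indexing_status"] = "pending"
    # The cached_* grade columns are deliberately absent: the next finalize
    # recomputes and overwrites them.
    return updates


reset_updates = build_reset_updates()
```

Passing one dict like this to a bulk setter keeps the reset to a single statement, so a reader never observes a row with half of its counters cleared.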
Vector-search status transitions
vector_indexing_status is its own state machine, not a counter, but
it lives in the same module because the same migration introduced it
and the same update_source_columns write path manages it.
Helpers in chaoscypher_core.services.quality.counters:
- mark_search_indexing_pending: start of post-commit indexing
- mark_search_indexing_indexed: both node + chunk vector writes succeeded; stamps vector_indexed_at
- mark_search_indexing_degraded: at least one indexing call raised; commit enqueued a retry
- mark_search_indexing_failed: sweep worker exhausted retries
All four helpers are best-effort — failure to write the status is logged and swallowed. Status is observability, not control flow.
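
The transitions can be sketched as functions that return the column updates for each outcome. This is an illustration only: the function names are hypothetical, and the status strings other than "pending" are assumed from the helper names rather than confirmed values.

```python
from datetime import datetime, timezone

# "pending" appears in this page; the other three strings are assumptions.
PENDING, INDEXED, DEGRADED, FAILED = "pending", "indexed", "degraded", "failed"


def updates_at_indexing_start() -> dict:
    """Post-commit indexing is about to run."""
    return {"vector_indexing_status": PENDING}


def updates_after_indexing(node_ok: bool, chunk_ok: bool) -> dict:
    """Outcome of one indexing attempt over node and chunk vectors."""
    if node_ok and chunk_ok:
        # Both writes succeeded, so the timestamp is stamped as well.
        return {
            "vector_indexing_status": INDEXED,
            "vector_indexed_at": datetime.now(timezone.utc),
        }
    # At least one call raised; a retry is expected to follow.
    return {"vector_indexing_status": DEGRADED}


def updates_after_retries_exhausted() -> dict:
    """The sweep worker gave up on this source."""
    return {"vector_indexing_status": FAILED}


success = updates_after_indexing(node_ok=True, chunk_ok=True)
partial = updates_after_indexing(node_ok=True, chunk_ok=False)
```

Each dict has the shape a bulk column setter such as update_source_columns takes, which is why the status can share a write path with the counters.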
See also
- Data Quality tab (user guide) — when to consult counters vs. the grade
- Quality Metrics API — field-level reference
- Search Status (user guide): the four vector_indexing_status states