
Quality counters

The quality counter system records, on every source row, what the extraction pipeline silently dropped, deduplicated, or merged. Fifteen typed counters cover every silent-drop site from loader to commit. This page is the internal architecture reference; for the user-facing explanation see the Data Quality tab in the user guide and the Quality Metrics API for the field-level reference.

Why counters

Every stage of the pipeline has at least one defensible reason to remove content — duplicate paragraphs, boilerplate chunks, malformed LLM lines, relationships pointing at non-existent entities. Pre-W2, those drops were invisible. The quality grade told you the graph was small without telling you whether 8 entities meant "the source was short" or "the cleaner ate 90% of the input."

Counters make every silent drop legible without changing pipeline behaviour: they are pure observability, never control flow.

The QualityCounter enum

Every counter column lives in one place — the QualityCounter StrEnum in chaoscypher_core.services.quality.counters:

from enum import StrEnum


class QualityCounter(StrEnum):
    # Each value is the name of a counter column on the sources table.
    LOADER_WARNINGS = "loader_warnings_count"
    LOADER_FILES_SKIPPED = "loader_files_skipped"
    CLEANER_LINES_REMOVED = "cleaner_lines_removed"
    CLEANER_PARAGRAPHS_DEDUPLICATED = "cleaner_paragraphs_deduplicated"
    CLEANER_CHARS_REMOVED = "cleaner_chars_removed"
    CHUNKS_FILTERED = "chunks_filtered_count"
    LLM_CHUNKS_TRUNCATED = "llm_chunks_truncated"
    LLM_CHUNKS_ABORTED_BY_LOOP = "llm_chunks_aborted_by_loop"
    PARSER_LINES_DROPPED = "parser_lines_dropped"
    DEDUP_ENTITIES_MERGED = "dedup_entities_merged"
    STRUCTURAL_ENTITIES_FILTERED = "structural_entities_filtered"
    ORPHAN_ENTITIES_FILTERED = "orphan_entities_filtered"
    RELATIONSHIPS_DROPPED_INVALID = "relationships_dropped_invalid"
    RELATIONSHIPS_DROPPED_CAPPED = "relationships_dropped_capped"
    CITATIONS_SKIPPED_NO_CHUNK_INDEX = "citations_skipped_no_chunk_index"

Each member's string value is the name of its SQL column on sources. Keeping the enum in lockstep with the columns, and passing enum members rather than raw strings at increment sites, means a typo in a stage's increment call surfaces as a static type error rather than a silent miss.

Where counters fire

| Counter | Stage | Increment site (illustrative) |
| --- | --- | --- |
| loader_warnings_count | Loading | json_loader.py per-line parse failure; archive handlers when an entry is unreadable |
| loader_files_skipped | Loading | archive_loader.py when an entry is rejected for size / extension / security |
| cleaner_lines_removed | Normalization | OCRCleaner._remove_gibberish_lines and _remove_page_artifacts |
| cleaner_paragraphs_deduplicated | Normalization | OCRCleaner._remove_duplicate_paragraphs |
| cleaner_chars_removed | Normalization | TextCleaner.clean (net delta from before/after lengths) |
| chunks_filtered_count | Chunking | ChunkingService._create_small_chunks, incremented per merge event when a chunk smaller than min_chunk_size is coalesced into a neighbor (the column name predates the W5 follow-up; it now records merges, not drops) |
| llm_chunks_truncated | LLM extraction | _consume_extraction_stream when finish_reason normalizes to length |
| llm_chunks_aborted_by_loop | LLM extraction | _consume_extraction_stream when detector.aborted is true |
| parser_lines_dropped | LLM extraction | line_parser.parse_extraction_output via the stats kwarg |
| dedup_entities_merged | Post-extraction | EntityProcessor.deduplicate_entities_with_mapping |
| structural_entities_filtered | Post-extraction | apply_structural_and_normalization (Cortex / Neuron / CLI / MCP) |
| orphan_entities_filtered | Commit | commit/service.drop_orphan_entities |
| relationships_dropped_invalid | Post-extraction / Commit | Index-validation passes in extractor + commit |
| relationships_dropped_capped | Post-extraction | Per-entity / same-source-type / total-ratio cap enforcement |
| citations_skipped_no_chunk_index | Commit | _create_source_citations and _create_relationship_citations when chunk_index is missing |

The increment helper

All increment sites go through a single typed entry point:

from chaoscypher_core.services.quality.counters import (
    QualityCounter,
    increment_quality_counter,
)

await increment_quality_counter(
    adapter=adapter,
    source_id=source_id,
    database_name=database_name,
    counter=QualityCounter.CLEANER_LINES_REMOVED,
    n=lines_dropped,
)

The helper is async def purely so call sites in async pipelines can await it inline without an asyncio.to_thread() dance — the underlying SQL is synchronous. Failures are logged at WARNING and swallowed; counter visibility is observability, not control flow.
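The "awaitable but synchronous, log and swallow" shape described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: FakeAdapter and increment_best_effort are hypothetical names, and the in-memory dict stands in for the real SQL write.

```python
import asyncio
import logging

logger = logging.getLogger("quality_counters_sketch")


class FakeAdapter:
    """Hypothetical in-memory adapter; the real adapter writes to SQL."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.rows: dict[str, dict[str, int]] = {}

    def increment_source_counter(self, *, source_id, database_name, column, n):
        if self.fail:
            raise RuntimeError("simulated storage failure")
        row = self.rows.setdefault(source_id, {})
        row[column] = row.get(column, 0) + n


async def increment_best_effort(*, adapter, source_id, database_name, counter, n=1):
    """Awaitable wrapper around a synchronous write; it never raises."""
    try:
        adapter.increment_source_counter(
            source_id=source_id, database_name=database_name, column=counter, n=n
        )
    except Exception:
        # Observability only: log at WARNING and let the pipeline continue.
        logger.warning(
            "quality counter %s not recorded for %s", counter, source_id, exc_info=True
        )


adapter = FakeAdapter()
asyncio.run(
    increment_best_effort(
        adapter=adapter,
        source_id="s1",
        database_name="db",
        counter="cleaner_lines_removed",
        n=3,
    )
)
```

With FakeAdapter(fail=True) the same call logs a warning and returns normally, so a storage failure can cost a counter value but never aborts a pipeline stage.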

The storage adapter implements two methods on SourceLifecycleMixin:

| Method | Purpose |
| --- | --- |
| increment_source_counter(*, source_id, database_name, column, n) | Atomic COALESCE(col, 0) + :n UPDATE on a single allowlisted column. The allowlist is the QualityCounter enum. |
| update_source_columns(*, source_id, database_name, updates) | Bulk-set a dict of columns in one statement. Used by the encoding-set helper, the search-status transitions, and the reset path. |
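The allowlisted COALESCE increment might look like the sketch below, written against an in-memory SQLite database. The table layout, the three-column allowlist, and the omission of database_name are simplifying assumptions; the real adapter's SQL dialect and schema may differ.

```python
import sqlite3

# Assumption: a three-column subset of the counters serves as the allowlist here.
ALLOWED_COLUMNS = frozenset(
    {"loader_warnings_count", "cleaner_lines_removed", "parser_lines_dropped"}
)


def increment_source_counter(conn, *, source_id, column, n):
    """Add n to one allowlisted counter column in a single UPDATE."""
    if column not in ALLOWED_COLUMNS:
        # Column names cannot be bound as parameters, so they are checked first.
        raise ValueError(f"not a counter column: {column!r}")
    conn.execute(
        f"UPDATE sources SET {column} = COALESCE({column}, 0) + :n WHERE id = :id",
        {"n": n, "id": source_id},
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources (id TEXT PRIMARY KEY, loader_warnings_count INTEGER,"
    " cleaner_lines_removed INTEGER, parser_lines_dropped INTEGER)"
)
conn.execute("INSERT INTO sources (id) VALUES ('s1')")  # counters start as NULL

increment_source_counter(conn, source_id="s1", column="cleaner_lines_removed", n=4)
increment_source_counter(conn, source_id="s1", column="cleaner_lines_removed", n=2)
total = conn.execute(
    "SELECT cleaner_lines_removed FROM sources WHERE id = 's1'"
).fetchone()[0]
```

COALESCE lets the first increment succeed on a NULL column, and doing the arithmetic inside the UPDATE avoids a read-modify-write race between concurrent stages. The allowlist check is what makes interpolating the column name into the statement safe.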

Migration 0021

The counter columns plus the upload-settings columns and the vector_indexed_at / vector_indexing_status fields all landed in one migration: 0021_upload_settings_and_quality_counters.py. Adding them together kept the schema in sync with the design intent of the W1+W2 workstreams (what you set is what you get; nothing disappears silently).

The migration adds, on the sources table:

  • 5 upload-settings columns: auto_analyze, enable_normalization, enable_vision, content_filtering, filtering_mode
  • 1 loader-encoding column: loader_encoding_used
  • 15 counter columns (the StrEnum values above)
  • 2 vector-search columns: vector_indexed_at, vector_indexing_status

Reset-on-re-extract

force_re_extract in chaoscypher_core.services.sources.management.re_extraction calls reset_quality_counters(adapter, source_id, database_name), which issues one update_source_columns call that sets every counter, along with the encoding and vector-search fields, back to its post-upload default:

  • All 15 counters → 0
  • loader_encoding_used → null
  • vector_indexed_at → null
  • vector_indexing_status → "pending"
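The reset values listed above amount to a single dict handed to update_source_columns. A minimal sketch of building that dict follows; build_reset_updates and COUNTER_COLUMNS are hypothetical names, and the real reset_quality_counters may assemble its payload differently.

```python
# The 15 counter column names, copied from the QualityCounter values.
COUNTER_COLUMNS = (
    "loader_warnings_count",
    "loader_files_skipped",
    "cleaner_lines_removed",
    "cleaner_paragraphs_deduplicated",
    "cleaner_chars_removed",
    "chunks_filtered_count",
    "llm_chunks_truncated",
    "llm_chunks_aborted_by_loop",
    "parser_lines_dropped",
    "dedup_entities_merged",
    "structural_entities_filtered",
    "orphan_entities_filtered",
    "relationships_dropped_invalid",
    "relationships_dropped_capped",
    "citations_skipped_no_chunk_index",
)


def build_reset_updates() -> dict[str, object]:
    """Return the column values that put a source row back in its post-upload state."""
    updates: dict[str, object] = {name: 0 for name in COUNTER_COLUMNS}
    updates["loader_encoding_used"] = None
    updates["vector_indexed_at"] = None
    updates["vector_indexing_status"] = "pending"
    return updates


reset_updates = build_reset_updates()
```

Building the whole payload first and writing it in one statement means a re-extract never observes a half-reset row.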

The reset is symmetric with the migration's column defaults: as far as the counters are concerned, the row ends up looking exactly like a freshly uploaded source. The cached quality grade columns (cached_quality_grade, cached_avg_entity_quality, etc.) are not in the reset set; they are recomputed on the next finalize, which overwrites the previous values.

Vector-search status transitions

vector_indexing_status is its own state machine, not a counter, but it lives in the same module because the same migration introduced it and the same update_source_columns write path manages it.

Helpers in chaoscypher_core.services.quality.counters:

  • mark_search_indexing_pending — start of post-commit indexing
  • mark_search_indexing_indexed — both node + chunk vector writes succeeded; stamps vector_indexed_at
  • mark_search_indexing_degraded — at least one indexing call raised; commit enqueued a retry
  • mark_search_indexing_failed — sweep worker exhausted retries

All four helpers are best-effort — failure to write the status is logged and swallowed. Status is observability, not control flow.
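The column writes behind the four helpers can be sketched as a single function that builds the dict for update_source_columns. This is illustrative only: status_updates is a hypothetical name, the status strings other than "pending" are inferred from the helper names, and the ISO-8601 timestamp format is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional

# "pending" is the value named in this page; the other three strings are
# assumptions inferred from the helper names.
KNOWN_STATUSES = ("pending", "indexed", "degraded", "failed")


def status_updates(status: str, now: Optional[datetime] = None) -> dict[str, object]:
    """Build the column dict a status helper might pass to update_source_columns."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown vector indexing status: {status!r}")
    updates: dict[str, object] = {"vector_indexing_status": status}
    if status == "indexed":
        # Only the success transition stamps vector_indexed_at.
        stamp = now if now is not None else datetime.now(timezone.utc)
        updates["vector_indexed_at"] = stamp.isoformat()  # assumed ISO-8601 text
    return updates


pending_updates = status_updates("pending")
```

Only the indexed transition touches vector_indexed_at, so a degraded or failed status leaves any earlier timestamp in place.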

See also