Sources API

Manage document sources -- upload, process, tag, and monitor extraction.

All endpoints are prefixed with /api/v1/sources unless noted otherwise.

Upload & Import

Upload Single File

POST /api/v1/sources

Upload a document via multipart form data. Returns 202 Accepted immediately while indexing and extraction run in the background.

curl -X POST http://localhost:8080/api/v1/sources \
-F "file=@document.pdf"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | Yes | -- | Document file to upload |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| domain | string | No | null | Force extraction domain (e.g. technical, generic). Auto-detected if omitted. |
| enable_normalization | bool | No | true | Normalize content on upload (encoding fixes, whitespace, OCR cleaning). Disable for code or structured data. |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in PDFs and image files. Default: auto-detect based on vision model configuration. |
| content_filtering | bool | No | true | Filter non-essential content (TOC, legal, boilerplate) from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters: unfiltered, minimal, lenient, balanced, strict, maximum. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip upload if identical content already exists (by SHA-256 hash) |

Upload settings are persistent

auto_analyze, enable_normalization, enable_vision, content_filtering, and filtering_mode are persisted on the source row at upload time. Recovery, retry, and re-extract reuse the persisted values by default — clients only re-pass them when they want to override.

Response 202 Accepted -- SourceResponse

{
"id": "src_abc123",
"filename": "document.pdf",
"file_type": "pdf",
"file_size": 204800,
"status": "pending",
"enabled": true,
"extraction_depth": "full",
"created_at": "2026-03-09T12:00:00",
"updated_at": "2026-03-09T12:00:00"
}

Key fields shown above. The full response includes lifecycle timestamps (indexing_*, extraction_*, commit_*), LLM metrics (llm_total_calls, llm_total_input_tokens, etc.), and progress fields (current_step, step_description) — all initially null or 0. See SourceResponse for the complete schema.

Polling for progress

Use GET /api/v1/sources/{id} to poll the source status as it transitions through pending -> indexing -> indexed -> extracting -> extracted -> committing -> committed.
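
The polling loop can be sketched in Python as follows. This is a minimal sketch, not an official client: wait_for_source, fetch_source_http, the polling interval, and the timeout are illustrative names and values, and the base URL is the localhost address used in the curl examples. It treats committed and error as terminal statuses; a source uploaded with extract_entities=false presumably stops earlier, so the terminal set is a parameter.

```python
import json
import time
import urllib.request
from typing import Callable, Collection


def fetch_source_http(source_id: str, base_url: str = "http://localhost:8080") -> dict:
    """GET /api/v1/sources/{id} and decode the JSON body (no auth assumed)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/sources/{source_id}") as resp:
        return json.load(resp)


def wait_for_source(
    fetch_source: Callable[[str], dict],
    source_id: str,
    terminal: Collection[str] = ("committed", "error"),
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the source reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        source = fetch_source(source_id)
        if source["status"] in terminal:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{source_id} is still {source['status']!r} after {timeout_s}s")
        sleep(interval_s)
```

Calling wait_for_source(fetch_source_http, "src_abc123") blocks until the source is committed or has failed. The fetcher is injected so the loop can be exercised without a running server.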


Batch Upload

POST /api/v1/sources/batch

Upload multiple files simultaneously. Returns 202 Accepted.

curl -X POST http://localhost:8080/api/v1/sources/batch \
-F "files=@doc1.pdf" \
-F "files=@doc2.pdf"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| files | file[] | Yes | -- | Multiple document files |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| enable_normalization | bool | No | true | Normalize content on upload |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in PDFs and image files |
| domain | string | No | null | Force extraction domain |
| content_filtering | bool | No | true | Filter non-essential content (TOC, legal, boilerplate) from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip files whose content already exists |

Response 202 Accepted

{
"uploaded": 2,
"failed": 0,
"files": [
{ "id": "src_abc123", "filename": "doc1.pdf", "status": "pending", "..." : "..." },
{ "id": "src_def456", "filename": "doc2.pdf", "status": "pending", "..." : "..." }
],
"errors": []
}

Each item in files is a full SourceResponse. When a file fails, it appears in errors instead:

{
"uploaded": 1,
"failed": 1,
"files": [ { "..." : "..." } ],
"errors": [
{ "filename": "bad.xyz", "error": "Unsupported file type" }
]
}
Batch size limit

Returns 400 if the number of files exceeds the configured max_upload_files limit.
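
Because a batch can partially succeed, a client typically needs to separate accepted sources from rejected files. The helper below is a small sketch of that step; split_batch_result is a hypothetical name, and it relies only on the files and errors fields shown in the sample responses above.

```python
def split_batch_result(response: dict) -> tuple[list[str], dict[str, str]]:
    """Return (accepted source IDs, {rejected filename: error message})."""
    accepted = [item["id"] for item in response.get("files", [])]
    rejected = {err["filename"]: err["error"] for err in response.get("errors", [])}
    return accepted, rejected


# Sample shaped like the partial-failure response above.
sample = {
    "uploaded": 1,
    "failed": 1,
    "files": [{"id": "src_abc123", "filename": "doc1.pdf", "status": "pending"}],
    "errors": [{"filename": "bad.xyz", "error": "Unsupported file type"}],
}
accepted_ids, rejected_files = split_batch_result(sample)
```

The accepted IDs can then be polled individually, while the rejected filenames can be reported back to the user.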


Import from URL

POST /api/v1/sources/url

Fetch a web page, extract clean markdown content, and process it through the standard file pipeline.

curl -X POST http://localhost:8080/api/v1/sources/url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | -- | URL to import (must start with http:// or https://) |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| enable_normalization | bool | No | true | Normalize content on upload |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in fetched HTML / PDFs |
| domain | string | No | null | Force extraction domain |
| content_filtering | bool | No | true | Filter non-essential content from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip if identical content exists |

URL fetcher Content-Type validation

The URL fetcher validates the upstream Content-Type against the same allowlist used for direct file uploads (batching.allowed_content_types). It honors any charset=… parameter in the response header and routes binary responses (PDF, ZIP, DOCX, etc.) to the binary loader path, so application/pdf URLs are not mishandled as HTML.

Response 202 Accepted -- SourceResponse

The response is identical in shape to the single file upload response. The source_type will be webpage and origin_url will contain the imported URL.

| Status | Description |
|---|---|
| 400 | Invalid URL format |
| 422 | Failed to fetch URL or content shorter than 50 characters |
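
A client can reject obviously invalid URLs before sending the request. The sketch below builds the same JSON POST as the curl example using only the Python standard library; build_url_import_request is a hypothetical helper, and the client-side scheme check mirrors the documented http:// or https:// requirement rather than the server's actual validation.

```python
import json
import urllib.request
from urllib.parse import urlparse


def build_url_import_request(base_url: str, page_url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/sources/url request."""
    if urlparse(page_url).scheme not in ("http", "https"):
        raise ValueError("url must start with http:// or https://")
    # Optional parameters (e.g. skip_duplicates) are passed through unchanged.
    body = {"url": page_url, **options}
    return urllib.request.Request(
        f"{base_url}/api/v1/sources/url",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_url_import_request(
    "http://localhost:8080", "https://example.com/article", skip_duplicates=True
)
```

Passing the request to urllib.request.urlopen would send it; a 400 or 422 response surfaces there as urllib.error.HTTPError.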

Source CRUD

List Sources

GET /api/v1/sources

Paginated list of sources with optional filters. Returns PaginatedSourcesResponse containing SourceSummaryResponse items.

curl "http://localhost:8080/api/v1/sources?status=committed&page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default (50) | Items per page (capped at max_page_size) |
| source_type | string | No | null | Filter by source type (pdf, text, csv, webpage, etc.) |
| status | string | No | null | Filter by processing status (pending, indexing, indexed, extracting, extracted, committing, committed, error) |
| enabled | string | No | null | Filter by enabled state: enabled or disabled |
| search | string | No | null | Search in title and origin URL |
| tag_id | string | No | null | Filter by tag ID |

Response 200 OK -- PaginatedSourcesResponse

{
"data": [
{
"id": "src_abc123",
"filename": "research-paper.pdf",
"file_type": "pdf",
"file_size": 204800,
"title": "A Research Paper",
"status": "committed",
"chunk_count": 42,
"extraction_entities_count": 85,
"extraction_relationships_count": 120,
"cached_quality_grade": "A",
"tags": [
{ "id": "tag_001", "name": "Research", "color": "#4dabf5" }
],
"created_at": "2026-03-09T10:00:00"
}
],
"pagination": {
"total": 1,
"page": 1,
"page_size": 20,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}

Each item is a SourceSummaryResponse with additional fields including embedding info, LLM metrics, duration timings, and quality scores.
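
To walk every page, a client can follow pagination.has_next until it is false. The generator below is a minimal sketch of that loop; iter_sources and the injected fetch_page callable are illustrative names, and fetch_page is assumed to return the decoded PaginatedSourcesResponse for a given page number.

```python
from typing import Callable, Iterator


def iter_sources(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every source summary, requesting pages until has_next is false."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["data"]
        if not body["pagination"]["has_next"]:
            break
        page += 1
```

Relying on has_next rather than computing the last page from total keeps the loop correct even if sources are added while it runs.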


Get Source

GET /api/v1/sources/{source_id}

Returns the full source detail including all lifecycle fields, LLM metrics, and user metadata.

curl http://localhost:8080/api/v1/sources/src_abc123
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK -- SourceResponse

See the Upload Single File section for a response example showing the key fields.

| Status | Description |
|---|---|
| 404 | Source not found |

Update Source

PATCH /api/v1/sources/{source_id}

Update mutable source fields.

curl -X PATCH http://localhost:8080/api/v1/sources/src_abc123 \
-H "Content-Type: application/json" \
-d '{
"title": "Updated Title",
"enabled": true,
"user_metadata": { "category": "research" }
}'
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| title | string | No | New display title |
| processing_status | string | No | Override status (ready or error) |
| enabled | bool | No | Enable or disable the source |
| user_metadata | object | No | Arbitrary key-value metadata |

Response 200 OK -- SourceResponse

| Status | Description |
|---|---|
| 404 | Source not found |

Delete Source

DELETE /api/v1/sources/{source_id}

Permanently deletes the source and cascades to all chunks, citations, graph nodes, edges, templates, and search index entries.

curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 204 No Content

| Status | Description |
|---|---|
| 404 | Source not found |

Source Metadata

List Extraction Domains

GET /api/v1/sources/domains

Returns available extraction domains for dropdown selection. Includes built-in domains and any per-database custom domains.

curl http://localhost:8080/api/v1/sources/domains

Response 200 OK

{
"domains": [
{
"name": "generic",
"description": "General-purpose entity extraction",
"builtin": true,
"extraction_density": "medium",
"prompt_tokens": 1200
},
{
"name": "technical",
"description": "Technical documentation and specifications",
"builtin": true,
"extraction_density": "high",
"prompt_tokens": 1800
}
]
}

Get Processing Stats

GET /api/v1/sources/stats

Aggregate processing statistics across all sources.

curl http://localhost:8080/api/v1/sources/stats

Response 200 OK

{
"total_files": 25,
"by_status": {
"committed": 20,
"indexed": 3,
"error": 2
},
"total_chunks": 1042,
"total_entities": 850,
"total_relationships": 1200
}

Extraction Management

Trigger Extraction

POST /api/v1/sources/{source_id}/extraction

Trigger manual entity extraction for a source. The source must be in indexed or extracted status. Returns 202 Accepted while extraction runs in the background.

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/extraction \
-H "Content-Type: application/json" \
-d '{
"analysis_depth": "full",
"domain": "technical",
"force": false
}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| domain | string | No | null | Force extraction domain. Auto-detected if omitted. |
| filtering_mode | string | No | persisted | Override the source's persisted filtering_mode for this run only. |
| force | bool | No | false | Re-extract even if extraction results already exist |

The endpoint reuses the source's persisted upload settings (filtering_mode, enable_vision, content_filtering) by default. Pass them in the body to override per-call without changing the row.

Response 202 Accepted

{
"source_id": "src_abc123",
"job_id": "job_xyz789",
"status": "queued",
"message": "Extraction started"
}
| Status | Description |
|---|---|
| 400 | Source is not in an extractable state |
| 404 | Source not found |
| 409 | Extraction already in progress (use force=true to re-extract) |
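
Since omitted settings fall back to the values persisted on the source, a client should leave unset options out of the body instead of sending null. The sketch below shows one way to do that; build_extraction_body is a hypothetical helper, and how the server treats an explicit null is not documented here, which is why the helper drops such keys.

```python
from typing import Optional


def build_extraction_body(
    analysis_depth: str = "full",
    domain: Optional[str] = None,
    filtering_mode: Optional[str] = None,
    force: bool = False,
) -> dict:
    """Build the JSON body for POST /api/v1/sources/{source_id}/extraction."""
    body = {
        "analysis_depth": analysis_depth,
        "domain": domain,
        "filtering_mode": filtering_mode,
        "force": force,
    }
    # Drop unset options so the persisted upload settings are reused.
    return {key: value for key, value in body.items() if value is not None}
```

For example, build_extraction_body(filtering_mode="strict") overrides the filter strictness for one run while leaving the domain to auto-detection.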

Get Extraction Progress

GET /api/v1/sources/{source_id}/extraction

Returns detailed extraction progress including job status, chunk-level counts, and timing estimates.

curl http://localhost:8080/api/v1/sources/src_abc123/extraction
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK

{
"source_id": "src_abc123",
"job_id": "job_xyz789",
"status": "running",
"has_extraction_job": true,
"total_chunks": 10,
"completed_chunks": 6,
"failed_chunks": 0,
"progress_percent": 60.0,
"chunks_by_status": {
"completed": 6,
"running": 1,
"queued": 3
},
"total_entities": 52,
"total_relationships": 78,
"extraction_depth": "full",
"started_at": "2026-03-09T10:00:00",
"completed_at": null,
"timing": {
"avg_duration_ms": 4200,
"min_duration_ms": 2100,
"max_duration_ms": 6800
},
"current_chunk": {
"chunk_index": 6,
"status": "running",
"started_at": "2026-03-09T10:02:30"
}
}

When no extraction job exists:

{
"source_id": "src_abc123",
"status": "indexed",
"has_extraction_job": false,
"message": "No active extraction job for this source"
}
| Status | Description |
|---|---|
| 404 | Source not found |
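
The progress payload is enough for a rough client-side time estimate. The sketch below multiplies the remaining chunk count by timing.avg_duration_ms; estimate_remaining_ms is a hypothetical helper, and the estimate assumes chunks are processed one at a time, which the API does not state. With parallel workers the real figure would be lower.

```python
from typing import Optional


def estimate_remaining_ms(progress: dict) -> Optional[float]:
    """Rough remaining time in ms, or None when no estimate is possible."""
    if not progress.get("has_extraction_job"):
        return None
    avg_ms = (progress.get("timing") or {}).get("avg_duration_ms")
    if avg_ms is None:
        return None
    finished = progress["completed_chunks"] + progress["failed_chunks"]
    remaining = max(progress["total_chunks"] - finished, 0)
    # Assumption: sequential processing, so remaining time scales linearly.
    return remaining * avg_ms


# Values taken from the running-job sample above.
sample_progress = {
    "has_extraction_job": True,
    "total_chunks": 10,
    "completed_chunks": 6,
    "failed_chunks": 0,
    "timing": {"avg_duration_ms": 4200},
}
eta_ms = estimate_remaining_ms(sample_progress)  # 4 chunks * 4200 ms
```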

Cancel Extraction

DELETE /api/v1/sources/{source_id}/extraction

Cancels all pending and queued extraction chunks. Already running or completed chunks are not affected. Source status reverts to indexed (RAG search still works).

curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/extraction
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 204 No Content

| Status | Description |
|---|---|
| 404 | Source not found or no active extraction job |

Reclassify Source Domain

POST /api/v1/sources/{source_id}/reclassify

Change the extraction domain for a source and queue a new extraction pass. Returns 202 Accepted while the new extraction runs in the background.

For sources that are already committed, this endpoint atomically resets prior graph artifacts (nodes, edges, templates) before dispatching so the new extraction starts clean.

When to use: When auto-detection chose the wrong domain, or when you want to re-run extraction under a different domain template after reviewing the initial results. Prefer this over setting domain at upload time — reclassify decouples domain selection from the upload flow.

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/reclassify \
-H "Content-Type: application/json" \
-d '{"domain": "medical"}'
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| domain | string | Yes | Domain name to use (e.g. medical, legal). See GET /api/v1/sources/domains. |

Response 202 Accepted

{
"source_id": "src_abc123",
"status": "extracting"
}
| Status | Description |
|---|---|
| 400 | Source is not in a reclassifiable state (indexed or committed required) |
| 404 | Source not found |
| 503 | No LLM provider configured |

List Extraction Tasks

GET /api/v1/sources/{source_id}/extraction/tasks

Paginated list of individual chunk extraction tasks (LLM processing groups). Useful for debugging and analytics.

curl "http://localhost:8080/api/v1/sources/src_abc123/extraction/tasks?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| include_content | bool | No | false | Include full input_text and llm_response_json (large payloads) |

Response 200 OK -- ExtractionTaskListResponse

{
"tasks": [
{
"id": "task_001",
"job_id": "job_xyz789",
"chunk_index": 0,
"hierarchical_group_id": "group_a",
"small_chunk_ids": ["chunk_001", "chunk_002"],
"status": "completed",
"created_at": "2026-03-09T10:00:00",
"queued_at": "2026-03-09T10:00:01",
"started_at": "2026-03-09T10:00:05",
"completed_at": "2026-03-09T10:00:09",
"llm_duration_ms": 3800,
"retry_count": 0,
"entity_count": 8,
"relationship_count": 12,
"invalid_relationship_count": 1,
"small_chunk_numbers": [1, 2],
"input_text_length": 3200,
"llm_response_length": 1800,
"input_tokens": 1100,
"output_tokens": 620,
"context_window_available": 128000,
"input_text": null,
"llm_response_json": null,
"filtering_log": null,
"error_message": null,
"error_type": null
}
],
"total": 10,
"page": 1,
"page_size": 20
}
Content fields

Set include_content=true to populate input_text and llm_response_json. By default only their lengths are returned for performance.


Get Extraction Task

GET /api/v1/sources/{source_id}/extraction/tasks/{task_id}

Returns a single extraction task with full details, including content fields.

curl http://localhost:8080/api/v1/sources/src_abc123/extraction/tasks/task_001
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| task_id | string (path) | Yes | Extraction task ID |

Response 200 OK -- ExtractionTaskResponse

Same shape as items in the task list, but with input_text, llm_response_json, and filtering_log fully populated.

| Status | Description |
|---|---|
| 404 | Extraction task not found |

Get Extraction Task Stats

GET /api/v1/sources/{source_id}/extraction/stats

Aggregate statistics (min/avg/max) for extraction tasks, computed via SQL aggregates without loading every row.

curl http://localhost:8080/api/v1/sources/src_abc123/extraction/stats
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK -- ExtractionTaskStatsResponse

{
"total_tasks": 10,
"context_window": 128000,
"min_input_tokens": 800,
"max_input_tokens": 2400,
"avg_input_tokens": 1500,
"min_output_tokens": 300,
"max_output_tokens": 900,
"avg_output_tokens": 600,
"min_total_tokens": 1100,
"max_total_tokens": 3300,
"avg_total_tokens": 2100,
"min_utilization": 0.86,
"max_utilization": 2.58,
"avg_utilization": 1.64,
"min_duration_ms": 2100,
"max_duration_ms": 6800,
"avg_duration_ms": 4200,
"total_entities": 85,
"avg_entities_per_task": 8.5,
"total_relationships": 120,
"avg_relationships_per_task": 12.0,
"total_retries": 2,
"max_retries_single_task": 1,
"total_invalid_relationships": 5,
"avg_invalid_per_task": 0.5,
"total_entities_filtered": 3,
"total_relationships_filtered": 7,
"filtering_stage_summary": [
{ "stage": "exact_dedup", "total_removed": 2, "chunk_count": 2 },
{ "stage": "relationship_dedup", "total_removed": 5, "chunk_count": 3 }
],
"system_prompt": "You are an entity extraction assistant...",
"extraction_rules_template": "...",
"entity_templates": "...",
"relationship_templates": "...",
"domain_guidance": "...",
"domain_examples": "..."
}
| Status | Description |
|---|---|
| 404 | No extraction statistics available for this source |
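
The utilization fields in the sample are consistent with total tokens expressed as a percentage of context_window. That reading is inferred from the sample numbers, not confirmed by the API, and utilization_percent below is a hypothetical helper that reproduces them under that assumption.

```python
def utilization_percent(total_tokens: int, context_window: int) -> float:
    """Share of the context window used by one task, as a percentage."""
    # Assumption: utilization = 100 * total_tokens / context_window.
    return round(100 * total_tokens / context_window, 2)


# Token totals and context window from the sample response above.
min_util = utilization_percent(1100, 128000)
max_util = utilization_percent(3300, 128000)
avg_util = utilization_percent(2100, 128000)
```

Under this reading, values near 100 would indicate tasks that are close to exhausting the model's context window.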

Get Extraction Chart Data

GET /api/v1/sources/{source_id}/extraction/charts

Returns all extraction tasks with minimal fields for UI chart rendering. No pagination -- returns all tasks at once for efficient charting.

curl http://localhost:8080/api/v1/sources/src_abc123/extraction/charts
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK

[
{
"chunk_index": 0,
"status": "completed",
"retry_count": 0,
"entity_count": 8,
"relationship_count": 12,
"input_text_length": 3200,
"llm_duration_ms": 3800
},
{
"chunk_index": 1,
"status": "completed",
"retry_count": 1,
"entity_count": 6,
"relationship_count": 9,
"input_text_length": 2800,
"llm_duration_ms": 5200
}
]

Get Cross-Chunk Filtering Log

GET /api/v1/sources/{source_id}/extraction/filteringlog

Returns the cross-chunk deduplication filtering log from the post-extraction merging stage. Shows entities and relationships removed during structural filtering, exact/semantic deduplication, and relationship deduplication.

curl http://localhost:8080/api/v1/sources/src_abc123/extraction/filteringlog
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK

{
"stages": [
{
"stage": "exact_dedup",
"removed_count": 3,
"details": ["Entity 'Python' duplicate removed", "..."]
}
],
"total_removed": 5
}
| Status | Description |
|---|---|
| 404 | Source not found or no filtering log available |

Retry Errored Source

POST /api/v1/sources/{source_id}/retry

Manually retry a source that is in error status. The retry target is determined by error_stage on the source record — the service routes the source back to the appropriate pipeline stage (indexing, extraction, or commit).

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/retry
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK -- SourceResponse

Returns the updated source record after the retry has been queued.

| Status | Description |
|---|---|
| 404 | Source not found |
| 409 | Source is not in error status or error_stage is unknown |

Re-extract Source

POST /api/v1/sources/{source_id}/re_extract

Manually re-run entity extraction on a source (distinct from retry). The key difference:

  • Retry preserves the cached extraction payload and re-runs only the failed stage (cheap — no additional LLM tokens for commit-only retries).
  • Re-extract discards the cached payload and any previous extraction results, resets the source to indexed, and re-runs the full LLM extraction (expensive — costs LLM tokens).

Use this when you want to re-analyze a document after changing the extraction domain, fixing domain-specific rules, or correcting the initial extraction output.

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/re_extract
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 202 Accepted -- SourceResponse

Returns the updated source record after the re-extraction job has been queued.

Allowed source states:

  • committed — Atomically deletes graph artifacts (nodes, edges, templates), resets to indexed, and re-extracts.
  • error (after post-INDEXING stage) — Resets to indexed, clears cached payload, and re-extracts.
  • indexed / extracted / extracting / mcp_extracting / committing — Forcibly resets to indexed, clears payload, and re-extracts.

Rejected source states:

  • pending / indexing — Returns 422; the source has not yet produced extraction artifacts. Wait for indexing to complete, then retry.
Persisted settings carry over

Re-extract reuses the source's persisted upload settings (auto_analyze, enable_normalization, enable_vision, content_filtering, filtering_mode) by default — what you uploaded with is what you re-extract with. Clients can override any of these per call by passing them in the request body.

force_re_extract also resets every quality counter on the source row back to zero and clears vector_indexing_status to pending, so the new run starts with a clean counter set.

| Status | Description |
|---|---|
| 404 | Source not found |
| 422 | Source is in pending or indexing state (not yet indexed) |
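
The allowed and rejected states can be checked client-side before calling the endpoint. The function below is a sketch of those rules; can_re_extract is a hypothetical helper, the error_stage value "indexing" is an assumed spelling, and treating an error at the indexing stage as not re-extractable is an inference from the state list rather than documented behavior.

```python
from typing import Optional

# States listed above as accepted regardless of error_stage.
ALWAYS_ALLOWED = {
    "committed", "indexed", "extracted", "extracting", "mcp_extracting", "committing",
}
# States rejected with 422 because indexing has not finished.
REJECTED = {"pending", "indexing"}


def can_re_extract(status: str, error_stage: Optional[str] = None) -> bool:
    """Predict whether POST .../re_extract is likely to be accepted."""
    if status in REJECTED:
        return False
    if status in ALWAYS_ALLOWED:
        return True
    if status == "error":
        # Assumption: error_stage is "indexing" when indexing itself failed.
        return error_stage is not None and error_stage != "indexing"
    raise ValueError(f"unknown status: {status!r}")
```

The server remains the source of truth; a check like this only avoids requests that are expected to fail.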

List Source Recovery Events

GET /api/v1/sources/{source_id}/recovery_events

Returns the recovery audit trail for a source — every automatic recovery attempt, what was dispatched, and when. Backs the source detail page's recovery panel so operators can diagnose repeated failures without grepping container logs. Events are returned newest first.

curl "http://localhost:8080/api/v1/sources/src_abc123/recovery_events?limit=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| limit | int (query) | No | 50 | Maximum events to return (1--200) |

Note

Recovery events use a ?limit= cap (max 200) rather than the standard ?page=&page_size= pagination model, because events are an audit trail, not a paged resource collection.

Response 200 OK

{
"events": [
{
"id": "rev_001",
"source_id": "src_abc123",
"event_type": "recovery",
"created_at": "2026-03-09T10:05:00",
"reason": "Stalled extraction detected",
"dispatched_operation": "OP_EXTRACT_SOURCE"
}
]
}
| Status | Description |
|---|---|
| 404 | Source not found |
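
Because limit is bounded to 1--200, a client may want to clamp the value before building the URL. The sketch below does so with the standard library; recovery_events_url is a hypothetical helper, and the server's response to an out-of-range limit is not documented here, so clamping is a defensive choice rather than a requirement.

```python
from urllib.parse import quote, urlencode


def recovery_events_url(base_url: str, source_id: str, limit: int = 50) -> str:
    """Build the recovery_events URL with limit clamped to the documented 1-200 range."""
    clamped = max(1, min(limit, 200))
    path = f"/api/v1/sources/{quote(source_id, safe='')}/recovery_events"
    return f"{base_url}{path}?{urlencode({'limit': clamped})}"


events_url = recovery_events_url("http://localhost:8080", "src_abc123", limit=500)
```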

Cleanup Orphan Chunk Tasks

POST /api/v1/sources/cleanup/orphan_tasks

Triggers an immediate sweep of orphaned chunk tasks — tasks whose parent extraction job completed or failed but whose rows were not updated. Normally run automatically on a schedule; use this endpoint to trigger it on demand after bulk operations or during recovery.

curl -X POST http://localhost:8080/api/v1/sources/cleanup/orphan_tasks

Response 200 OK

{
"deleted_count": 12,
"retention_days": 7
}
| Field | Type | Description |
|---|---|---|
| deleted_count | int | Number of orphaned task rows removed |
| retention_days | int | Configured orphan retention window (from SourceRecoverySettings) |

Abort All Processing

DELETE /api/v1/sources/{source_id}/processing

Cancels all queued/running tasks (indexing or extraction) and resets the source status appropriately.

curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/processing
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 204 No Content

| Status | Description |
|---|---|
| 400 | Source is not in a processing state |
| 404 | Source not found |

Status after abort
  • pending / indexing -> error (with message "Processing/Indexing aborted by user")
  • extracting -> indexed (RAG still usable)
  • committing -> extracted
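
The transitions above amount to a small lookup table. The sketch below encodes them so a client can predict the resulting status; STATUS_AFTER_ABORT and status_after_abort are illustrative names, and statuses outside the table are assumed to be the ones that produce the 400 response.

```python
from typing import Optional

# Resulting status for each abortable state, as listed above.
STATUS_AFTER_ABORT = {
    "pending": "error",
    "indexing": "error",
    "extracting": "indexed",
    "committing": "extracted",
}


def status_after_abort(current_status: str) -> Optional[str]:
    """Return the expected status after an abort, or None if not abortable."""
    return STATUS_AFTER_ABORT.get(current_status)
```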

Chunks

List Chunks

GET /api/v1/sources/{source_id}/chunks

Paginated list of document chunks for a source.

curl "http://localhost:8080/api/v1/sources/src_abc123/chunks?page=1&page_size=20"
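| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| status | string | No | null | Filter by chunk status |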

Response 200 OK -- ChunkListResponse

{
"chunks": [
{
"id": "chunk_001",
"source_id": "src_abc123",
"chunk_index": 0,
"content": "This is the first chunk of the document...",
"page_number": 1,
"section": "Introduction",
"group_index": 0,
"status": "indexed",
"created_at": "2026-03-09T10:00:05"
}
],
"total": 42,
"page": 1,
"page_size": 20
}

Get Chunk

GET /api/v1/sources/{source_id}/chunks/{chunk_id}

Returns a single chunk by ID.

curl http://localhost:8080/api/v1/sources/src_abc123/chunks/chunk_001
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| chunk_id | string (path) | Yes | Chunk ID |

Response 200 OK -- ChunkResponse

{
"id": "chunk_001",
"source_id": "src_abc123",
"chunk_index": 0,
"content": "This is the first chunk of the document...",
"page_number": 1,
"section": "Introduction",
"group_index": 0,
"status": "indexed",
"created_at": "2026-03-09T10:00:05"
}
| Status | Description |
|---|---|
| 404 | Chunk not found or does not belong to this source |

Citations

List Citations

GET /api/v1/sources/{source_id}/citations

Paginated list of entity citations (attributions) for a source. Each citation links an extracted entity back to the source chunk it was found in.

curl "http://localhost:8080/api/v1/sources/src_abc123/citations?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |

Response 200 OK -- CitationListResponse

{
"citations": [
{
"id": "cit_001",
"entity_uri": "urn:chaoscypher:node:abc123",
"entity_label": "Python",
"entity_type": "Programming Language",
"source_id": "src_abc123",
"chunk_id": "chunk_001",
"confidence": 0.95,
"extraction_method": "llm",
"context_snippet": "...Python is a versatile programming language...",
"created_at": "2026-03-09T10:05:00"
}
],
"total": 85,
"page": 1,
"page_size": 20
}

Source Data Access

Get Source Stats

GET /api/v1/sources/{source_id}/stats

Returns computed statistics for a single source.

curl http://localhost:8080/api/v1/sources/src_abc123/stats
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK

{
"chunk_count": 42,
"citation_count": 85,
"entity_count": 65,
"relationship_count": 120,
"total_content_length": 52000,
"avg_chunk_length": 1238
}
| Status | Description |
|---|---|
| 404 | Source not found |

Get Source Entities

GET /api/v1/sources/{source_id}/entities

Paginated list of entities extracted from the document. Each entity includes a computed quality_score (0-100).

curl "http://localhost:8080/api/v1/sources/src_abc123/entities?page=1&page_size=20&sort_by=quality&sort_order=desc"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| sort_by | string | No | default | Sort field: default, quality, confidence, name, type |
| sort_order | string | No | desc | Sort direction: asc or desc |

Response 200 OK

{
"entities": [
{
"name": "Python",
"type": "ProgrammingLanguage",
"confidence": 0.95,
"description": "A versatile programming language",
"source_chunks": [0, 3, 7],
"quality_score": 92.5
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 85,
"total_pages": 5,
"has_next": true,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |

Get Source Relationships

GET /api/v1/sources/{source_id}/relationships

Paginated list of relationships extracted from the document. Each relationship is enriched with human-readable from and to entity names.

curl "http://localhost:8080/api/v1/sources/src_abc123/relationships?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |

Response 200 OK

{
"relationships": [
{
"source": 0,
"target": 5,
"type": "USES",
"confidence": 0.88,
"from": "FastAPI",
"to": "Python"
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 120,
"total_pages": 6,
"has_next": true,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |

Get Source Templates

GET /api/v1/sources/{source_id}/templates

Paginated list of graph templates created from extraction of this source.

curl "http://localhost:8080/api/v1/sources/src_abc123/templates?page=1&page_size=20&template_type=node"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| template_type | string | No | null | Filter by type: node or edge |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |

Response 200 OK

{
"templates": [
{
"id": "template_abc123",
"name": "ProgrammingLanguage",
"type": "node",
"source_id": "src_abc123",
"properties": ["name", "paradigm", "version"]
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 12,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |

Get Source LLM Metrics

GET /api/v1/sources/{source_id}/llm_metrics

Summary of LLM usage metrics for a source, including call counts, token consumption, cost estimates, and derived rates.

curl http://localhost:8080/api/v1/sources/src_abc123/llm_metrics
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |

Response 200 OK

{
"source_id": "src_abc123",
"has_metrics": true,
"summary": {
"total_calls": 6,
"successful_calls": 6,
"failed_calls": 0,
"retry_calls": 1,
"first_try_successes": 5,
"retry_successes": 1,
"permanent_failures": 0,
"total_input_tokens": 24000,
"total_output_tokens": 8500,
"wasted_tokens": 400,
"avg_call_duration_ms": 4200,
"total_duration_ms": 25200,
"estimated_cost_usd": 0.0325,
"error_counts": {},
"model": "gpt-4o",
"success_rate": 1.0,
"retry_rate": 0.167,
"waste_percentage": 0.012
}
}
| Status | Description |
|---|---|
| 404 | Source not found |
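
The derived rates in the sample can be reproduced from the raw counters. The formulas below are inferred from the sample values (for instance, 1 retry out of 6 calls gives 0.167) and are not confirmed as the server's own definitions; derived_rates is a hypothetical helper.

```python
def derived_rates(summary: dict) -> dict:
    """Recompute success_rate, retry_rate, and waste_percentage from raw counters."""
    calls = summary["total_calls"]
    tokens = summary["total_input_tokens"] + summary["total_output_tokens"]
    return {
        # Assumption: rates are fractions of total calls, rounded to 3 places.
        "success_rate": round(summary["successful_calls"] / calls, 3) if calls else 0.0,
        "retry_rate": round(summary["retry_calls"] / calls, 3) if calls else 0.0,
        # Assumption: waste is wasted tokens over all input plus output tokens.
        "waste_percentage": round(summary["wasted_tokens"] / tokens, 3) if tokens else 0.0,
    }


# Counters from the sample summary above.
rates = derived_rates({
    "total_calls": 6,
    "successful_calls": 6,
    "retry_calls": 1,
    "total_input_tokens": 24000,
    "total_output_tokens": 8500,
    "wasted_tokens": 400,
})
```

Note that, despite its name, waste_percentage in the sample is a fraction (0.012), not a value on a 0-100 scale.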

List Source LLM Calls

GET /api/v1/sources/{source_id}/llm_metrics/calls

Paginated list of individual LLM API calls made during extraction of this source.

curl "http://localhost:8080/api/v1/sources/src_abc123/llm_metrics/calls?page=1&page_size=20&success=true"
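| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| success | bool | No | null | Filter by success status |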

Response 200 OK

{
"calls": [
{
"id": "call_001",
"source_id": "src_abc123",
"success": true,
"input_tokens": 1100,
"output_tokens": 620,
"duration_ms": 3800,
"model": "gpt-4o",
"created_at": "2026-03-09T10:00:05"
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 6,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |

Tags

Tag endpoints are mounted at /api/v1/sources/tags for tag CRUD, and nested under individual sources for tag assignment.

List All Tags

GET /api/v1/sources/tags

Returns all tags in the current database.

curl http://localhost:8080/api/v1/sources/tags

Response 200 OK -- list[TagResponse]

[
{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
},
{
"id": "tag_002",
"database_name": "default",
"name": "Technical",
"color": "#66bb6a",
"description": null,
"created_at": "2026-03-02T09:00:00"
}
]

Get Tag

GET /api/v1/sources/tags/{tag_id}

Returns a single tag by ID.

curl http://localhost:8080/api/v1/sources/tags/tag_001
| Parameter | Type | Required | Description |
|---|---|---|---|
| tag_id | string (path) | Yes | Tag ID |

Response 200 OK -- TagResponse

{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
}
| Status | Description |
|---|---|
| 404 | Tag not found |

Create Tag

POST /api/v1/sources/tags

Create a new tag.

curl -X POST http://localhost:8080/api/v1/sources/tags \
-H "Content-Type: application/json" \
-d '{
"name": "Research",
"color": "#4dabf5",
"description": "Research papers"
}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | -- | Tag display name |
| color | string | No | null | Hex color code (e.g. #4dabf5) |
| description | string | No | null | Tag description |

Response 201 Created -- TagResponse

| Status | Description |
|---|---|
| 400 | Duplicate tag name or validation error |

Update Tag

PATCH /api/v1/sources/tags/{tag_id}

Update tag properties. All fields are optional.

curl -X PATCH http://localhost:8080/api/v1/sources/tags/tag_001 \
-H "Content-Type: application/json" \
-d '{ "name": "Updated Name", "color": "#ff5722" }'
| Parameter | Type | Required | Description |
|---|---|---|---|
| tag_id | string (path) | Yes | Tag ID |
| name | string | No | Updated tag name |
| color | string | No | Updated hex color |
| description | string | No | Updated description |

Response 200 OK -- TagResponse

| Status | Description |
| --- | --- |
| 400 | Duplicate tag name or validation error |
| 404 | Tag not found |

Delete Tag

DELETE /api/v1/sources/tags/{tag_id}

Delete a tag along with all of its source-tag associations.

curl -X DELETE http://localhost:8080/api/v1/sources/tags/tag_001

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| tag_id | string (path) | Yes | Tag ID |
Response 204 No Content

| Status | Description |
| --- | --- |
| 404 | Tag not found |

Source Tag Assignment

List Tags for Source

GET /api/v1/sources/{source_id}/tags

Returns all tags assigned to a specific source.

curl http://localhost:8080/api/v1/sources/src_abc123/tags

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_id | string (path) | Yes | Source ID |

Response 200 OK -- list[TagResponse]

[
{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
}
]

Assign Tag to Source

POST /api/v1/sources/{source_id}/tags/{tag_id}

Assign a tag to a source.

curl -X POST http://localhost:8080/api/v1/sources/src_abc123/tags/tag_001

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_id | string (path) | Yes | Source ID |
| tag_id | string (path) | Yes | Tag ID |

Response 204 No Content

| Status | Description |
| --- | --- |
| 404 | Source or tag not found |
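
The assign and remove endpoints share one path shape, so a client can build it in one place. This is a sketch under stated assumptions: the helper names are hypothetical, and percent-encoding the IDs is a defensive client-side choice, not something the API is documented to require.

```python
from urllib.parse import quote

API_PREFIX = "/api/v1/sources"


def tag_assignment_path(source_id: str, tag_id: str) -> str:
    """Path used by both the assign (POST) and remove (DELETE) endpoints."""
    # safe="" also encodes "/", so an ID cannot introduce extra path segments.
    return f"{API_PREFIX}/{quote(source_id, safe='')}/tags/{quote(tag_id, safe='')}"


def assignment_succeeded(status: int) -> bool:
    """Map the documented status codes to a boolean result."""
    if status == 204:  # No Content: the tag was assigned or removed
        return True
    if status == 404:  # source, tag, or assignment not found
        return False
    raise RuntimeError(f"unexpected status {status}")


print(tag_assignment_path("src_abc123", "tag_001"))
```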

Remove Tag from Source

DELETE /api/v1/sources/{source_id}/tags/{tag_id}

Remove a tag from a source.

curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/tags/tag_001

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_id | string (path) | Yes | Source ID |
| tag_id | string (path) | Yes | Tag ID |

Response 204 No Content

| Status | Description |
| --- | --- |
| 404 | Tag assignment not found |

Page Images

Rendered page images are generated for PDF sources when vision processing is enabled. Images are stored per-source under data/databases/{db_name}/images/{source_id}/ and served directly by the API.

List Source Images

GET /api/v1/sources/{source_id}/images

Returns the rendered page images available for a source document, or an empty list if none have been generated.

curl http://localhost:8080/api/v1/sources/src_abc123/images

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_id | string (path) | Yes | Source ID |

Response 200 OK -- list[object]

[
{
"filename": "page_1.png",
"url": "/sources/src_abc123/images/page_1.png"
},
{
"filename": "page_2.png",
"url": "/sources/src_abc123/images/page_2.png"
}
]

Images are sorted by filename. The url field is the path to pass to the Get Source Image endpoint.
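
The example url values start at /sources, while the Get Source Image endpoint lives under /api/v1/sources. The sketch below assumes the url field is relative to the /api/v1 prefix; that is an inference from the example, not a documented guarantee. It also re-sorts by page number on the client, because a plain filename sort (if that is what the server uses) would place page_10.png before page_2.png. `BASE_URL`, `API_ROOT`, and both helper names are hypothetical.

```python
import re

BASE_URL = "http://localhost:8080"  # assumption: host from the curl examples
API_ROOT = "/api/v1"                # assumption: url values are relative to this prefix


def page_number(filename: str) -> int:
    """Extract the first run of digits from a filename such as page_12.png."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else 0


def image_download_urls(images: list) -> list:
    """Return absolute download URLs in numeric page order."""
    ordered = sorted(images, key=lambda item: page_number(item["filename"]))
    return [BASE_URL + API_ROOT + item["url"] for item in ordered]


# Made-up listing for illustration; page_10.png is not in the example response.
listing = [
    {"filename": "page_10.png", "url": "/sources/src_abc123/images/page_10.png"},
    {"filename": "page_2.png", "url": "/sources/src_abc123/images/page_2.png"},
]
urls = image_download_urls(listing)
```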


Get Source Image

GET /api/v1/sources/{source_id}/images/{filename}

Serve a specific rendered page image. Returns the image as image/png. Path traversal is prevented -- the resolved path must remain within the source image directory.

curl http://localhost:8080/api/v1/sources/src_abc123/images/page_1.png \
--output page_1.png

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| source_id | string (path) | Yes | Source ID |
| filename | string (path) | Yes | Image filename (e.g. page_1.png) |

Response 200 OK -- PNG image binary (Content-Type: image/png)

| Status | Description |
| --- | --- |
| 403 | Path traversal attempt detected |
| 404 | Image file not found |
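
The containment rule described above (the resolved path must stay inside the source image directory) can be illustrated with `pathlib`. This is a generic sketch of the technique, not the server's actual implementation; the function name and the demo directory are assumptions.

```python
from pathlib import Path


def resolve_image_path(image_dir: Path, filename: str) -> Path:
    """Resolve filename inside image_dir, rejecting anything that escapes it."""
    root = image_dir.resolve()
    candidate = (root / filename).resolve()
    # After resolving ".." segments and symlinks, root must be an ancestor.
    if root not in candidate.parents:
        raise PermissionError("path traversal attempt detected")  # would map to 403
    return candidate


# Hypothetical directory following the layout described under Page Images.
demo_dir = Path("data/databases/default/images/src_abc123")
safe = resolve_image_path(demo_dir, "page_1.png")

try:
    resolve_image_path(demo_dir, "../../secrets.txt")
    blocked = False
except PermissionError:
    blocked = True
```

Resolving before comparing is the important step: a string prefix check on the unresolved path would accept inputs such as `page_1.png/../../secrets.txt`.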

Response Models

SourceResponse

Full source detail model returned by the get, create, and update endpoints. It contains the lifecycle fields for the indexing, extraction, and commit stages, along with LLM metrics.

| Field | Type | Description |
| --- | --- | --- |
| id | string | Source ID |
| database_name | string | Database this source belongs to |
| filename | string | Original filename |
| filepath | string? | Storage path |
| file_type | string? | File type (e.g. pdf, csv) |
| file_size | int? | File size in bytes |
| title | string? | Display title |
| source_type | string? | Source type (e.g. pdf, webpage) |
| origin_url | string? | Original URL for web imports |
| version | int | Version number (default 1) |
| parent_id | string? | Parent source ID |
| status | string | Lifecycle status: pending, indexing, indexed, extracting, extracted, committing, committed, or error |
| enabled | bool | Whether the source is active |
| error_message | string? | Error description if status is error |
| error_stage | string? | Stage where error occurred |
| chunk_count | int | Number of chunks created |
| total_content_length | int | Total character count of all chunks |
| embedding_model | string? | Embedding model used |
| embedding_dimensions | int? | Embedding vector dimensions |
| indexing_started_at | datetime? | When indexing started |
| indexing_completed_at | datetime? | When indexing finished |
| indexing_duration_seconds | float? | Calculated indexing duration |
| extraction_depth | string? | Extraction depth (full or quick) |
| extraction_entities_count | int | Entities extracted |
| extraction_relationships_count | int | Relationships extracted |
| extraction_domain | string? | Domain used for extraction |
| extraction_domain_auto | bool | Whether domain was auto-detected |
| extraction_started_at | datetime? | When extraction started |
| extraction_completed_at | datetime? | When extraction finished |
| extraction_duration_seconds | float? | Calculated extraction duration |
| current_extraction_job_id | string? | Active extraction job ID |
| commit_started_at | datetime? | When commit started |
| commit_completed_at | datetime? | When commit finished |
| commit_duration_seconds | float? | Calculated commit duration |
| commit_nodes_created | int | Graph nodes committed |
| commit_edges_created | int | Graph edges committed |
| commit_templates_created | int | Templates created |
| current_step | int? | Current processing step number |
| total_steps | int? | Total processing steps |
| step_description | string? | Current step label |
| llm_total_calls | int | Total LLM API calls |
| llm_successful_calls | int | Successful LLM calls |
| llm_failed_calls | int | Failed LLM calls |
| llm_retry_calls | int | Retried LLM calls |
| llm_first_try_successes | int | Calls that succeeded on first attempt |
| llm_retry_successes | int | Calls that succeeded after retry |
| llm_permanent_failures | int | Calls that permanently failed |
| llm_total_input_tokens | int | Total input tokens consumed |
| llm_total_output_tokens | int | Total output tokens generated |
| llm_wasted_tokens | int | Tokens wasted on failed calls |
| llm_avg_call_duration_ms | int? | Average call duration in ms |
| llm_total_duration_ms | int | Total LLM call duration in ms |
| llm_estimated_cost_usd | float? | Estimated cost in USD |
| llm_error_counts | object? | Error type breakdown |
| llm_model | string? | LLM model used |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last update timestamp |
| user_metadata | object? | User-defined metadata |
| upload_options | object | Persisted upload settings; see UploadOptions |
| quality_metrics | object | Per-stage quality counters and loader/search status; see QualityMetrics. Full reference: Quality Metrics API. |
| vector_indexing_status | string | One of pending, indexed, degraded, failed. Mirrored at the top level for convenience; also lives inside quality_metrics. See Search Status. |
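
A client polling a source can turn the status, step, and LLM counter fields into a short progress line. The sketch below is illustrative only: the helper names are hypothetical, and treating llm_successful_calls / llm_total_calls as a success rate is an assumption about how the two counters relate.

```python
from typing import Optional


def describe_progress(source: dict) -> str:
    """Summarize a SourceResponse-shaped dict as a one-line status."""
    status = source["status"]
    if status == "error":
        stage = source.get("error_stage") or "unknown stage"
        return f"error during {stage}: {source.get('error_message')}"
    step, total = source.get("current_step"), source.get("total_steps")
    if step is not None and total:
        return f"{status} (step {step}/{total})"
    return status


def llm_success_rate(source: dict) -> Optional[float]:
    """Fraction of LLM calls that succeeded, or None when no calls were made."""
    calls = source.get("llm_total_calls", 0)
    if calls == 0:
        return None
    return source.get("llm_successful_calls", 0) / calls


print(describe_progress({"status": "extracting", "current_step": 2, "total_steps": 5}))
```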

UploadOptions

The settings supplied at upload time (by the user or by default) are persisted on the source row, so recovery, retry, and re-extract honor them without the client having to re-pass them.

| Field | Type | Description |
| --- | --- | --- |
| auto_analyze | bool | Auto-queue extraction after indexing finishes |
| enable_normalization | bool? | null = use file-type default; true/false = user override |
| enable_vision | bool | Use the vision model on images and scanned PDFs |
| content_filtering | bool | Apply domain content-exclusion rules during extraction |
| filtering_mode | string | unfiltered / minimal / lenient / balanced / strict / maximum |
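
The "persisted by default, override on demand" behavior can be pictured as a dictionary merge. This is a sketch of the idea rather than the server's logic: `effective_options` and its `file_type_default` argument are hypothetical, and the real file-type defaults are not listed in this section.

```python
from typing import Optional


def effective_options(persisted: dict, overrides: Optional[dict] = None,
                      file_type_default: bool = True) -> dict:
    """Overlay client overrides on the persisted upload options."""
    merged = {**persisted, **(overrides or {})}
    # enable_normalization = null means "use the file-type default".
    if merged.get("enable_normalization") is None:
        merged["enable_normalization"] = file_type_default
    return merged


persisted = {
    "auto_analyze": True,
    "enable_normalization": None,
    "enable_vision": False,
    "content_filtering": True,
    "filtering_mode": "balanced",
}
retry_options = effective_options(persisted, {"filtering_mode": "strict"})
```

The merge builds a new dictionary, so the persisted values stay untouched and a later retry without overrides sees the original settings again.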

QualityMetrics

Fifteen quality counters plus three companion fields (loader_encoding_used, vector_indexed_at, and vector_indexing_status). Counters reset to zero on Re-extract (force_re_extract); the quality grade itself is not affected.

| Field | Type | Description |
| --- | --- | --- |
| loader_encoding_used | string? | Encoding the loader actually used (utf-8, cp1252, latin-1-fallback, etc.). null until the loader runs. |
| loader_warnings_count | int | Non-fatal loader hiccups (e.g. a single bad JSONL line) |
| loader_files_skipped | int | Archive entries skipped (unsupported / oversized / security violation) |
| cleaner_lines_removed | int | Lines dropped by the OCR cleaner (gibberish, page numbers, repeated headers) |
| cleaner_paragraphs_deduplicated | int | Duplicate paragraphs collapsed by the cleaner |
| cleaner_chars_removed | int | Net character delta from text cleaning (encoding fixes, control chars, whitespace) |
| chunks_filtered_count | int | Chunks the content-filter stripped to under 100 chars before extraction |
| llm_chunks_truncated | int | Chunks where the LLM hit its token cap |
| llm_chunks_aborted_by_loop | int | Chunks where the streaming loop detector aborted the LLM |
| parser_lines_dropped | int | Malformed E\|/R\|/P\| lines the parser couldn't make sense of |
| dedup_entities_merged | int | Entities collapsed into another entity by exact-name or semantic dedup |
| structural_entities_filtered | int | Entities representing document structure removed by the structural filter |
| orphan_entities_filtered | int | Entities with zero relationships dropped at commit time |
| relationships_dropped_invalid | int | Relationships pointing at non-existent entity indices |
| relationships_dropped_capped | int | Relationships dropped by the per-entity / same-source-type / total-ratio caps |
| citations_skipped_no_chunk_index | int | Citations skipped at commit because the underlying entity / relationship had no chunk index |
| vector_indexed_at | datetime? | When the vector indexing call succeeded; null if not yet |
| vector_indexing_status | string | pending, indexed, degraded, or failed |
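
Most counters are zero for a clean document, so a client may want to surface only the ones that fired. A minimal sketch, assuming the payload is a flat dictionary shaped like the table above; the helper name and the sample values are made up for illustration.

```python
COMPANION_FIELDS = {"loader_encoding_used", "vector_indexed_at", "vector_indexing_status"}


def nonzero_counters(quality_metrics: dict) -> dict:
    """Return only the integer counters with a value above zero."""
    return {
        name: value
        for name, value in quality_metrics.items()
        if name not in COMPANION_FIELDS and isinstance(value, int) and value > 0
    }


# Made-up sample payload, trimmed to a few fields.
sample = {
    "loader_encoding_used": "utf-8",
    "loader_warnings_count": 0,
    "cleaner_lines_removed": 12,
    "llm_chunks_truncated": 1,
    "vector_indexing_status": "indexed",
}
flagged = nonzero_counters(sample)
```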

SourceSummaryResponse

Lightweight model used in list views. Excludes large payload fields like user_metadata, detailed timestamps, and full LLM metrics. Includes a tags array enriched at the API layer.

Refer to the List Sources response example for the full shape.


PaginatedSourcesResponse

Pagination wrapper for source list responses.

| Field | Type | Description |
| --- | --- | --- |
| data | SourceSummaryResponse[] | Page of source summaries |
| pagination | object | Pagination metadata (total, page, page_size, total_pages, has_next, has_prev) |
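
The has_next flag is enough to walk every page of a list response. The generator below is a sketch: `fetch_page` is a hypothetical callable that performs the HTTP request and returns the parsed PaginatedSourcesResponse, and starting at page 1 is an assumption about the page numbering.

```python
from typing import Callable, Iterator


def iter_sources(fetch_page: Callable[[int, int], dict], page_size: int = 50) -> Iterator[dict]:
    """Yield every source summary, following pagination.has_next."""
    page = 1  # assumption: pages are 1-indexed
    while True:
        body = fetch_page(page, page_size)
        yield from body["data"]
        if not body["pagination"]["has_next"]:
            break
        page += 1


def fake_fetch(page: int, page_size: int) -> dict:
    """Stand-in for the real request, serving five made-up sources."""
    items = [{"id": f"src_{i}"} for i in range(5)]
    start = (page - 1) * page_size
    return {
        "data": items[start:start + page_size],
        "pagination": {"has_next": start + page_size < len(items)},
    }


ids = [source["id"] for source in iter_sources(fake_fetch, page_size=2)]
```

Injecting the fetch function keeps the paging logic independent of any particular HTTP library.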

ChunkResponse

| Field | Type | Description |
| --- | --- | --- |
| id | string | Chunk ID |
| source_id | string? | Parent source ID |
| chunk_index | int | Position in the document (0-indexed) |
| content | string | Chunk text content |
| page_number | int? | PDF page number |
| section | string? | Section heading |
| group_index | int? | Hierarchical group index |
| status | string | Chunk status |
| created_at | datetime | Creation timestamp |

ChunkListResponse

| Field | Type | Description |
| --- | --- | --- |
| chunks | ChunkResponse[] | Page of chunks |
| total | int | Total chunks for this source |
| page | int | Current page number |
| page_size | int | Items per page |
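
Unlike PaginatedSourcesResponse, this model carries no has_next or total_pages field, so a client derives them from total, page, and page_size. The arithmetic below is a sketch that assumes 1-indexed pages; the same calculation applies to the other list models with the same three fields.

```python
import math


def total_pages(total: int, page_size: int) -> int:
    """Number of pages needed to hold `total` items."""
    return math.ceil(total / page_size) if page_size > 0 else 0


def has_next_page(response: dict) -> bool:
    """True when items remain beyond the current page (pages assumed 1-indexed)."""
    return response["page"] * response["page_size"] < response["total"]


print(total_pages(101, 50))
```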

CitationResponse

| Field | Type | Description |
| --- | --- | --- |
| id | string | Citation ID |
| entity_uri | string | URI of the cited entity |
| entity_label | string | Human-readable entity name |
| entity_type | string? | Entity template name |
| source_id | string | Source this citation belongs to |
| chunk_id | string | Chunk where the entity was found |
| confidence | float | Extraction confidence (0.0 to 1.0) |
| extraction_method | string | How the entity was extracted (e.g. llm) |
| context_snippet | string? | Text snippet surrounding the entity mention |
| created_at | datetime | Creation timestamp |
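
Because each citation links one entity to one chunk, grouping by entity_uri shows where an entity is mentioned. A small sketch with a hypothetical helper name, a client-side confidence threshold (an assumption, not an API parameter), and made-up sample values:

```python
from collections import defaultdict


def chunks_by_entity(citations: list, min_confidence: float = 0.0) -> dict:
    """Map entity_uri to the chunk IDs that cite it, above a confidence floor."""
    grouped = defaultdict(list)
    for citation in citations:
        if citation["confidence"] >= min_confidence:
            grouped[citation["entity_uri"]].append(citation["chunk_id"])
    return dict(grouped)


# Made-up citations for illustration.
sample = [
    {"entity_uri": "urn:e:1", "chunk_id": "chk_1", "confidence": 0.9},
    {"entity_uri": "urn:e:1", "chunk_id": "chk_4", "confidence": 0.4},
    {"entity_uri": "urn:e:2", "chunk_id": "chk_2", "confidence": 0.8},
]
strong = chunks_by_entity(sample, min_confidence=0.5)
```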

CitationListResponse

| Field | Type | Description |
| --- | --- | --- |
| citations | CitationResponse[] | Page of citations |
| total | int | Total citations for this source |
| page | int | Current page number |
| page_size | int | Items per page |

ExtractionTaskResponse

| Field | Type | Description |
| --- | --- | --- |
| id | string | Task ID |
| job_id | string | Parent extraction job ID |
| chunk_index | int | Chunk group index |
| hierarchical_group_id | string? | Group identifier for hierarchical chunking |
| small_chunk_ids | string[]? | Individual chunk IDs in this group |
| status | string | pending, queued, running, completed, or failed |
| created_at | datetime | Creation timestamp |
| queued_at | datetime? | When queued for processing |
| started_at | datetime? | When LLM processing started |
| completed_at | datetime? | When processing finished |
| llm_duration_ms | int? | LLM call duration in ms |
| retry_count | int | Number of retries |
| entity_count | int | Entities extracted by this task |
| relationship_count | int | Relationships extracted |
| invalid_relationship_count | int | Invalid relationships filtered out |
| small_chunk_numbers | int[]? | 1-indexed chunk numbers for UI display |
| input_text_length | int? | Input text character count |
| llm_response_length | int? | LLM response character count |
| input_tokens | int? | Actual input token count from LLM API |
| output_tokens | int? | Actual output token count from LLM API |
| context_window_available | int? | Model context window size |
| input_text | string? | Full input text (only in detail view or when include_content=true) |
| llm_response_json | string? | Raw LLM JSON response (only in detail view or when include_content=true) |
| filtering_log | object? | Per-chunk filtering diagnostics (detail view only) |
| finish_reason | string? | Normalized provider finish reason: stop, length, content_filter, tool_calls, error, unknown. null for tasks that predate the field (migration 0022). |
| aborted_by_loop | bool? | true when the streaming loop detector aborted the LLM mid-response. Tasks predating migration 0022 carry null. |
| error_message | string? | Error message if failed |
| error_type | string? | Error classification |
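
The finish_reason and aborted_by_loop fields can be folded into one label per task when triaging an extraction run. This is a sketch: the label names are invented here, and reading finish_reason "length" as a truncated response follows the usual provider convention rather than anything stated in this section.

```python
def task_outcome(task: dict) -> str:
    """Classify an ExtractionTaskResponse-shaped dict for triage."""
    if task.get("aborted_by_loop"):
        return "aborted_by_loop"
    reason = task.get("finish_reason")
    if reason is None:
        return "not_recorded"  # e.g. tasks that predate migration 0022
    if reason == "length":
        return "truncated"     # assumption: the model hit its token cap
    if reason == "stop":
        return "complete"
    return reason              # content_filter, tool_calls, error, unknown


print(task_outcome({"finish_reason": "length", "aborted_by_loop": False}))
```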

ExtractionTaskListResponse

| Field | Type | Description |
| --- | --- | --- |
| tasks | ExtractionTaskResponse[] | Page of extraction tasks |
| total | int | Total tasks for this source |
| page | int | Current page number |
| page_size | int | Items per page |

ExtractionTaskStatsResponse

Statistics computed with SQL aggregate functions across all extraction tasks.

| Field | Type | Description |
| --- | --- | --- |
| total_tasks | int | Total extraction tasks |
| context_window | int? | LLM context window size |
| min_input_tokens | int? | Minimum input tokens across tasks |
| max_input_tokens | int? | Maximum input tokens |
| avg_input_tokens | int? | Average input tokens |
| min_output_tokens | int? | Minimum output tokens |
| max_output_tokens | int? | Maximum output tokens |
| avg_output_tokens | int? | Average output tokens |
| min_total_tokens | int? | Minimum total tokens (input + output) |
| max_total_tokens | int? | Maximum total tokens |
| avg_total_tokens | int? | Average total tokens |
| min_utilization | float? | Minimum context window utilization % |
| max_utilization | float? | Maximum utilization % |
| avg_utilization | float? | Average utilization % |
| min_duration_ms | int? | Minimum LLM call duration in ms |
| max_duration_ms | int? | Maximum duration |
| avg_duration_ms | int? | Average duration |
| total_entities | int | Total entities across all tasks |
| avg_entities_per_task | float | Average entities per task |
| total_relationships | int | Total relationships |
| avg_relationships_per_task | float | Average relationships per task |
| total_retries | int | Total retry attempts |
| max_retries_single_task | int | Most retries on a single task |
| total_invalid_relationships | int | Total invalid relationships filtered |
| avg_invalid_per_task | float | Average invalid relationships per task |
| total_entities_filtered | int | Entities removed by pipeline filtering |
| total_relationships_filtered | int | Relationships removed by filtering |
| filtering_stage_summary | object[]? | Per-stage filtering breakdown |
| system_prompt | string? | System prompt used for extraction |
| extraction_rules_template | string? | Extraction rules portion of the prompt |
| entity_templates | string? | Entity template portion |
| relationship_templates | string? | Relationship template portion |
| domain_guidance | string? | Domain-specific guidance |
| domain_examples | string? | Domain-specific examples |
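
The utilization fields are percentages of the context window. The exact formula is not given here, so the sketch below assumes utilization is total tokens (input + output) divided by context_window; the server may instead use input tokens only.

```python
from typing import Optional


def utilization_percent(input_tokens: int, output_tokens: int,
                        context_window: Optional[int]) -> Optional[float]:
    """Context window utilization in percent, or None if the window is unknown."""
    if not context_window:
        return None
    # Assumption: both input and output tokens count toward utilization.
    return round(100.0 * (input_tokens + output_tokens) / context_window, 2)


print(utilization_percent(6000, 2000, 32000))
```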

TagResponse

| Field | Type | Description |
| --- | --- | --- |
| id | string | Tag ID |
| database_name | string | Database this tag belongs to |
| name | string | Tag display name |
| color | string? | Hex color code |
| description | string? | Tag description |
| created_at | datetime | Creation timestamp |
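
On the client side, the model maps naturally onto a small typed record. The dataclass below is a sketch, not a published SDK type: the class name `Tag` and the `from_json` helper are assumptions, and the sample payload is the second tag from the List All Tags example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Tag:
    id: str
    database_name: str
    name: str
    color: Optional[str]
    description: Optional[str]
    created_at: datetime

    @classmethod
    def from_json(cls, obj: dict) -> "Tag":
        """Build a Tag from a parsed TagResponse object."""
        return cls(
            id=obj["id"],
            database_name=obj["database_name"],
            name=obj["name"],
            color=obj.get("color"),              # nullable
            description=obj.get("description"),  # nullable
            created_at=datetime.fromisoformat(obj["created_at"]),
        )


tag = Tag.from_json({
    "id": "tag_002",
    "database_name": "default",
    "name": "Technical",
    "color": "#66bb6a",
    "description": None,
    "created_at": "2026-03-02T09:00:00",
})
```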