Sources API
Manage document sources -- upload, process, tag, and monitor extraction.
All endpoints are prefixed with /api/v1/sources unless noted otherwise.
- User guide: Sources — how to upload and manage sources in the UI, CLI, and Python SDK
- Architecture: Extraction Pipeline — how the multi-stage pipeline works internally
Upload & Import
Upload Single File
POST /api/v1/sources
Upload a document via multipart form data. Returns 202 Accepted immediately while
indexing and extraction run in the background.
curl -X POST http://localhost:8080/api/v1/sources \
-F "file=@document.pdf"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | Yes | -- | Document file to upload |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| domain | string | No | null | Force extraction domain (e.g. technical, generic). Auto-detected if omitted. |
| enable_normalization | bool | No | true | Normalize content on upload (encoding fixes, whitespace, OCR cleaning). Disable for code or structured data. |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in PDFs and image files. Default: auto-detect based on vision model configuration. |
| content_filtering | bool | No | true | Filter non-essential content (TOC, legal, boilerplate) from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters: unfiltered, minimal, lenient, balanced, strict, maximum. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip upload if identical content already exists (by SHA-256 hash) |
auto_analyze, enable_normalization, enable_vision, content_filtering, and filtering_mode are persisted on the source row at upload time. Recovery, retry, and re-extract reuse the persisted values by default — clients only re-pass them when they want to override.
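The skip_duplicates parameter in the table above compares content by SHA-256 hash. As a minimal sketch, a client could compute the same kind of digest locally, for example to record which files it has already sent. The helper names are hypothetical, and this reference does not say whether the server hashes the raw file bytes or a normalized form, so a local digest is not guaranteed to match the server's.

```python
import hashlib
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of an in-memory payload."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large documents are not read at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

A client-side cache keyed by this digest only avoids re-sending files the client itself has seen; the server-side check remains the authority on duplicates.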
Response 202 Accepted -- SourceResponse
{
"id": "src_abc123",
"filename": "document.pdf",
"file_type": "pdf",
"file_size": 204800,
"status": "pending",
"enabled": true,
"extraction_depth": "full",
"created_at": "2026-03-09T12:00:00",
"updated_at": "2026-03-09T12:00:00"
}
Key fields shown above. The full response includes lifecycle timestamps (indexing_*, extraction_*, commit_*), LLM metrics (llm_total_calls, llm_total_input_tokens, etc.), and progress fields (current_step, step_description) — all initially null or 0. See SourceResponse for the complete schema.
Use GET /api/v1/sources/{id} to poll the source status as it transitions
through pending -> indexing -> indexed -> extracting -> extracted -> committing -> committed.
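The polling loop described above can be sketched as follows. This is an illustration under assumptions, not an official client: the function names, the polling interval, and the choice of committed and error as stopping states are all hypothetical (error is taken from the status filter of the list endpoint). If extraction is disabled at upload, a different stopping set such as {"indexed", "error"} is likely more appropriate.

```python
import json
import time
import urllib.request
from typing import Callable, List

# Assumed stopping states; pass a different set via `done` if needed.
DEFAULT_DONE = frozenset({"committed", "error"})


def wait_for_source(
    fetch_status: Callable[[], str],
    done: frozenset = DEFAULT_DONE,
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until a status in `done` is seen; return each status change in order."""
    seen: List[str] = []
    for _ in range(max_polls):
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each transition once
        if status in done:
            return seen
        sleep(interval_s)
    raise TimeoutError(f"no stopping status after {max_polls} polls")


def http_status_fetcher(base_url: str, source_id: str) -> Callable[[], str]:
    """Build a fetcher that reads `status` from GET /api/v1/sources/{id}."""
    def fetch() -> str:
        url = f"{base_url}/api/v1/sources/{source_id}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["status"]
    return fetch
```

Usage would look like `wait_for_source(http_status_fetcher("http://localhost:8080", "src_abc123"))`. Passing the fetcher as a callable keeps the loop testable without a running server.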
Batch Upload
POST /api/v1/sources/batch
Upload multiple files simultaneously. Returns 202 Accepted.
curl -X POST http://localhost:8080/api/v1/sources/batch \
-F "files=@doc1.pdf" \
-F "files=@doc2.pdf"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| files | file[] | Yes | -- | Multiple document files |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| enable_normalization | bool | No | true | Normalize content on upload |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in PDFs and image files |
| domain | string | No | null | Force extraction domain |
| content_filtering | bool | No | true | Filter non-essential content (TOC, legal, boilerplate) from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip files whose content already exists |
Response 202 Accepted
{
"uploaded": 2,
"failed": 0,
"files": [
{ "id": "src_abc123", "filename": "doc1.pdf", "status": "pending", "..." : "..." },
{ "id": "src_def456", "filename": "doc2.pdf", "status": "pending", "..." : "..." }
],
"errors": []
}
Each item in files is a full SourceResponse. When a file fails,
it appears in errors instead:
{
"uploaded": 1,
"failed": 1,
"files": [ { "..." : "..." } ],
"errors": [
{ "filename": "bad.xyz", "error": "Unsupported file type" }
]
}
Returns 400 if the number of files exceeds the configured max_upload_files limit.
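Because a batch can partially succeed, a client usually needs to separate accepted files from rejected ones. The sketch below does that for a response shaped like the examples above. The function name is hypothetical, and the consistency check assumes that uploaded and failed always equal the lengths of files and errors, which the examples suggest but do not guarantee.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def split_batch_result(body: dict) -> Tuple[List[Pair], List[Pair]]:
    """Return (accepted, rejected) as (filename, id) and (filename, error) pairs."""
    accepted = [(f["filename"], f["id"]) for f in body.get("files", [])]
    rejected = [(e["filename"], e["error"]) for e in body.get("errors", [])]
    # Assumed invariant: the counters agree with the list lengths.
    if body.get("uploaded") != len(accepted) or body.get("failed") != len(rejected):
        raise ValueError("batch counters do not match the files/errors lists")
    return accepted, rejected
```

Lists of pairs are used instead of dictionaries so that two uploads with the same filename are both preserved.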
Import from URL
POST /api/v1/sources/url
Fetch a web page, extract clean markdown content, and process it through the standard file pipeline.
curl -X POST http://localhost:8080/api/v1/sources/url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | -- | URL to import (must start with http:// or https://) |
| extract_entities | bool | No | true | Run entity extraction after indexing |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| enable_normalization | bool | No | true | Normalize content on upload |
| enable_vision | bool | No | auto-detect | Enable vision processing for images in fetched HTML / PDFs |
| domain | string | No | null | Force extraction domain |
| content_filtering | bool | No | true | Filter non-essential content from entity extraction. Filtered content remains searchable via RAG. |
| filtering_mode | string | No | balanced | Strictness of post-extraction filters. See Filtering Modes. |
| skip_duplicates | bool | No | false | Skip if identical content exists |
The URL fetcher validates the upstream Content-Type against the same allowlist used for direct file uploads (batching.allowed_content_types). It honors any charset=… parameter in the response header and routes binary responses (PDF, ZIP, DOCX, etc.) to the binary loader path, so an application/pdf URL is parsed as a PDF rather than as HTML.
Response 202 Accepted -- SourceResponse
The response is identical in shape to the single file upload response. The source_type
will be webpage and origin_url will contain the imported URL.
| Status | Description |
|---|---|
| 400 | Invalid URL format |
| 422 | Failed to fetch URL or content shorter than 50 characters |
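A client can reject malformed URLs before calling the endpoint and avoid a 400 round trip. The sketch below builds the request with the standard library only. The function name is hypothetical, and sending the optional parameters as extra JSON keys next to url is an assumption based on the curl example, which only shows url.

```python
import json
import urllib.request


def build_url_import_request(base_url: str, url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/sources/url request."""
    # Mirrors the documented rule: the URL must start with http:// or https://.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    payload = {"url": url, **options}
    return urllib.request.Request(
        f"{base_url}/api/v1/sources/url",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen`; a 422 would then surface as an `urllib.error.HTTPError`.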
Source CRUD
List Sources
GET /api/v1/sources
Paginated list of sources with optional filters. Returns PaginatedSourcesResponse containing SourceSummaryResponse items.
curl "http://localhost:8080/api/v1/sources?status=committed"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default (50) | Items per page (capped at max_page_size) |
| source_type | string | No | null | Filter by source type (pdf, text, csv, webpage, etc.) |
| status | string | No | null | Filter by processing status (pending, indexing, indexed, extracting, extracted, committing, committed, error) |
| enabled | string | No | null | Filter by enabled state: enabled or disabled |
| search | string | No | null | Search in title and origin URL |
| tag_id | string | No | null | Filter by tag ID |
Response 200 OK -- PaginatedSourcesResponse
{
"data": [
{
"id": "src_abc123",
"filename": "research-paper.pdf",
"file_type": "pdf",
"file_size": 204800,
"title": "A Research Paper",
"status": "committed",
"chunk_count": 42,
"extraction_entities_count": 85,
"extraction_relationships_count": 120,
"cached_quality_grade": "A",
"tags": [
{ "id": "tag_001", "name": "Research", "color": "#4dabf5" }
],
"created_at": "2026-03-09T10:00:00"
}
],
"pagination": {
"total": 1,
"page": 1,
"page_size": 20,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}
Each item is a SourceSummaryResponse with additional fields including embedding info, LLM metrics, duration timings, and quality scores.
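To walk every page of this listing, a client can follow pagination.has_next until it is false. The generator below is a minimal sketch of that loop. The names are hypothetical, and fetch_page stands for whatever function performs the GET for a given page number and returns the parsed JSON body.

```python
from typing import Callable, Iterator


def iter_sources(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every source summary across pages, starting at page 1."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["data"]
        # Stop as soon as the server reports no further page.
        if not body["pagination"]["has_next"]:
            return
        page += 1
```

Relying on has_next rather than computing the page count locally keeps the loop correct if the server caps page_size at max_page_size.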
Get Source
GET /api/v1/sources/{source_id}
Returns the full source detail including all lifecycle fields, LLM metrics, and user metadata.
curl http://localhost:8080/api/v1/sources/src_abc123
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK -- SourceResponse
See the Upload Single File section for a full response example.
| Status | Description |
|---|---|
| 404 | Source not found |
Update Source
PATCH /api/v1/sources/{source_id}
Update mutable source fields.
curl -X PATCH http://localhost:8080/api/v1/sources/src_abc123 \
-H "Content-Type: application/json" \
-d '{
"title": "Updated Title",
"enabled": true,
"user_metadata": { "category": "research" }
}'
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| title | string | No | New display title |
| processing_status | string | No | Override status (ready or error) |
| enabled | bool | No | Enable or disable the source |
| user_metadata | object | No | Arbitrary key-value metadata |
Response 200 OK -- SourceResponse
| Status | Description |
|---|---|
| 404 | Source not found |
Delete Source
DELETE /api/v1/sources/{source_id}
Permanently deletes the source and cascades to all chunks, citations, graph nodes, edges, templates, and search index entries.
curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 204 No Content
| Status | Description |
|---|---|
| 404 | Source not found |
Source Metadata
List Extraction Domains
GET /api/v1/sources/domains
Returns available extraction domains for dropdown selection. Includes built-in domains and any per-database custom domains.
curl http://localhost:8080/api/v1/sources/domains
Response 200 OK
{
"domains": [
{
"name": "generic",
"description": "General-purpose entity extraction",
"builtin": true,
"extraction_density": "medium",
"prompt_tokens": 1200
},
{
"name": "technical",
"description": "Technical documentation and specifications",
"builtin": true,
"extraction_density": "high",
"prompt_tokens": 1800
}
]
}
Get Processing Stats
GET /api/v1/sources/stats
Aggregate processing statistics across all sources.
curl http://localhost:8080/api/v1/sources/stats
Response 200 OK
{
"total_files": 25,
"by_status": {
"committed": 20,
"indexed": 3,
"error": 2
},
"total_chunks": 1042,
"total_entities": 850,
"total_relationships": 1200
}
Extraction Management
Trigger Extraction
POST /api/v1/sources/{source_id}/extraction
Trigger manual entity extraction for a source. The source must be in indexed or
extracted status. Returns 202 Accepted while extraction runs in the background.
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/extraction \
-H "Content-Type: application/json" \
-d '{
"analysis_depth": "full",
"domain": "technical",
"force": false
}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| analysis_depth | string | No | full | Extraction depth: full or quick |
| domain | string | No | null | Force extraction domain. Auto-detected if omitted. |
| filtering_mode | string | No | persisted | Override the source's persisted filtering_mode for this run only. |
| force | bool | No | false | Re-extract even if extraction results already exist |
The endpoint reuses the source's persisted upload settings (filtering_mode, enable_vision, content_filtering) by default. Pass them in the body to override per-call without changing the row.
Response 202 Accepted
{
"source_id": "src_abc123",
"job_id": "job_xyz789",
"status": "queued",
"message": "Extraction started"
}
| Status | Description |
|---|---|
| 400 | Source is not in an extractable state |
| 404 | Source not found |
| 409 | Extraction already in progress (use force=true to re-extract) |
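Since omitted settings fall back to the values persisted on the source, a request body should contain only the fields the caller wants to override. The sketch below builds such a body and validates the enumerated values listed in this reference. The function name is hypothetical, and the set of filtering modes is copied from the single-file upload table on the assumption that this endpoint accepts the same values.

```python
from typing import Optional

ANALYSIS_DEPTHS = ("full", "quick")
FILTERING_MODES = ("unfiltered", "minimal", "lenient", "balanced", "strict", "maximum")


def build_extraction_body(
    analysis_depth: Optional[str] = None,
    domain: Optional[str] = None,
    filtering_mode: Optional[str] = None,
    force: bool = False,
) -> dict:
    """Return a JSON-ready body holding only the explicitly overridden fields."""
    if analysis_depth is not None and analysis_depth not in ANALYSIS_DEPTHS:
        raise ValueError(f"analysis_depth must be one of {ANALYSIS_DEPTHS}")
    if filtering_mode is not None and filtering_mode not in FILTERING_MODES:
        raise ValueError(f"filtering_mode must be one of {FILTERING_MODES}")
    candidates = {
        "analysis_depth": analysis_depth,
        "domain": domain,
        "filtering_mode": filtering_mode,
    }
    # Drop unset keys so the server applies its defaults and persisted values.
    body = {key: value for key, value in candidates.items() if value is not None}
    if force:
        body["force"] = True
    return body
```

Calling `build_extraction_body(filtering_mode="strict")` overrides only the filtering mode for that run and leaves every other setting to the server.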
Get Extraction Progress
GET /api/v1/sources/{source_id}/extraction
Returns detailed extraction progress including job status, chunk-level counts, and timing estimates.
curl http://localhost:8080/api/v1/sources/src_abc123/extraction
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK
{
"source_id": "src_abc123",
"job_id": "job_xyz789",
"status": "running",
"has_extraction_job": true,
"total_chunks": 10,
"completed_chunks": 6,
"failed_chunks": 0,
"progress_percent": 60.0,
"chunks_by_status": {
"completed": 6,
"running": 1,
"queued": 3
},
"total_entities": 52,
"total_relationships": 78,
"extraction_depth": "full",
"started_at": "2026-03-09T10:00:00",
"completed_at": null,
"timing": {
"avg_duration_ms": 4200,
"min_duration_ms": 2100,
"max_duration_ms": 6800
},
"current_chunk": {
"chunk_index": 6,
"status": "running",
"started_at": "2026-03-09T10:02:30"
}
}
When no extraction job exists:
{
"source_id": "src_abc123",
"status": "indexed",
"has_extraction_job": false,
"message": "No active extraction job for this source"
}
| Status | Description |
|---|---|
| 404 | Source not found |
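The chunk counts and the timing block are enough for a rough client-side estimate of the time remaining. The sketch below multiplies the unfinished chunks by avg_duration_ms. This is an assumption-laden heuristic, not something the API reports: the function name is hypothetical, and the result is an upper bound if the server processes chunks concurrently.

```python
from typing import Optional


def estimate_remaining_ms(progress: dict) -> Optional[int]:
    """Rough time left: unfinished chunks times the average chunk duration."""
    if not progress.get("has_extraction_job"):
        return None  # no job, nothing to estimate
    timing = progress.get("timing") or {}
    avg_ms = timing.get("avg_duration_ms")
    if avg_ms is None:
        return None  # no chunk has finished yet, so no average exists
    finished = progress["completed_chunks"] + progress["failed_chunks"]
    remaining = max(progress["total_chunks"] - finished, 0)
    return remaining * avg_ms
```

For the running example above (10 chunks, 6 completed, 0 failed, 4200 ms average) the estimate is 4 × 4200 = 16800 ms.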
Cancel Extraction
DELETE /api/v1/sources/{source_id}/extraction
Cancels all pending and queued extraction chunks. Already running or completed
chunks are not affected. Source status reverts to indexed (RAG search still
works).
curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/extraction
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 204 No Content
| Status | Description |
|---|---|
| 404 | Source not found or no active extraction job |
Reclassify Source Domain
POST /api/v1/sources/{source_id}/reclassify
Change the extraction domain for a source and queue a new extraction pass.
Returns 202 Accepted while the new extraction runs in the background.
For sources that are already committed, this endpoint atomically resets prior
graph artifacts (nodes, edges, templates) before dispatching so the new
extraction starts clean.
When to use: When auto-detection chose the wrong domain, or when you want to
re-run extraction under a different domain template after reviewing the initial
results. Prefer this over setting domain at upload time — reclassify decouples
domain selection from the upload flow.
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/reclassify \
-H "Content-Type: application/json" \
-d '{"domain": "medical"}'
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| domain | string | Yes | Domain name to use (e.g. medical, legal). See GET /api/v1/sources/domains. |
Response 202 Accepted
{
"source_id": "src_abc123",
"status": "extracting"
}
| Status | Description |
|---|---|
| 400 | Source is not in a reclassifiable state (indexed or committed required) |
| 404 | Source not found |
| 503 | No LLM provider configured |
List Extraction Tasks
GET /api/v1/sources/{source_id}/extraction/tasks
Paginated list of individual chunk extraction tasks (LLM processing groups). Useful for debugging and analytics.
curl "http://localhost:8080/api/v1/sources/src_abc123/extraction/tasks?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| include_content | bool | No | false | Include full input_text and llm_response_json (large payloads) |
Response 200 OK -- ExtractionTaskListResponse
{
"tasks": [
{
"id": "task_001",
"job_id": "job_xyz789",
"chunk_index": 0,
"hierarchical_group_id": "group_a",
"small_chunk_ids": ["chunk_001", "chunk_002"],
"status": "completed",
"created_at": "2026-03-09T10:00:00",
"queued_at": "2026-03-09T10:00:01",
"started_at": "2026-03-09T10:00:05",
"completed_at": "2026-03-09T10:00:09",
"llm_duration_ms": 3800,
"retry_count": 0,
"entity_count": 8,
"relationship_count": 12,
"invalid_relationship_count": 1,
"small_chunk_numbers": [1, 2],
"input_text_length": 3200,
"llm_response_length": 1800,
"input_tokens": 1100,
"output_tokens": 620,
"context_window_available": 128000,
"input_text": null,
"llm_response_json": null,
"filtering_log": null,
"error_message": null,
"error_type": null
}
],
"total": 10,
"page": 1,
"page_size": 20
}
Set include_content=true to populate input_text and llm_response_json.
By default only their lengths are returned for performance.
Get Extraction Task
GET /api/v1/sources/{source_id}/extraction/tasks/{task_id}
Returns a single extraction task with full details, including content fields.
curl http://localhost:8080/api/v1/sources/src_abc123/extraction/tasks/task_001
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| task_id | string (path) | Yes | Extraction task ID |
Response 200 OK -- ExtractionTaskResponse
Same shape as items in the task list, but with input_text, llm_response_json,
and filtering_log fully populated.
| Status | Description |
|---|---|
| 404 | Extraction task not found |
Get Extraction Task Stats
GET /api/v1/sources/{source_id}/extraction/stats
Aggregate statistics (min/avg/max) for extraction tasks, computed via SQL aggregates without loading every row.
curl http://localhost:8080/api/v1/sources/src_abc123/extraction/stats
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK -- ExtractionTaskStatsResponse
{
"total_tasks": 10,
"context_window": 128000,
"min_input_tokens": 800,
"max_input_tokens": 2400,
"avg_input_tokens": 1500,
"min_output_tokens": 300,
"max_output_tokens": 900,
"avg_output_tokens": 600,
"min_total_tokens": 1100,
"max_total_tokens": 3300,
"avg_total_tokens": 2100,
"min_utilization": 0.86,
"max_utilization": 2.58,
"avg_utilization": 1.64,
"min_duration_ms": 2100,
"max_duration_ms": 6800,
"avg_duration_ms": 4200,
"total_entities": 85,
"avg_entities_per_task": 8.5,
"total_relationships": 120,
"avg_relationships_per_task": 12.0,
"total_retries": 2,
"max_retries_single_task": 1,
"total_invalid_relationships": 5,
"avg_invalid_per_task": 0.5,
"total_entities_filtered": 3,
"total_relationships_filtered": 7,
"filtering_stage_summary": [
{ "stage": "exact_dedup", "total_removed": 2, "chunk_count": 2 },
{ "stage": "relationship_dedup", "total_removed": 5, "chunk_count": 3 }
],
"system_prompt": "You are an entity extraction assistant...",
"extraction_rules_template": "...",
"entity_templates": "...",
"relationship_templates": "...",
"domain_guidance": "...",
"domain_examples": "..."
}
| Status | Description |
|---|---|
| 404 | No extraction statistics available for this source |
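The utilization fields in the example are consistent with total tokens expressed as a percentage of context_window. That formula is inferred from the sample numbers, not stated by this reference, so treat the sketch below as a plausibility check and not as the server's definition. The function name is hypothetical.

```python
def utilization_percent(total_tokens: int, context_window: int) -> float:
    """Share of the context window used by one task, in percent (2 decimals)."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return round(100.0 * total_tokens / context_window, 2)
```

With the sample values, 1100 / 128000 gives 0.86, 3300 / 128000 gives 2.58, and 2100 / 128000 gives 1.64, matching min_utilization, max_utilization, and avg_utilization. Values this low indicate that each task uses only a small part of the available window.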
Get Extraction Chart Data
GET /api/v1/sources/{source_id}/extraction/charts
Returns all extraction tasks with minimal fields for UI chart rendering. No pagination -- returns all tasks at once for efficient charting.
curl http://localhost:8080/api/v1/sources/src_abc123/extraction/charts
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK
[
{
"chunk_index": 0,
"status": "completed",
"retry_count": 0,
"entity_count": 8,
"relationship_count": 12,
"input_text_length": 3200,
"llm_duration_ms": 3800
},
{
"chunk_index": 1,
"status": "completed",
"retry_count": 1,
"entity_count": 6,
"relationship_count": 9,
"input_text_length": 2800,
"llm_duration_ms": 5200
}
]
Get Cross-Chunk Filtering Log
GET /api/v1/sources/{source_id}/extraction/filteringlog
Returns the cross-chunk deduplication filtering log from the post-extraction merging stage. Shows entities and relationships removed during structural filtering, exact/semantic deduplication, and relationship deduplication.
curl http://localhost:8080/api/v1/sources/src_abc123/extraction/filteringlog
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK
{
"stages": [
{
"stage": "exact_dedup",
"removed_count": 3,
"details": ["Entity 'Python' duplicate removed", "..."]
}
],
"total_removed": 5
}
| Status | Description |
|---|---|
| 404 | Source not found or no filtering log available |
Retry Errored Source
POST /api/v1/sources/{source_id}/retry
Manually retry a source that is in error status. The retry target is determined
by error_stage on the source record — the service routes the source back to the
appropriate pipeline stage (indexing, extraction, or commit).
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/retry
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK -- SourceResponse
Returns the updated source record after the retry has been queued.
| Status | Description |
|---|---|
| 404 | Source not found |
| 409 | Source is not in error status or error_stage is unknown |
Re-extract Source
POST /api/v1/sources/{source_id}/re_extract
Manually re-run entity extraction on a source (distinct from retry). The key difference:
- Retry preserves the cached extraction payload and re-runs only the failed stage (cheap — no additional LLM tokens for commit-only retries).
- Re-extract discards the cached payload and any previous extraction results, resets the source to indexed, and re-runs the full LLM extraction (expensive; costs LLM tokens).
Use this when you want to re-analyze a document after changing the extraction domain, fixing domain-specific rules, or correcting the initial extraction output.
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/re_extract
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 202 Accepted -- SourceResponse
Returns the updated source record after the re-extraction job has been queued.
Allowed source states:
- committed: Atomically deletes graph artifacts (nodes, edges, templates), resets to indexed, and re-extracts.
- error (after a post-indexing stage): Resets to indexed, clears the cached payload, and re-extracts.
- indexed / extracted / extracting / mcp_extracting / committing: Forcibly resets to indexed, clears the payload, and re-extracts.
Rejected source states:
- pending / indexing: Returns 422; the source has not yet produced extraction artifacts. Wait for indexing to complete, then call this endpoint again.
Re-extract reuses the source's persisted upload settings (auto_analyze, enable_normalization, enable_vision, content_filtering, filtering_mode) by default — what you uploaded with is what you re-extract with. Clients can override any of these per call by passing them in the request body.
force_re_extract also resets every quality counter on the source row back to zero and clears vector_indexing_status to pending, so the new run starts with a clean counter set.
| Status | Description |
|---|---|
| 404 | Source not found |
| 422 | Source is in pending or indexing state (indexing has not completed) |
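The allowed and rejected states above can be checked on the client before calling the endpoint. The sketch below encodes them as a lookup. The function name is hypothetical, and so is the literal "indexing" used for error_stage: this reference names the pipeline stages (indexing, extraction, commit) but does not give the exact strings stored on the source record.

```python
from typing import Optional

# States that the re-extract endpoint is documented to accept unconditionally.
ALWAYS_ALLOWED = frozenset(
    {"committed", "indexed", "extracted", "extracting", "mcp_extracting", "committing"}
)
# States that are documented to return 422.
REJECTED = frozenset({"pending", "indexing"})


def can_re_extract(status: str, error_stage: Optional[str] = None) -> bool:
    """Return True if a re-extract call is expected to be accepted."""
    if status in REJECTED:
        return False
    if status == "error":
        # Only errors raised after indexing qualify; the stage string is assumed.
        return error_stage is not None and error_stage != "indexing"
    return status in ALWAYS_ALLOWED
```

Unknown statuses return False so that a new server-side state is never treated as re-extractable by accident.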
List Source Recovery Events
GET /api/v1/sources/{source_id}/recovery_events
Returns the recovery audit trail for a source — every automatic recovery attempt, what was dispatched, and when. Backs the source detail page's recovery panel so operators can diagnose repeated failures without grepping container logs. Events are returned newest first.
curl "http://localhost:8080/api/v1/sources/src_abc123/recovery_events?limit=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| limit | int (query) | No | 50 | Maximum events to return (1--200) |
Recovery events use a ?limit= cap (max 200) rather than the standard ?page=&page_size= pagination model — events are an audit trail, not a paged resource collection.
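Because this endpoint takes a bounded limit instead of page parameters, a client can validate the value before sending it. The sketch below builds the request URL and enforces the documented 1 to 200 range. The function name is hypothetical, and how the server reacts to an out-of-range limit is not specified here, so the sketch rejects it locally.

```python
from urllib.parse import quote, urlencode


def recovery_events_url(base_url: str, source_id: str, limit: int = 50) -> str:
    """Build the recovery_events URL, enforcing the documented limit range."""
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    path = f"/api/v1/sources/{quote(source_id, safe='')}/recovery_events"
    return f"{base_url}{path}?{urlencode({'limit': limit})}"
```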
Response 200 OK
{
"events": [
{
"id": "rev_001",
"source_id": "src_abc123",
"event_type": "recovery",
"created_at": "2026-03-09T10:05:00",
"reason": "Stalled extraction detected",
"dispatched_operation": "OP_EXTRACT_SOURCE"
}
]
}
| Status | Description |
|---|---|
| 404 | Source not found |
Cleanup Orphan Chunk Tasks
POST /api/v1/sources/cleanup/orphan_tasks
Triggers an immediate sweep of orphaned chunk tasks — tasks whose parent extraction job completed or failed but whose rows were not updated. Normally run automatically on a schedule; use this endpoint to trigger it on demand after bulk operations or during recovery.
curl -X POST http://localhost:8080/api/v1/sources/cleanup/orphan_tasks
Response 200 OK
{
"deleted_count": 12,
"retention_days": 7
}
| Field | Type | Description |
|---|---|---|
| deleted_count | int | Number of orphaned task rows removed |
| retention_days | int | Configured orphan retention window (from SourceRecoverySettings) |
Abort All Processing
DELETE /api/v1/sources/{source_id}/processing
Cancels all queued/running tasks (indexing or extraction) and resets the source status appropriately.
curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/processing
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 204 No Content
| Status | Description |
|---|---|
| 400 | Source is not in a processing state |
| 404 | Source not found |
Status after abort, by prior state:
- pending / indexing -> error (with message "Processing/Indexing aborted by user")
- extracting -> indexed (RAG still usable)
- committing -> extracted
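The abort transitions listed above amount to a small lookup table, which a client can use to predict the status it should see after a successful abort. The names below are hypothetical; the mapping itself is taken from this section, and any status outside it corresponds to the documented 400 response.

```python
# Prior status -> status after DELETE /api/v1/sources/{source_id}/processing.
ABORT_TRANSITIONS = {
    "pending": "error",
    "indexing": "error",
    "extracting": "indexed",
    "committing": "extracted",
}


def status_after_abort(status: str) -> str:
    """Return the expected status after an abort, or raise if not abortable."""
    try:
        return ABORT_TRANSITIONS[status]
    except KeyError:
        # Mirrors the 400 "Source is not in a processing state" response.
        raise ValueError(f"{status!r} is not a processing state") from None
```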
Chunks
List Chunks
GET /api/v1/sources/{source_id}/chunks
Paginated list of document chunks for a source.
curl "http://localhost:8080/api/v1/sources/src_abc123/chunks?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| status | string | No | null | Filter by chunk status |
Response 200 OK -- ChunkListResponse
{
"chunks": [
{
"id": "chunk_001",
"source_id": "src_abc123",
"chunk_index": 0,
"content": "This is the first chunk of the document...",
"page_number": 1,
"section": "Introduction",
"group_index": 0,
"status": "indexed",
"created_at": "2026-03-09T10:00:05"
}
],
"total": 42,
"page": 1,
"page_size": 20
}
Get Chunk
GET /api/v1/sources/{source_id}/chunks/{chunk_id}
Returns a single chunk by ID.
curl http://localhost:8080/api/v1/sources/src_abc123/chunks/chunk_001
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
| chunk_id | string (path) | Yes | Chunk ID |
Response 200 OK -- ChunkResponse
{
"id": "chunk_001",
"source_id": "src_abc123",
"chunk_index": 0,
"content": "This is the first chunk of the document...",
"page_number": 1,
"section": "Introduction",
"group_index": 0,
"status": "indexed",
"created_at": "2026-03-09T10:00:05"
}
| Status | Description |
|---|---|
| 404 | Chunk not found or does not belong to this source |
Citations
List Citations
GET /api/v1/sources/{source_id}/citations
Paginated list of entity citations (attributions) for a source. Each citation links an extracted entity back to the source chunk it was found in.
curl "http://localhost:8080/api/v1/sources/src_abc123/citations?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
Response 200 OK -- CitationListResponse
{
"citations": [
{
"id": "cit_001",
"entity_uri": "urn:chaoscypher:node:abc123",
"entity_label": "Python",
"entity_type": "Programming Language",
"source_id": "src_abc123",
"chunk_id": "chunk_001",
"confidence": 0.95,
"extraction_method": "llm",
"context_snippet": "...Python is a versatile programming language...",
"created_at": "2026-03-09T10:05:00"
}
],
"total": 85,
"page": 1,
"page_size": 20
}
Source Data Access
Get Source Stats
GET /api/v1/sources/{source_id}/stats
Returns computed statistics for a single source.
curl http://localhost:8080/api/v1/sources/src_abc123/stats
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK
{
"chunk_count": 42,
"citation_count": 85,
"entity_count": 65,
"relationship_count": 120,
"total_content_length": 52000,
"avg_chunk_length": 1238
}
| Status | Description |
|---|---|
| 404 | Source not found |
Get Source Entities
GET /api/v1/sources/{source_id}/entities
Paginated list of entities extracted from the document. Each entity includes
a computed quality_score (0-100).
curl "http://localhost:8080/api/v1/sources/src_abc123/entities?page=1&page_size=20&sort_by=quality&sort_order=desc"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| sort_by | string | No | default | Sort field: default, quality, confidence, name, type |
| sort_order | string | No | desc | Sort direction: asc or desc |
Response 200 OK
{
"entities": [
{
"name": "Python",
"type": "ProgrammingLanguage",
"confidence": 0.95,
"description": "A versatile programming language",
"source_chunks": [0, 3, 7],
"quality_score": 92.5
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 85,
"total_pages": 5,
"has_next": true,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |
Get Source Relationships
GET /api/v1/sources/{source_id}/relationships
Paginated list of relationships extracted from the document. Each relationship
is enriched with human-readable from and to entity names.
curl "http://localhost:8080/api/v1/sources/src_abc123/relationships?page=1&page_size=20"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
Response 200 OK
{
"relationships": [
{
"source": 0,
"target": 5,
"type": "USES",
"confidence": 0.88,
"from": "FastAPI",
"to": "Python"
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 120,
"total_pages": 6,
"has_next": true,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |
Get Source Templates
GET /api/v1/sources/{source_id}/templates
Paginated list of graph templates created from extraction of this source.
curl "http://localhost:8080/api/v1/sources/src_abc123/templates?page=1&page_size=20&template_type=node"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| template_type | string | No | null | Filter by type: node or edge |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
Response 200 OK
{
"templates": [
{
"id": "template_abc123",
"name": "ProgrammingLanguage",
"type": "node",
"source_id": "src_abc123",
"properties": ["name", "paradigm", "version"]
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 12,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}
| Status | Description |
|---|---|
| 404 | Source not found |
Get Source LLM Metrics
GET /api/v1/sources/{source_id}/llm_metrics
Summary of LLM usage metrics for a source, including call counts, token consumption, cost estimates, and derived rates.
curl http://localhost:8080/api/v1/sources/src_abc123/llm_metrics
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_id | string (path) | Yes | Source ID |
Response 200 OK
{
"source_id": "src_abc123",
"has_metrics": true,
"summary": {
"total_calls": 6,
"successful_calls": 6,
"failed_calls": 0,
"retry_calls": 1,
"first_try_successes": 5,
"retry_successes": 1,
"permanent_failures": 0,
"total_input_tokens": 24000,
"total_output_tokens": 8500,
"wasted_tokens": 400,
"avg_call_duration_ms": 4200,
"total_duration_ms": 25200,
"estimated_cost_usd": 0.0325,
"error_counts": {},
"model": "gpt-4o",
"success_rate": 1.0,
"retry_rate": 0.167,
"waste_percentage": 0.012
}
}
| Status | Description |
|---|---|
| 404 | Source not found |
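The derived rates in the summary can be reproduced from the raw counters, which is useful when aggregating metrics across several sources. The formulas below are inferred from the sample values and are not documented here, so they are an assumption; the function name is hypothetical. Note that waste_percentage in the example is a fraction (0.012), not a percentage.

```python
def derived_rates(summary: dict) -> dict:
    """Recompute success_rate, retry_rate, and waste_percentage from counters."""
    calls = summary["total_calls"]
    tokens = summary["total_input_tokens"] + summary["total_output_tokens"]

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against division by zero for sources with no LLM activity.
        return round(numerator / denominator, 3) if denominator else 0.0

    return {
        "success_rate": ratio(summary["successful_calls"], calls),
        "retry_rate": ratio(summary["retry_calls"], calls),
        "waste_percentage": ratio(summary["wasted_tokens"], tokens),
    }
```

With the sample summary: 6 / 6 = 1.0, 1 / 6 rounds to 0.167, and 400 / (24000 + 8500) rounds to 0.012, which matches the three derived fields shown above.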
List Source LLM Calls
GET /api/v1/sources/{source_id}/llm_metrics/calls
Paginated list of individual LLM API calls made during extraction of this source.
curl "http://localhost:8080/api/v1/sources/src_abc123/llm_metrics/calls?page=1&page_size=20&success=true"
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| source_id | string (path) | Yes | -- | Source ID |
| page | int | No | 1 | Page number (1-indexed) |
| page_size | int | No | Server default | Items per page |
| success | bool | No | null | Filter by success status |
Response 200 OK
{
"calls": [
{
"id": "call_001",
"source_id": "src_abc123",
"success": true,
"input_tokens": 1100,
"output_tokens": 620,
"duration_ms": 3800,
"model": "gpt-4o",
"created_at": "2026-03-09T10:00:05"
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 6,
"total_pages": 1,
"has_next": false,
"has_prev": false
}
}
| Status | Description |
|---|---|
404 | Source not found |
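
To read every call rather than a single page, a client can follow pagination.has_next. The sketch below is one way to do that with the standard library only. The function names are hypothetical, the base URL is taken from the curl examples, and authentication and error handling are left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Host and port as used in the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1/sources"


def fetch_llm_calls_page(source_id: str, page: int, page_size: int = 20) -> dict:
    """Fetch one page of LLM calls for a source and return the decoded JSON body."""
    query = urllib.parse.urlencode({"page": page, "page_size": page_size})
    url = f"{BASE_URL}/{source_id}/llm_metrics/calls?{query}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_llm_calls(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every call record, requesting pages until has_next is false."""
    page = 1  # Pages are 1-indexed.
    while True:
        body = fetch_page(page)
        yield from body["calls"]
        if not body["pagination"]["has_next"]:
            break
        page += 1
```

A caller would bind the source ID first, for example `iter_llm_calls(lambda page: fetch_llm_calls_page("src_abc123", page))`. Passing the fetcher as a parameter keeps the loop testable without a running server.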
Tags
Tag endpoints are mounted at /api/v1/sources/tags for tag CRUD, and nested
under individual sources for tag assignment.
List All Tags
GET /api/v1/sources/tags
Returns all tags in the current database.
curl http://localhost:8080/api/v1/sources/tags
Response 200 OK -- list[TagResponse]
[
{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
},
{
"id": "tag_002",
"database_name": "default",
"name": "Technical",
"color": "#66bb6a",
"description": null,
"created_at": "2026-03-02T09:00:00"
}
]
Get Tag
GET /api/v1/sources/tags/{tag_id}
Returns a single tag by ID.
curl http://localhost:8080/api/v1/sources/tags/tag_001
| Parameter | Type | Required | Description |
|---|---|---|---|
tag_id | string (path) | Yes | Tag ID |
Response 200 OK -- TagResponse
{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
}
| Status | Description |
|---|---|
404 | Tag not found |
Create Tag
POST /api/v1/sources/tags
Create a new tag.
curl -X POST http://localhost:8080/api/v1/sources/tags \
-H "Content-Type: application/json" \
-d '{
"name": "Research",
"color": "#4dabf5",
"description": "Research papers"
}'
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
name | string | Yes | -- | Tag display name |
color | string | No | null | Hex color code (e.g. #4dabf5) |
description | string | No | null | Tag description |
Response 201 Created -- TagResponse
| Status | Description |
|---|---|
400 | Duplicate tag name or validation error |
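
Because a duplicate name is rejected with 400, scripts that run more than once may prefer a get-or-create flow. The sketch below assumes exact, case-sensitive name matching, which this reference does not specify. `ensure_tag` and its two callables are hypothetical; they stand in for whatever HTTP client calls List All Tags and Create Tag.

```python
from typing import Callable, List, Optional


def ensure_tag(
    name: str,
    list_tags: Callable[[], List[dict]],
    create_tag: Callable[[dict], dict],
    color: Optional[str] = None,
    description: Optional[str] = None,
) -> dict:
    """Return the existing tag with this name, or create it if none exists.

    list_tags wraps GET /api/v1/sources/tags and returns the decoded list.
    create_tag wraps POST /api/v1/sources/tags and returns the new TagResponse.
    """
    for tag in list_tags():
        # Assumption: names are compared exactly, including case.
        if tag["name"] == name:
            return tag
    return create_tag({"name": name, "color": color, "description": description})
```

This check is not atomic: two clients racing on the same name can still see a 400 from the second POST, so callers may want to handle that response as well.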
Update Tag
PATCH /api/v1/sources/tags/{tag_id}
Update tag properties. All fields are optional.
curl -X PATCH http://localhost:8080/api/v1/sources/tags/tag_001 \
-H "Content-Type: application/json" \
-d '{ "name": "Updated Name", "color": "#ff5722" }'
| Parameter | Type | Required | Description |
|---|---|---|---|
tag_id | string (path) | Yes | Tag ID |
name | string | No | Updated tag name |
color | string | No | Updated hex color |
description | string | No | Updated description |
Response 200 OK -- TagResponse
| Status | Description |
|---|---|
400 | Duplicate tag name or validation error |
404 | Tag not found |
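
Since every body field is optional, a client can send only the fields it wants to change. The sketch below builds such a body and uses a sentinel so that an explicit null can still be sent. Whether a null description clears the stored value is an assumption this reference does not confirm, and `build_tag_patch` is a hypothetical helper.

```python
import json

# Sentinel that distinguishes "not provided" from an explicit None (JSON null).
_UNSET = object()


def build_tag_patch(name=_UNSET, color=_UNSET, description=_UNSET) -> dict:
    """Build a PATCH body that contains only the fields the caller provided."""
    fields = {"name": name, "color": color, "description": description}
    return {key: value for key, value in fields.items() if value is not _UNSET}


# Same body as the curl example above.
body = json.dumps(build_tag_patch(name="Updated Name", color="#ff5722"))
```

Calling `build_tag_patch(description=None)` yields `{"description": None}`, which serializes to an explicit JSON null, while omitted arguments never appear in the body.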
Delete Tag
DELETE /api/v1/sources/tags/{tag_id}
Delete a tag together with every source-tag association that references it.
curl -X DELETE http://localhost:8080/api/v1/sources/tags/tag_001
| Parameter | Type | Required | Description |
|---|---|---|---|
tag_id | string (path) | Yes | Tag ID |
Response 204 No Content
| Status | Description |
|---|---|
404 | Tag not found |
Source Tag Assignment
List Tags for Source
GET /api/v1/sources/{source_id}/tags
Returns all tags assigned to a specific source.
curl http://localhost:8080/api/v1/sources/src_abc123/tags
| Parameter | Type | Required | Description |
|---|---|---|---|
source_id | string (path) | Yes | Source ID |
Response 200 OK -- list[TagResponse]
[
{
"id": "tag_001",
"database_name": "default",
"name": "Research",
"color": "#4dabf5",
"description": "Research papers",
"created_at": "2026-03-01T08:00:00"
}
]
Assign Tag to Source
POST /api/v1/sources/{source_id}/tags/{tag_id}
Assign a tag to a source.
curl -X POST http://localhost:8080/api/v1/sources/src_abc123/tags/tag_001
| Parameter | Type | Required | Description |
|---|---|---|---|
source_id | string (path) | Yes | Source ID |
tag_id | string (path) | Yes | Tag ID |
Response 204 No Content
| Status | Description |
|---|---|
404 | Source or tag not found |
Remove Tag from Source
DELETE /api/v1/sources/{source_id}/tags/{tag_id}
Remove a tag from a source.
curl -X DELETE http://localhost:8080/api/v1/sources/src_abc123/tags/tag_001
| Parameter | Type | Required | Description |
|---|---|---|---|
source_id | string (path) | Yes | Source ID |
tag_id | string (path) | Yes | Tag ID |
Response 204 No Content
| Status | Description |
|---|---|
404 | Tag assignment not found |
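
The three assignment endpoints can be combined to bring a source's tags in line with a desired set. The sketch below only plans the requests: it compares the tag IDs currently assigned (from List Tags for Source) with the desired IDs and returns the method and path of each call to make. `plan_tag_sync` is a hypothetical helper, not part of the API.

```python
from typing import Iterable, List, Tuple


def plan_tag_sync(
    source_id: str,
    current_tag_ids: Iterable[str],
    desired_tag_ids: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (method, path) pairs that turn the current tag set into the desired one."""
    current, desired = set(current_tag_ids), set(desired_tag_ids)
    base = f"/api/v1/sources/{source_id}/tags"
    # Tags that are wanted but not yet assigned.
    to_assign = [("POST", f"{base}/{tag_id}") for tag_id in sorted(desired - current)]
    # Tags that are assigned but no longer wanted.
    to_remove = [("DELETE", f"{base}/{tag_id}") for tag_id in sorted(current - desired)]
    return to_assign + to_remove
```

Planning only the difference avoids a DELETE on a tag that was never assigned, which would return 404 per the table above.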
Page Images
Rendered page images are generated for PDF sources when vision processing is
enabled. Images are stored per-source under
data/databases/{db_name}/images/{source_id}/ and served directly by the API.
List Source Images
GET /api/v1/sources/{source_id}/images
Returns a list of available rendered page images for a source document. Returns an empty list if no images have been generated.
curl http://localhost:8080/api/v1/sources/src_abc123/images
| Parameter | Type | Required | Description |
|---|---|---|---|
source_id | string (path) | Yes | Source ID |
Response 200 OK -- list[object]
[
{
"filename": "page_1.png",
"url": "/sources/src_abc123/images/page_1.png"
},
{
"filename": "page_2.png",
"url": "/sources/src_abc123/images/page_2.png"
}
]
Images are sorted by filename. The url field is the path of the image on the
Get Source Image endpoint; in the example above it is shown without the
/api/v1 prefix.
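
If the filename sort is lexicographic (an assumption; the sort order is not specified further), page_10.png would come before page_2.png. A client that needs page order can re-sort numerically. The sketch below assumes filenames follow the page_N.png pattern shown in the example; `page_number` and `sort_by_page` are hypothetical helpers.

```python
import re
from typing import List

# Assumed filename pattern, taken from the example response: page_<N>.png
_PAGE_RE = re.compile(r"^page_(\d+)\.png$")


def page_number(filename: str) -> int:
    """Extract the numeric page index from a filename such as page_12.png."""
    match = _PAGE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected image filename: {filename!r}")
    return int(match.group(1))


def sort_by_page(images: List[dict]) -> List[dict]:
    """Return the image entries ordered by numeric page number."""
    return sorted(images, key=lambda image: page_number(image["filename"]))
```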
Get Source Image
GET /api/v1/sources/{source_id}/images/{filename}
Serve a specific rendered page image. Returns the image as image/png.
Path traversal is prevented -- the resolved path must remain within the
source image directory.
curl http://localhost:8080/api/v1/sources/src_abc123/images/page_1.png \
--output page_1.png
| Parameter | Type | Required | Description |
|---|---|---|---|
source_id | string (path) | Yes | Source ID |
filename | string (path) | Yes | Image filename (e.g. page_1.png) |
Response 200 OK -- PNG image binary (Content-Type: image/png)
| Status | Description |
|---|---|
403 | Path traversal attempt detected |
404 | Image file not found |
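
The containment rule described above ("the resolved path must remain within the source image directory") can be illustrated with pathlib. This is a sketch of the general technique, not the server's actual code; `resolve_image_path` is a hypothetical name, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_image_path(image_dir: Path, filename: str) -> Path:
    """Resolve filename inside image_dir, rejecting paths that escape the directory."""
    base = image_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / filename).resolve()
    if not candidate.is_relative_to(base):
        # A server would map this case to a 403 response.
        raise PermissionError("path traversal attempt detected")
    return candidate
```

Checking the resolved path, rather than scanning the raw string for "..", also covers absolute filenames and symlinks that point outside the directory.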
Response Models
SourceResponse
Full source detail model returned by the get, create, and update endpoints. It carries the lifecycle fields for the indexing, extraction, and commit stages, plus aggregated LLM metrics.
| Field | Type | Description |
|---|---|---|
id | string | Source ID |
database_name | string | Database this source belongs to |
filename | string | Original filename |
filepath | string? | Storage path |
file_type | string? | File type (e.g. pdf, csv) |
file_size | int? | File size in bytes |
title | string? | Display title |
source_type | string? | Source type (e.g. pdf, webpage) |
origin_url | string? | Original URL for web imports |
version | int | Version number (default 1) |
parent_id | string? | Parent source ID |
status | string | Lifecycle status. One of pending, indexing, indexed, extracting, extracted, committing, committed, error |
enabled | bool | Whether the source is active |
error_message | string? | Error description if status is error |
error_stage | string? | Stage where error occurred |
chunk_count | int | Number of chunks created |
total_content_length | int | Total character count of all chunks |
embedding_model | string? | Embedding model used |
embedding_dimensions | int? | Embedding vector dimensions |
indexing_started_at | datetime? | When indexing started |
indexing_completed_at | datetime? | When indexing finished |
indexing_duration_seconds | float? | Calculated indexing duration |
extraction_depth | string? | Extraction depth (full or quick) |
extraction_entities_count | int | Entities extracted |
extraction_relationships_count | int | Relationships extracted |
extraction_domain | string? | Domain used for extraction |
extraction_domain_auto | bool | Whether domain was auto-detected |
extraction_started_at | datetime? | When extraction started |
extraction_completed_at | datetime? | When extraction finished |
extraction_duration_seconds | float? | Calculated extraction duration |
current_extraction_job_id | string? | Active extraction job ID |
commit_started_at | datetime? | When commit started |
commit_completed_at | datetime? | When commit finished |
commit_duration_seconds | float? | Calculated commit duration |
commit_nodes_created | int | Graph nodes committed |
commit_edges_created | int | Graph edges committed |
commit_templates_created | int | Templates created |
current_step | int? | Current processing step number |
total_steps | int? | Total processing steps |
step_description | string? | Current step label |
llm_total_calls | int | Total LLM API calls |
llm_successful_calls | int | Successful LLM calls |
llm_failed_calls | int | Failed LLM calls |
llm_retry_calls | int | Retried LLM calls |
llm_first_try_successes | int | Calls that succeeded on first attempt |
llm_retry_successes | int | Calls that succeeded after retry |
llm_permanent_failures | int | Calls that permanently failed |
llm_total_input_tokens | int | Total input tokens consumed |
llm_total_output_tokens | int | Total output tokens generated |
llm_wasted_tokens | int | Tokens wasted on failed calls |
llm_avg_call_duration_ms | int? | Average call duration in ms |
llm_total_duration_ms | int | Total LLM call duration in ms |
llm_estimated_cost_usd | float? | Estimated cost in USD |
llm_error_counts | object? | Error type breakdown |
llm_model | string? | LLM model used |
created_at | datetime | Creation timestamp |
updated_at | datetime | Last update timestamp |
user_metadata | object? | User-defined metadata |
upload_options | object | Persisted upload settings — see UploadOptions |
quality_metrics | object | Per-stage quality counters and loader/search status — see QualityMetrics. Full reference: Quality Metrics API. |
vector_indexing_status | string | One of pending, indexed, degraded, failed. Mirrored at the top level for convenience; also lives inside quality_metrics. See Search Status. |
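
A client that polls a source can turn a few of these fields into a progress readout. The sketch below assumes timestamps arrive as ISO 8601 strings like those in the examples, that each duration is the completed time minus the started time, and that current_step counts up toward total_steps. Both helper names are hypothetical.

```python
from datetime import datetime
from typing import Optional


def stage_duration_seconds(started_at: Optional[str], completed_at: Optional[str]) -> Optional[float]:
    """Return completed_at minus started_at in seconds, or None if either is missing."""
    if started_at is None or completed_at is None:
        return None
    delta = datetime.fromisoformat(completed_at) - datetime.fromisoformat(started_at)
    return delta.total_seconds()


def progress_fraction(source: dict) -> Optional[float]:
    """Return current_step / total_steps, or None when progress is not reported."""
    step = source.get("current_step")
    total = source.get("total_steps")
    if step is None or not total:
        return None
    return step / total
```

Both helpers return None instead of raising when the nullable fields are absent, which matches how the `?` types in the table should be read.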