📚 Sage Knowledge Base Guide

This document focuses on app/server/services/knowledge_base, including the current architecture, indexing/retrieval flow, and extension points.

1. Module role

The knowledge base module currently handles three responsibilities:

Document sync: parse files, remove old records, and rebuild index data.
Retrieval: generate embeddings and run hybrid retrieval.
Agent integration: expose KB search via built-in MCP tool.

In the current version, Elasticsearch is the default vector store and full-text engine.

2. Folder layout

app/server/services/knowledge_base/
├── knowledge_base.py                 # DocumentService: external service entry
├── adapter/
│   ├── es_vector_store.py            # EsVectorStore: ES implementation of VectorStore
│   └── server_embedding_adapter.py   # Embedding adapter
└── parser/
    ├── base.py                       # Parser base class
    ├── common_parser.py              # Common file parser
    ├── qa_parser.py                  # QA parser
    └── eml_parser.py                 # Email/attachment parser

Core classes:

DocumentService: unified service interface for sync/insert/delete/search.
KnowledgeManager: retrieval-engine orchestrator (split/embed/store).
EsVectorStore: ES-backed VectorStore implementation.
ServerEmbeddingAdapter: server-side embedding adapter.

3. Ingestion flow

DocumentService.sync_document(...) executes full sync:

Pick parser by data_source.
Run clear_old to collect old document IDs.
Run process to build DocumentInput list.
Call KnowledgeManager.add_documents(...) for split/embed/store.

flowchart LR
    A[DocumentService.sync_document] --> B[Parser.clear_old/process]
    B --> C[DocumentService.doc_document_insert]
    C --> D[KnowledgeManager.add_documents]
    D --> E[DefaultSplitter split]
    E --> F[EmbeddingModel embed_documents]
    F --> G[EsVectorStore.add_documents]
    G --> H[(ES: index_doc/index_doc_full)]

4. Retrieval flow (current)

DocumentService.doc_search(index_name, question, query_size) path:

KnowledgeManager.search(...) generates query embedding.
EsVectorStore.search(...) runs in parallel:
- Vector KNN search (emb)
- BM25 search (doc_content)
SearchResultPostProcessTool performs RRF fusion + overlap merge.
The service enriches chunk results with full-doc fields.

flowchart LR
    A[doc_search] --> B[embed_query]
    B --> C[EsVectorStore.search]
    C --> D1[Vector KNN]
    C --> D2[BM25]
    D1 --> E[RRF Fusion]
    D2 --> E
    E --> F[Merge Overlap]
    F --> G[get_documents_by_ids]
    G --> H[search_results]

5. ES index model

Each KB index uses two ES indices:

{index_name}_doc: chunk-level index for retrieval.
{index_name}_doc_full: full-document index for enrichment.

_doc key fields:

doc_id, doc_segment_id
doc_content (BM25)
emb (dense vector)
metadata fields (start/end/path/title/main_doc_id)

_doc_full key fields:

doc_id
full_content
origin_content/path/title/metadata

6. How Agent calls KB

KB retrieval is exposed by built-in MCP tool retrieve_on_zavixai_db:

Tool function is defined in app/server/routers/kdb.py.
If request contains available_knowledge_bases, the system:
- adds retrieve_on_zavixai_db into available_tools.
- injects KB index_name into system_context.
Model issues a tool call.
Tool calls DocumentService.doc_search(...).
Result is written back as role=tool message for next-round reasoning.

7. Current boundaries

Backend is hard-bound to ES in DocumentService.
VectorStore.search has no scalar filter argument yet.
Scheduler initialization depends on es_url.

8. Troubleshooting checklist

No retrieval result:
- Check ES client initialization.
- Check {index_name}_doc exists and has records.
Uploaded but not searchable:
- Verify parser routing by data_source.
- Verify full sync flow completed.
Agent does not call KB:
- Verify available_knowledge_bases exists in agent config.
- Verify retrieve_on_zavixai_db appears in available_tools.