Explainable AI pipelines: every answer cites its sources
In healthcare and enterprise contexts, "the AI said so" is not an acceptable answer. We build RAG systems where every claim has a citation.
We were demoing a clinical decision support tool to a room of physicians when the chief medical officer asked the question we'd been designing for: "How do I know this isn't hallucinated?"
We clicked the citation link next to the AI's recommendation. It opened the source document -- a specific paragraph from a treatment protocol PDF, with the relevance score displayed. The CMO nodded. That's the moment explainability stopped being a feature and became the architecture.
The problem
Most AI systems are black boxes by default. An LLM generates a response, and the user either trusts it or doesn't. In consumer applications, this is tolerable. In healthcare, finance, legal, and enterprise contexts, it's disqualifying.
The requirements we hear from every enterprise client:
- Citation -- Every claim in the response must trace back to a source document
- Confidence scoring -- The system must indicate how certain it is
- Audit trail -- Every query, every retrieval, every generation must be logged
- Reproducibility -- The same query with the same documents should produce a consistent response
- Source transparency -- Users must be able to inspect the retrieved documents directly
RAG (Retrieval-Augmented Generation) gives you the architecture to meet all five. But only if you build it deliberately.
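Those five requirements map naturally onto a response contract. A minimal sketch of what that contract can look like (the field names are illustrative, not a spec):

```python
from dataclasses import dataclass, asdict


@dataclass
class Citation:
    index: int              # 1-based marker referenced in the answer text
    document: str           # source file the chunk came from
    page: int               # page within the source document
    relevance_score: float  # confidence: retriever similarity in [0, 1]


@dataclass
class GroundedResponse:
    answer: str       # generated text; every claim maps to a citation
    citations: list   # source transparency: inspectable evidence
    query_id: str     # audit trail: key into the query log
    model: str        # reproducibility: which model produced this
    timestamp: str    # audit trail: when the query was answered


resp = GroundedResponse(
    answer="Protocol X recommends dose adjustment [1].",
    citations=[Citation(1, "protocol_x.pdf", 12, 0.84)],
    query_id="demo-1",
    model="gpt-4",
    timestamp="2024-01-01T00:00:00Z",
)
```

Everything downstream -- the API shape, the audit log, the UI's citation links -- falls out of making this contract explicit up front.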
Our architecture
```
Documents (PDF, DOCX, structured data)
          |
          v
Chunking + Embedding (LlamaIndex)
          |
          v
Vector Storage (Qdrant, self-hosted)
          |
          v
Query API (FastAPI)
  |-- semantic search with metadata filtering
  |-- LLM generation with source grounding
  |-- citation extraction + confidence scoring
          |
          v
Response with citations, scores, and source links
```
Every component is open source. The entire pipeline runs on the client's infrastructure. No data leaves their network.
LlamaIndex for orchestration
We use LlamaIndex to manage the document-to-index-to-query pipeline. It handles chunking strategies, embedding generation, and the retrieval-generation bridge.
```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Load the source documents (path is illustrative)
documents = SimpleDirectoryReader("./protocols").load_data()

# Chunk documents into ~512-token segments with 50-token overlap
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Store in self-hosted Qdrant
qdrant_client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="clinical_protocols",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[parser],
)

# Query with citations
query_engine = index.as_query_engine(
    response_mode="tree_summarize",
    similarity_top_k=8,
)
```
The chunk size and overlap values aren't arbitrary. We tested 256, 512, and 1024 token chunks on our clinical document corpus. 512 with 50 overlap gave the best balance of retrieval precision (relevant chunks found) and answer quality (enough context per chunk for coherent generation). Smaller chunks fragmented treatment protocols mid-sentence; larger chunks diluted relevance scores.
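The metric behind that comparison is straightforward retrieval precision: of the chunks returned for a labeled test query, what fraction are actually relevant? A minimal sketch of the measurement (the helper name is ours, not a LlamaIndex API):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=8):
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant.

    retrieved_ids: chunk IDs in retriever rank order for one test query.
    relevant_ids:  set of chunk IDs a reviewer marked as relevant.
    """
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant_ids) / len(top)


# For each candidate configuration -- e.g. chunk_size 256/512/1024 with
# proportional overlap -- re-index the corpus, run the labeled queries,
# and average precision_at_k across them. Answer quality is judged
# separately by human review of the generated responses.
```

Averaging this over a physician-labeled query set is what let us compare the 256/512/1024 configurations on something firmer than intuition.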
FastAPI for the citation layer
The API doesn't just return an answer. It returns the evidence.
```python
from datetime import datetime, timezone
from uuid import uuid4

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    question: str


@app.post("/query")
async def query_with_citations(request: QueryRequest):
    response = query_engine.query(request.question)
    citations = []
    for i, node in enumerate(response.source_nodes):
        citations.append({
            "index": i + 1,
            "text": node.text[:500],
            "document": node.metadata.get("file_name"),
            "page": node.metadata.get("page_number"),
            "relevance_score": round(node.score, 4),
        })
    return {
        "answer": str(response),
        "citations": citations,
        "query_id": str(uuid4()),  # for audit trail
        "model": "gpt-4",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```
Every response includes the source text, document name, page number, and a relevance score between 0 and 1. The query_id links to a full audit record in PostgreSQL -- the original query, all retrieved chunks, the generated prompt, the raw LLM response, and the final formatted output.
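For illustration, the audit record can be as simple as one serializable structure per query, keyed by query_id. A sketch, assuming a JSONB column in PostgreSQL (field names here are illustrative, not our production schema):

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class AuditRecord:
    """One row per query. Shape is illustrative, not the production schema."""
    query_id: str
    query_text: str          # the original user query
    retrieved_chunks: list   # [{"chunk_id": ..., "text": ..., "score": ...}]
    prompt: str              # the exact prompt sent to the LLM
    raw_response: str        # unmodified model output
    final_output: dict       # the formatted response returned to the client


def to_audit_json(record: AuditRecord) -> str:
    """Serialize for insertion into a JSONB column."""
    return json.dumps(asdict(record), sort_keys=True)
```

The important property is completeness: with every intermediate artifact stored, any answer can be replayed end to end months later.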
What we learned in production
Running this for a 40,000-document clinical corpus taught us things the tutorials don't cover:
- Chunk metadata is the retrieval multiplier. Adding document type, department, and date range to chunk metadata and filtering on it during retrieval cut irrelevant results by 60%. Semantic search alone isn't enough for domain-specific corpora.
- Relevance score thresholds prevent hallucination. If no retrieved chunk scores above 0.72, we return "insufficient evidence" instead of generating an answer. This threshold was calibrated against physician-reviewed test queries over 3 weeks.
- Re-ranking matters. Initial retrieval with HNSW gets you candidates. A cross-encoder re-ranker (we use a fine-tuned model) reorders them by actual relevance. This step alone improved citation accuracy from 71% to 89%.
- Log everything. Every query generates an audit record. When a physician questions an AI recommendation, we can replay the exact retrieval and generation chain. This is not optional for healthcare.
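The relevance-score guard from the second point is only a few lines of code. A sketch, assuming retrieval returns (text, score) pairs; the 0.72 threshold is the calibrated value described above:

```python
SCORE_THRESHOLD = 0.72  # calibrated against physician-reviewed test queries


def guard_generation(retrieved, threshold=SCORE_THRESHOLD):
    """Refuse to generate unless at least one chunk clears the threshold.

    `retrieved` is a list of (chunk_text, relevance_score) pairs.
    Returns the chunks to ground generation on, or None to signal
    that the API should answer "insufficient evidence" instead.
    """
    grounded = [(text, score) for text, score in retrieved
                if score >= threshold]
    return grounded or None
```

When this returns None, the endpoint responds with "insufficient evidence" and never calls the LLM, so there is nothing to hallucinate from.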
The tradeoffs
- Latency. The full pipeline (retrieval + re-ranking + generation + citation extraction) takes 2.5-4 seconds per query. For interactive use, this is acceptable. For batch processing, we run async workers.
- Cost. LLM generation is the expensive component. We use GPT-4 for clinical contexts where accuracy is critical. For internal knowledge bases, GPT-4o-mini cuts the cost by 90% with acceptable quality.
- Maintenance. Document corpora change. New protocols are published, old ones are retired. We run a nightly ingestion pipeline that detects new and modified documents, re-chunks and re-embeds them, and updates the index. This is operational overhead that doesn't go away.
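In the nightly ingestion pipeline, change detection can be done with content hashes. A minimal sketch, assuming the pipeline keeps a {doc_id: hash} snapshot from the previous run (names are illustrative):

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def diff_corpus(previous: dict, current: dict):
    """Compare {doc_id: hash} snapshots between nightly runs.

    Returns (changed, retired): documents to re-chunk and re-embed,
    and documents whose vectors should be deleted from the index.
    """
    changed = [doc for doc, h in current.items() if previous.get(doc) != h]
    retired = [doc for doc in previous if doc not in current]
    return changed, retired
```

Only the changed set gets re-embedded, which keeps the nightly run cheap even on a large corpus; retired documents are removed so stale protocols can never be cited.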
Our recommendation
If your AI system operates in a domain where incorrect answers have consequences -- healthcare, legal, financial, compliance -- build the citation architecture from day one. Don't bolt it on later. The retrieval pipeline, confidence thresholds, and audit logging should be foundational, not features.
Use LlamaIndex for the orchestration layer, Qdrant for self-hosted vector storage, and FastAPI for a clean API surface. Run the whole thing on your infrastructure. When the auditor asks "where does this answer come from?" you should be able to show them the exact source paragraph, the relevance score, and the full query trace.