Neural Vault Architecture: Running AI on Sensitive Documents Without Touching the Cloud

A homeopathic doctor in Mumbai has spent 20 years building a knowledge base inside her patient notes. When a new patient presents with recurring joint pain and anxiety, she wants to ask: what treatment worked for a similar case in 2019?

She cannot upload those notes to ChatGPT. Patient data. Privacy law. Full stop.

Neural Vault is Artifact 1 from AI Engineer HQ Cohort 1. It is an offline-first RAG system that runs AI inference entirely on a local machine. No API calls. No cloud storage. No data leaves the device.

This is the architecture behind it.

The Constraint That Shaped Everything

The design constraint was absolute: zero network calls during inference. This ruled out every hosted LLM. It ruled out any embedding model that requires an API key. It ruled out any hosted vector database, including a cloud-hosted ChromaDB.

Everything had to run on a MacBook Air with 8GB of RAM. That constraint is not a weakness. It is a product feature: running entirely on the user's own laptop is what makes the product trustworthy to lawyers, doctors, and defense professionals.

The Four-Component System

[Figure: Neural Vault architecture diagram]

Component 1: Document Processor

Entry point for all documents. Currently handles PDFs via pypdf. The processor does two things: extracts raw text page by page, then chunks it into 1000-character segments with 200-character overlap.

The overlap is not optional. Without it, a sentence that spans a chunk boundary loses half its context when retrieved. The retriever gets a fragment. The model gets a bad answer.

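def chunk_text(self, text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # The step must be positive, otherwise the loop below never advances.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break  # the tail is already covered; skip a redundant final chunk
        start += step
    return chunks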

Version 1 limitation we knew going in: no page tracking. The chunks lose their source page. Version 2 adds page numbers as metadata. Every professional use case needs citations. "Here is the answer" is not enough. "Here is the answer, from page 14 of the intake form" is.
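A minimal sketch of what page-aware chunking could look like, not the Version 2 implementation. It assumes the page texts have already been extracted (for example, one string per page from pypdf) and that each chunk is stored as a small dict. The function name chunk_pages and the "text" and "page" keys are hypothetical.

```python
def chunk_pages(page_texts: list[str], chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Chunk each page separately and tag every chunk with its 1-based page number."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        start = 0
        while start < len(text):
            chunk = text[start:start + chunk_size].strip()
            if chunk:
                # "page" is the metadata a citation would be built from.
                records.append({"text": chunk, "page": page_number})
            if start + chunk_size >= len(text):
                break  # this page is fully covered
            start += step
    return records
```

Chunking page by page keeps the citation exact, at the cost of losing overlap across page boundaries. A sentence that runs from one page onto the next would still be split.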

Component 2: Vector Store

Uses all-MiniLM-L6-v2 for embeddings. 100MB. Runs on CPU. Converts any text to a 384-dimensional vector in under 50ms.

Storage is ChromaDB with PersistentClient. The chroma_db/ folder on disk means documents survive app restarts. You process a document once. The embeddings persist. Subsequent sessions do not re-process.

The collection uses cosine similarity search with HNSW indexing. At 10,000 chunks, search returns in under 100ms.

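import chromadb

# PersistentClient writes the index to ./chroma_db so it survives app restarts.
self.client = chromadb.PersistentClient(path="./chroma_db")
self.collection = self.client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # rank results by cosine distance
)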
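To make the "cosine" setting concrete, here is a small pure-Python sketch of the quantity being ranked: cosine distance, which is 1 minus cosine similarity. It is a brute-force illustration only. The real collection uses an approximate HNSW index instead of comparing against every vector, and the function names here are hypothetical.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query: list[float], vectors: list[list[float]], n_results: int = 6) -> list[int]:
    """Return the indices of the n_results vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:n_results]
```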

The same embedding model that indexes documents must be used to embed queries. This is not optional. Different embedding models produce vectors in different spaces. A query embedded with model A against an index built with model B returns garbage.
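One way to enforce this rule is to record the model name next to the index and refuse to run when it changes. The sketch below is an assumption about how that could be done, not part of Neural Vault: the manifest file name embedding_model.json and the function check_embedding_model are hypothetical.

```python
import json
from pathlib import Path

EMBEDDING_MODEL = "all-MiniLM-L6-v2"

def check_embedding_model(index_dir: str, model_name: str = EMBEDDING_MODEL) -> None:
    """Record the embedding model on first use; fail loudly if it later changes."""
    manifest = Path(index_dir) / "embedding_model.json"  # hypothetical file name
    if manifest.exists():
        recorded = json.loads(manifest.read_text())["model"]
        if recorded != model_name:
            raise RuntimeError(
                f"Index was built with {recorded!r} but queries would use {model_name!r}. "
                "Re-index the documents or switch back to the original model."
            )
    else:
        manifest.parent.mkdir(parents=True, exist_ok=True)
        manifest.write_text(json.dumps({"model": model_name}))
```

Calling this once at startup, before any query, turns a silent relevance failure into an explicit error.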

Component 3: LLM Engine

The most constrained component. The model has to fit in 2GB of RAM, run on CPU without a GPU, and generate useful answers in under 5 seconds per query.

We use Llama-3.2-3B-Instruct at Q4_K_M quantization. Weights are stored in roughly 4 to 5 bits each instead of 32-bit floats. That is roughly a 6x size reduction: a 12GB model becomes 2GB. A MacBook Air can run it.
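The size arithmetic is just parameters times bits per weight. The helper below is a back-of-the-envelope sketch, assuming a round 3 billion parameters and ignoring file metadata. The function name is hypothetical.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp32_gb = model_size_gb(3e9, 32)   # 12.0 GB at 32-bit floats
pure_4bit_gb = model_size_gb(3e9, 4)  # 1.5 GB if every weight were exactly 4 bits
```

A pure 4-bit model would be 1.5GB. Q4_K_M keeps some tensors at higher precision, so its average is somewhat above 4 bits per weight, which is one plausible reason the file lands nearer 2GB.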

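from llama_cpp import Llama

self.model = Llama(
    model_path=self.model_path,  # path to the local GGUF file
    n_ctx=4096,                  # context window, in tokens
    n_gpu_layers=0,              # 0 = run every layer on CPU
    verbose=False
)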

n_ctx=4096 is the context window. Our 6 retrieved chunks total roughly 1,500 tokens (6,000 characters at about 4 characters per token). The question adds another 50. The remaining 2,500 or so tokens are available for the prompt template and the model's response. This fits comfortably.
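The budget can be checked with a few lines of arithmetic. This sketch assumes the common rule of thumb of about 4 characters per token for English text; the real count depends on the tokenizer and the document. The function name is hypothetical.

```python
def response_budget(n_ctx: int, n_chunks: int, chunk_chars: int,
                    question_tokens: int, chars_per_token: float = 4.0) -> int:
    """Estimate how many tokens remain for the answer after context and question."""
    context_tokens = int(n_chunks * chunk_chars / chars_per_token)
    return n_ctx - context_tokens - question_tokens

remaining = response_budget(n_ctx=4096, n_chunks=6, chunk_chars=1000, question_tokens=50)
# remaining == 2546
```

A negative result would mean the prompt alone overflows the window, which is the signal to retrieve fewer chunks or shrink them.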

n_gpu_layers=0 keeps everything on CPU. On Apple Silicon, you can set this to -1 to offload all layers to the GPU through Metal. Inference time drops from ~4 seconds to ~1.5 seconds per query on M1 Pro.

Responses stream token by token. The user sees words appear as they are generated. This matters for UX more than raw speed. A 4-second response that streams feels faster than a 2-second response that waits before printing.
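A minimal sketch of the streaming loop, assuming the llama-cpp-python behavior where calling the model with stream=True yields completion chunks whose text sits at ["choices"][0]["text"]. The function takes the model as an argument so it can be exercised with a stand-in, and max_tokens=512 is an arbitrary assumption.

```python
from typing import Any, Callable, Iterator

def generate_stream(model: Callable[..., Any], prompt: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield text fragments as the model produces them."""
    for part in model(prompt, max_tokens=max_tokens, stream=True):
        text = part["choices"][0]["text"]
        if text:  # skip empty fragments
            yield text
```

In a Streamlit app, each yielded fragment would be appended to the text already on screen, so the first words appear well before generation finishes.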

Component 4: RAG Pipeline

The orchestrator. Connects the other three components.

def query(self, question: str):
    relevant_chunks = self.vector_store.query(question, n_results=6)
    prompt = self.build_prompt(question, relevant_chunks)
    return self.llm.generate_stream(prompt)

def build_prompt(self, question: str, chunks: list) -> str:
    context = "\n".join(chunks)
    return f"""You are a helpful assistant. Answer the question using only the context provided below.

Context:
{context}

Question: {question}

Answer:"""

The prompt constraint "using only the context provided below" is the hallucination guard. The model is explicitly instructed to answer from the context you gave it, not from what it learned in training. An instruction is a soft guard rather than a guarantee, but grounding every answer in retrieved text is what makes RAG trustworthy for sensitive professional documents.
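The guard can be tightened a little in code. The variant below is a sketch, not the Neural Vault prompt: it numbers each chunk so an answer can point back to its source, and it returns None when retrieval finds nothing so the caller can skip the model entirely. The function name and the exact wording of the instructions are assumptions.

```python
from typing import Optional

def build_cited_prompt(question: str, chunks: list[str]) -> Optional[str]:
    """Build a prompt with numbered context, or None if there is no context."""
    if not chunks:
        # Nothing retrieved: better to say so than to let the model improvise.
        return None
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "You are a helpful assistant. Answer the question using only the context "
        "provided below. Cite the bracketed number of each passage you rely on. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
```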

The Architecture Decisions That Were Not Obvious

Why not fine-tuning? Fine-tuning trains the model on your documents and changes its weights. For a privacy tool, this creates a permanent record of your data inside the model. RAG keeps your data in the retrieval layer only. Each query is fresh. Nothing persists in the model's parameters. That is the right tradeoff for this use case.

Why ChromaDB over FAISS? FAISS is a similarity-search library: saving and loading index files, and keeping metadata alongside the vectors, is left to the application. ChromaDB handles persistence, metadata filtering, and HNSW indexing automatically. For a desktop product used by professionals who are not ML engineers, ChromaDB's abstractions matter. With FAISS, that index-management code would be ours to write and maintain. That is not a viable path for this product.

Why Streamlit? The fastest path to a working desktop UI for Python. One file. No frontend development. The entire UI is app.py. For a cohort running on 2-week artifact cycles, this is the right call. The tradeoff is limited customization. A production version would move to a proper desktop framework. For a portfolio artifact with real users, Streamlit is correct.

Memory Profile During a Query

Here is what happens to RAM when someone asks a question:

Model weights (Llama 3.2 Q4): 2.0 GB, constant
Embedding model (MiniLM): 0.1 GB, constant after first load
KV cache during inference: 0.2-0.4 GB, temporary per query, freed after
ChromaDB index in RAM: 0.1 GB for a typical document set

Peak during query: approximately 2.7 GB, slightly above the 2.4 to 2.6 GB the components above add up to. A MacBook Air with 8GB handles this while running a browser, Slack, and VS Code.
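The KV cache line is the one that scales with context length, and it can be estimated from the model's shape. The sketch below assumes 28 layers, 8 key/value heads, a head dimension of 128, and 16-bit cache entries. These are figures I believe match the published Llama 3.2 3B configuration, but they are assumptions here and should be checked against the model's own config.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 28, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

typical = kv_cache_bytes(1600)  # 183,500,800 bytes, about 0.18 GB
full = kv_cache_bytes(4096)     # 469,762,048 bytes, about 0.47 GB
```

Under these assumptions the estimate lands in the same range as the 0.2-0.4 GB above, and it shows why doubling n_ctx roughly doubles this line of the budget.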

Known Limitations in Version 1

Scanned PDFs fail silently. A scanned PDF is an image of text. pypdf cannot read it. The app processes it, chunks empty strings, and returns no results when queried. This is not a crash. It is a silent failure, which is worse. Version 2 adds an OCR detection layer that checks whether extracted text is empty and warns the user before indexing.
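A minimal sketch of such a check, assuming the per-page text has already been extracted into a list of strings. The function name and both thresholds are hypothetical and would need tuning on real documents.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20,
                  max_empty_ratio: float = 0.5) -> bool:
    """Return True when most pages yielded little or no extractable text."""
    if not page_texts:
        return True  # no pages at all: nothing can be indexed
    empty_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > max_empty_ratio
```

Running this before indexing lets the app warn the user instead of quietly storing nothing.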

No multi-document context. Six chunks are retrieved per query, regardless of how many documents are indexed. If the answer spans two different documents, the retriever may not surface both. Version 2 introduces source-aware retrieval that guarantees representation from multiple documents when the index spans more than one.
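One possible shape for source-aware retrieval, sketched under assumptions rather than taken from Version 2: fetch more candidates than needed, take the best chunk from each document first, then fill the remaining slots by rank. It assumes candidates arrive sorted best first as dicts with a "source" key. The function name is hypothetical.

```python
def select_diverse(candidates: list[dict], n_results: int = 6) -> list[dict]:
    """Pick n_results chunks, covering as many distinct sources as possible."""
    picked: list[int] = []
    seen_sources = set()
    # First pass: the best-ranked chunk from each source.
    for i, candidate in enumerate(candidates):
        if len(picked) < n_results and candidate["source"] not in seen_sources:
            picked.append(i)
            seen_sources.add(candidate["source"])
    # Second pass: fill any remaining slots with the best leftovers.
    for i in range(len(candidates)):
        if len(picked) >= n_results:
            break
        if i not in picked:
            picked.append(i)
    # Return in the original relevance order.
    return [candidates[i] for i in sorted(picked)]
```

Every document is represented only when the number of distinct sources among the candidates does not exceed n_results; beyond that, the highest-ranked sources win.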

No cost. The app is free to run. That is the point. Zero API costs, zero per-query charges. The whole value proposition depends on this remaining true.

What 50 People Learn by Building This

Every AI Engineer HQ Cohort 1 member built Neural Vault as Artifact 1. By the end of week 2, all 50 members could explain quantization, build a RAG pipeline from scratch, describe why HNSW indexing is necessary, and reason about the memory tradeoffs of different context window sizes.

That is not theory. Those are the exact questions engineering interviews ask.

If you are building AI systems in production and want to be part of the next cohort, AI Engineer HQ Cohort 2 applications are open.