I Indexed 468 Files Into a Local Vector Database. Now My Laptop Answers Questions About My Business.

The Problem With Having Too Many Files

I have 468 files that define how my businesses operate. Skill files that tell AI how to connect to WordPress sites. Session transcripts from hundreds of Cowork conversations. Notion exports. API documentation. Configuration files. Project briefs. Meeting notes. Operational playbooks.

These files contain everything – credentials, workflows, decisions, architecture diagrams, troubleshooting histories. The knowledge is comprehensive. The problem is retrieval. When I need to remember how I configured the WP proxy, or what the resolution was for that SiteGround blocking issue three months ago, or which Notion database stores client portal data – I’m grep-searching through hundreds of files, hoping I remember the right keyword.

Grep works when you know exactly what you’re looking for. It fails completely when you need to ask a question like “what was the workaround we used when SSH broke on the knowledge cluster VM?” That’s a semantic query. It requires understanding, not string matching.

So I built a local vector search system. Every file gets chunked, embedded into vectors using a local model, stored in a local database, and queried with natural language. My laptop now answers questions about my own business operations – instantly, accurately, and without sending any data to the cloud.

The Architecture: Ollama + ChromaDB + Python

The stack is deliberately minimal. Three components, all running locally, zero cloud dependencies.

Ollama with nomic-embed-text handles the embedding. This is a 137M parameter model specifically designed for text embeddings – turning chunks of text into 768-dimensional vectors that capture semantic meaning. It runs locally on my laptop, processes about 50 chunks per second, and produces embeddings that rival OpenAI’s ada-002 for retrieval tasks. The entire model is 274MB on disk.

ChromaDB is the vector database. It’s an open-source, embedded vector store that runs as a Python library – no server process, no Docker container, no infrastructure. Data is persisted to a local directory. The entire 468-file index, with all embeddings and metadata, takes up 180MB on disk. Queries return results in under 100 milliseconds.

A Python script ties it together. The indexer walks through designated directories, reads each file, splits it into chunks of ~500 tokens with 50-token overlap, generates embeddings via Ollama, and stores them in ChromaDB with metadata (file path, chunk number, file type, last modified date). The query interface takes a natural language question, embeds it, searches for the 5 most similar chunks, and returns the relevant passages with source attribution.

What Gets Indexed

I index four categories of files:

Skills (60+ files): Every SKILL.md file in my skills directory. These contain operational instructions for WordPress publishing, SEO optimization, content generation, site auditing, Notion logging, and more. When I ask “how do I connect to the a luxury asset lender WordPress site?” the system retrieves the exact credentials and connection method from the wp-site-registry skill.

Session transcripts (200+ files): Exported transcripts from Cowork sessions. These contain the full history of decisions, troubleshooting, and solutions. When I ask “what was the fix for the WinError 206 issue?” it retrieves the exact conversation where we diagnosed and solved that problem – publish one article per PowerShell call, never combine multiple article bodies in a single command.

Project documentation (100+ files): Architecture documents, API documentation, configuration files, and project briefs. Technical reference material that I wrote once and need to recall later.

Notion exports (50+ files): Periodic exports of key Notion databases – the task board, client records, content calendars, and operational notes. This bridges the gap between Notion (where I plan) and local files (where I execute).

How the Chunking Strategy Matters

The most underrated part of building a RAG system is chunking – how you split documents into pieces before embedding them. Get this wrong and your retrieval is useless regardless of how good your embedding model is.

I tested three approaches:

Fixed-size chunks (500 tokens): Simple but crude. Splits mid-sentence, mid-paragraph, sometimes mid-code-block. Retrieval accuracy was around 65% on my test queries – too many chunks lacked enough context to be useful.

Paragraph-based chunks: Split on double newlines. Better for prose documents but terrible for skill files and code, where a single paragraph might be 2,000 tokens (too large) or 10 tokens (too small). Retrieval accuracy improved to about 72%.

Semantic chunking with overlap: Split at ~500 tokens but respect sentence boundaries, and include 50 tokens of overlap between consecutive chunks. This means the end of chunk N appears at the beginning of chunk N+1, providing continuity. Additionally, each chunk gets prepended with the document title and the nearest H2 heading for context. Retrieval accuracy jumped to 89%.

The overlap and heading prepend were the critical improvements. Without overlap, answers that span two chunks get lost. Without heading context, a chunk about “connection method” could be about any of 18 sites – the heading tells the model which site it’s about.

Real Queries I Run Daily

This isn’t a science project. I use this system every day. Here are actual queries from the past week:

“What are the credentials for the an events platform WordPress site?” – Returns the exact username (will@engagesimply.com), app password, and the note that an events platform uses an email as username, not “Will.” Found in the wp-site-registry skill file.

“How does the 247RS GCP publisher work?” – Returns the service URL, auth header format, and the explanation that SiteGround blocks all direct and proxy calls, requiring the dedicated Cloud Run publisher. Pulled from both the 247rs-site-operations skill and a session transcript where we built it.

“What was the disk space issue on the knowledge cluster VM?” – Returns the session transcript passage about SSH dying because the 20GB boot disk filled to 98%, the startup script workaround, and the IAP tunneling backup method we configured afterward.

“Which sites use Flywheel hosting?” – Returns a list: a flooring company (a flooring company.com), a live comedy platform (a comedy streaming site), an events platform (an events platform.com). Cross-referenced across multiple skill files and assembled by the retrieval system.

Each query takes under 2 seconds – embedding the question (~50ms), vector search (~80ms), and displaying results with source file paths. No API call. No internet required. No data leaves my machine.

Why Local Beats Cloud for This Use Case

Security is absolute. These files contain API credentials, client information, business strategies, and operational playbooks. Uploading them to a cloud embedding service – even a reputable one – introduces a data handling surface I don’t need. Local means the data never leaves the machine. Period.

Speed is consistent. Cloud API calls for embeddings add 200-500ms of latency per query, plus they’re subject to rate limits and service availability. Local embedding via Ollama is 50ms every time. When I’m mid-session and need an answer fast, consistent sub-second response matters.

Cost is zero. OpenAI charges .0001 per 1K tokens for ada-002 embeddings. That sounds cheap until you’re re-indexing 468 files (roughly 2M tokens) every week – .20 per re-index, /year. Trivial in isolation, but when every tool in my stack has a small recurring cost, they compound. Local eliminates the line item entirely.

Availability is guaranteed. The system works on an airplane, in a coffee shop with no WiFi, during a cloud provider outage. My operational knowledge base is always accessible because it runs on the same machine I’m working on.

Frequently Asked Questions

Can this replace a full knowledge management system like Confluence or Notion?

No – it complements them. Notion is where I create and organize information. The local vector system is where I retrieve it instantly. They serve different functions. Notion is the authoring environment; the vector database is the search layer. I export from Notion periodically and re-index to keep the retrieval system current.

How often do you re-index the files?

Weekly for a full re-index, which takes about 4 minutes for all 468 files. I also run incremental indexing – only re-embedding files modified since the last index – as part of my daily morning script. Incremental indexing typically processes 5-15 files and takes under 30 seconds.

What hardware do you need to run this?

Surprisingly modest. My Windows laptop has 16GB RAM and an Intel i7. The nomic-embed-text model uses about 600MB of RAM while running. ChromaDB adds another 200MB for the index. Total memory overhead: under 1GB. Any modern laptop from the last 3-4 years can handle this comfortably. No GPU required for embeddings – CPU performance is more than adequate.

How does this compare to just using Ctrl+F or grep?

Grep finds exact text matches. Vector search finds semantic matches. If I search for “SiteGround blocking” with grep, I find files that contain those exact words. If I search for “why can’t I connect to the a restoration company site” with vector search, I find the explanation about SiteGround’s WAF blocking API calls – even though the passage might not contain the words “connect” or “a restoration company site” explicitly. The difference is understanding context vs. matching strings.

The Compound Effect

Every file I create makes the system smarter. Every session transcript adds to the searchable history. Every skill I write becomes instantly retrievable. The vector database is a living index of accumulated operational knowledge – and it grows automatically as I work.

Three months ago, the answer to “how did we solve X?” was “let me search through my files for 10 minutes.” Today, the answer takes 2 seconds. Multiply that time savings across 20-30 lookups per week, and the ROI is measured in hours reclaimed – hours that go back into building, not searching.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *