Ask the Archive

What is “Ask the Archive”?

I wanted a system that can answer questions about my work, but only from sources.

Private material can point the system toward the right page or timestamp, but it is designed so that unpublished, unreviewed text never reaches the part that writes the answer.

My podcast episodes are transcribed by speech-to-text models (local Whisper for solo episodes, OpenAI’s speaker-separated transcription for interviews) and then pass through human review. The review keeps a few things apart: me from my guests, claim from record, evidence from synthesis.

The goal is a system that admits uncertainty instead of laundering it into confidence. When the archive does not have enough to answer well, I would rather it say so plainly and send you to the record.

So it always points back to the accountable surface: pages, timestamps, citations, reviewed summaries. The sources are the record. The answer is only the concierge.

In other words, an attempt at answerability as interface design.

I want a site that works with whoever is asking, human or AI, to find the record and show its work, while refusing to overclaim.

How it’s built

The pipeline is deliberately plain. Most of the care goes into the boundaries, not the model.

Speech-to-text turns episode audio into timestamped text: solo episodes through Whisper running locally, interviews through OpenAI’s diarized transcription, guided by a keyterm list of names and titles so the proper nouns come out right. I review and label every transcript before any of it can become public, separating my words from my guests’ and marking anything I am unsure about as unverified.

The reviewed transcripts and the site’s pages are split into passages and embedded with OpenAI, then searched with a hybrid of exact matching and meaning, so a name search and a concept search both land. A model then writes a short answer from only the public, reviewed evidence it is handed, cites its sources, and is graded against a fixed set of test questions that check for the failures I care about most: inventing facts, putting a guest’s words in my mouth, or exposing anything private.
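As a rough illustration of what a hybrid of exact matching and meaning can look like, here is a minimal sketch. It assumes a simple weighted sum of a token-overlap score and cosine similarity. The blending weight `alpha`, the function names, and the tuple layout are all hypothetical; the production system may fuse the two signals differently.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a stand-in for whatever the real index uses."""
    return re.findall(r"[a-z0-9']+", text.lower())


def exact_score(query: str, passage: str) -> float:
    """Fraction of query tokens that appear verbatim in the passage."""
    q = tokenize(query)
    if not q:
        return 0.0
    p = set(tokenize(passage))
    return sum(1 for t in q if t in p) / len(q)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(query, query_vec, passages, alpha=0.5, k=3):
    """Rank passages by alpha * exact + (1 - alpha) * semantic.

    `passages` is a list of (passage_id, text, vector) tuples. `alpha` is a
    hypothetical blending weight, not a value taken from the production system.
    """
    scored = []
    for pid, text, vec in passages:
        score = alpha * exact_score(query, text) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, pid))
    # Highest score first; ties broken by passage id so results are stable.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [pid for _, pid in scored[:k]]
```

With `alpha=1.0` only verbatim matches count, which is what a name search wants; with `alpha=0.0` only the embedding similarity counts, which is what a concept search wants. A value in between lets both land.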

Podcast transcripts are part of the public archive. They may be retrieved and cited as records, but the system must not turn them into uncited private knowledge or persona-voice — guest speech is not my speech, and nothing becomes model prose without a citation.
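One way to picture that boundary is a gate between search and the answer writer. The sketch below is an assumption about shape, not the site's actual code: the `Passage` fields, the speaker labels, and the `[n]` citation format are hypothetical. Public, reviewed hits become citable evidence with a speaker label attached; everything else is reduced to a pointer with no text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str      # page path or episode + timestamp (hypothetical format)
    speaker: str     # "host" or "guest" (hypothetical labels)
    text: str
    public: bool
    reviewed: bool


def build_evidence(hits: list[Passage]) -> tuple[list[dict], list[str]]:
    """Split search hits into quotable evidence and text-free routing hints."""
    evidence, hints = [], []
    for hit in hits:
        if hit.public and hit.reviewed:
            evidence.append({
                "cite": f"[{len(evidence) + 1}]",  # marker the answer must use
                "source": hit.source,
                "speaker": hit.speaker,            # keeps guest speech attributed
                "text": hit.text,
            })
        else:
            # Private or unreviewed: pass along where to look, never the words.
            hints.append(hit.source)
    return evidence, hints
```

Only the `evidence` list would be handed to the model. The `hints` list carries locations alone, so a private hit can steer a reader toward a page or timestamp without its wording entering the prompt.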

If the answer layer goes down, search still works. The record does not depend on the model.
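A minimal sketch of that degradation path, assuming two hypothetical callables, `search` and `write_answer`, that stand in for the real retrieval and answer layers:

```python
def ask(question, search, write_answer):
    """Always return search results; add an answer only if the writer succeeds."""
    results = search(question)
    try:
        answer = write_answer(question, results)
    except Exception:
        # The answer layer is down or failed: degrade to plain search results.
        answer = None
    return {"answer": answer, "results": results}
```

Because `search` runs first and does not depend on the writer, a failure in the model call costs the prose answer but leaves the pointers to the record intact.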

At scale

The teaching-sized answer-engine repository runs the same contract over a synthetic corpus you can clone and check. This site’s production deployment is several orders of magnitude larger — and the figures are real but not reproducible, because the corpus is private.

The index holds 10,554 embedding vectors: 9,777 private passages from 210 podcast episodes (~123 hours), chunked into overlapping windows, and 777 public records — published pages, songs, albums, essays, letters, and per-episode notes. By words the split is starker: roughly 186,000 public against 1.75 million private. Those private words are searchable but never quotable; they reach the model only as routing hints.
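The counts above add up: 9,777 + 777 = 10,554. For readers unfamiliar with overlapping windows, here is a small sketch of the idea. It splits on whitespace and uses hypothetical defaults for `size` and `overlap`; the production chunker's actual window length, overlap, and unit (words, tokens, or seconds) are not stated here.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the one before it, so a sentence
    that straddles a boundary still appears whole in at least one passage.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap is also why passage counts grow faster than word counts: every shared word is embedded more than once.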

Almost all operating cost sits in the embedding index — its memory footprint and the latency of loading it into a stateless serving instance. Three levers reduce that cost, and none of them alters the contract: chunking long documents into passages (the dominant driver of index size here), quantizing vectors for transport (~616 MB to ~53 MB on the wire, with ranking checked against the gold suite before acceptance), and — held but not yet spent — Matryoshka dimension truncation. Every choice was admitted or rejected by the same fixed eval that governs grounding and refusal, not by intuition.
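To make two of those levers concrete, here is a hedged sketch in plain Python. It assumes symmetric per-vector int8 quantization, which is one common scheme and not necessarily the one used in production, and it shows Matryoshka truncation as keeping the leading dimensions and re-normalizing. Truncation of this kind is only meaningful for embedding models trained to support it.

```python
import math


def quantize_int8(vec: list[float]) -> tuple[float, list[int]]:
    """Symmetric per-vector int8 quantization: returns (scale, codes)."""
    peak = max((abs(x) for x in vec), default=0.0)
    scale = peak / 127 if peak else 1.0  # avoid dividing by zero
    return scale, [round(x / scale) for x in vec]


def dequantize(scale: float, codes: list[int]) -> list[float]:
    """Approximate reconstruction of the original vector."""
    return [c * scale for c in codes]


def truncate_matryoshka(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Each component is reconstructed to within half a quantization step, which is small but not zero. That is why a change like this would be accepted only after the rankings it produces are compared against a fixed test suite, as described above.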

The full design contract, evaluation harness, and these scaling notes are in the Answer Engine technical note (v1.1 · PDF · DOI).

Ask the Archive is the live deployment of Answer Engine on this site. The Answerability Quartet papers stand on their own; the runnable pattern (technical note, repository, and this pipeline) is set apart at /research/#implementation. The note is documented at Zenodo (v1.1 · DOI · lukefwalton.com · CC BY-NC-ND 4.0). Clone github.com/lukefwalton/answer-engine (DOI · lukefwalton.com · GitHub · Apache-2.0) to read the teaching-sized implementation. Production transcription and eval live in the ask-the-archive folder of github.com/lukefwalton/lukefwalton.com. A system that argues for answerability should let you check how it answers.

One question, one sourced answer. Try: