01 / WEB-AUDIO / 2026-05-18
Ask Michael
EXPERIMENTAL: chat with AI Michael. Matched sentences play the real YouTube clip; the rest are synthesized in his cloned voice.
// why this exists
real-clip retrieval against Michael's YouTube archive plus voice-cloned fallback
How it works
The pipeline runs in three pieces.
(1) scripts/transcribe-youtube-archive.mjs walks src/data/youtube-videos.ts, downloads audio via yt-dlp, runs whisper.cpp small.en, and writes data/transcripts/<videoId>.json with sentence-level timings.
(2) scripts/embed-transcripts.mjs reads those files, calls OpenAI text-embedding-3-small per sentence, and upserts into the michael_transcript_embeddings table with an HNSW cosine index.
(3) At runtime the widget hits /api/ask-michael for the streamed reply, splits the response into sentences, and fires /api/ask-michael/search-clip for each one.
Matches above 0.78 cosine similarity render as deep-linked YouTube clips; the rest get a "play synthesized voice" button that proxies through /api/ask-michael/tts to ElevenLabs.
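Step (3) depends on knowing when a sentence in the streamed reply is complete. The sketch below shows one way this could work; it is not the widget's actual code. SentenceBuffer is a hypothetical helper, and its boundary rule (terminal punctuation followed by whitespace) is an assumption that will mis-split abbreviations such as "e.g. this".

```typescript
// Hypothetical helper: accumulates streamed text chunks and emits each
// sentence once its terminator (. ! ?) followed by whitespace has arrived.
class SentenceBuffer {
  private pending = "";

  // Feed one streamed chunk; returns the sentences completed by it.
  push(chunk: string): string[] {
    this.pending += chunk;
    const done: string[] = [];
    // Terminal punctuation, optional closing quote/bracket, then whitespace.
    const boundary = /[.!?]+["')\]]*\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.pending.slice(start, end).trim();
      if (sentence.length > 0) done.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next chunk.
    this.pending = this.pending.slice(start);
    return done;
  }

  // Call when the stream ends to release the trailing sentence, if any.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each sentence returned by push() or flush() would then be sent to /api/ask-michael/search-clip. Because a boundary needs trailing whitespace, a decimal such as 0.78 does not end a sentence.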
Phase 1 of the TTS-with-clip-matching experiment. Ask AI Michael any question about the work, the stack, or the build process. Claude Haiku streams the reply. Each completed sentence in the response is embedded and run against an index of every sentence Michael has ever said on YouTube (149 videos, sentence-level transcripts, OpenAI text-embedding-3-small, pgvector cosine similarity). When a sentence matches a real clip with high confidence, the widget links straight to the YouTube moment instead of synthesizing. When it does not, ElevenLabs synthesizes the sentence in Michael's cloned voice on demand.
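The pgvector lookup described above might look roughly like the following. This is a sketch under stated assumptions, not the project's actual query. The table name michael_transcript_embeddings comes from the pipeline description, but the column names (video_id, start_seconds, sentence, embedding) and the $1 placeholder style are assumptions. pgvector's `<=>` operator returns cosine distance, so similarity is one minus that value.

```typescript
// Assumed column names: video_id, start_seconds, sentence, embedding.
// `<=>` is pgvector's cosine distance operator; 1 - distance = similarity.
const SEARCH_CLIP_SQL = `
  SELECT video_id, start_seconds, sentence,
         1 - (embedding <=> $1::vector) AS similarity
  FROM michael_transcript_embeddings
  ORDER BY embedding <=> $1::vector
  LIMIT 1
`;

// pgvector accepts a vector as text of the form "[0.1,0.2,...]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) {
    throw new Error("embedding must not be empty");
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("embedding must contain only finite numbers");
  }
  return `[${embedding.join(",")}]`;
}
```

Ordering by the same distance expression that the HNSW cosine index covers lets Postgres use that index for the nearest-neighbor lookup. The caller would pass toVectorLiteral(queryEmbedding) as the single query parameter.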
Frequently asked questions
Why match against real clips at all?
A synthetic voice is convincing, but it is not the real moment. When AI Michael says something he has actually said, the real clip is more honest and more interesting than the synth.
What is the matching threshold?
A cosine similarity above 0.78 between the text-embedding-3-small embedding of the reply sentence and that of a transcript sentence. The threshold is tuned conservatively so that close-but-not-quite matches fall through to TTS rather than misattributing a clip.
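As a minimal sketch of that rule, the snippet below computes cosine similarity between two embeddings and applies the 0.78 cutoff. The names CLIP_THRESHOLD, cosineSimilarity, and choosePlayback are hypothetical. In the pipeline described above the similarity is computed by pgvector rather than in application code, so this only illustrates the arithmetic.

```typescript
// Threshold from the text: strictly above 0.78 counts as a clip match.
const CLIP_THRESHOLD = 0.78;

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as no match.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

type Playback = "clip" | "tts";

// Anything at or below the threshold falls through to synthesized voice.
function choosePlayback(similarity: number): Playback {
  return similarity > CLIP_THRESHOLD ? "clip" : "tts";
}
```

A score of exactly 0.78 falls through to TTS here, matching the "above 0.78" wording.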
Where do the transcripts come from?
They are generated by whisper.cpp with the small.en model, running locally on Michael's machine. The transcripts live in data/transcripts/ as plain JSON and are regenerated when new videos publish.
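The exact JSON schema is not documented here, so the types below are an assumed shape for a sentence-level transcript. All field names are hypothetical, as is the clipUrl helper, which turns a matched sentence into a deep link using YouTube's t= start-time parameter.

```typescript
// Assumed shape of data/transcripts/<videoId>.json; field names are hypothetical.
interface TranscriptSentence {
  text: string;
  start: number; // seconds from the start of the video
  end: number;   // seconds from the start of the video
}

interface TranscriptFile {
  videoId: string;
  sentences: TranscriptSentence[];
}

// Builds a YouTube link that starts playback at the sentence's start time.
// YouTube expects whole seconds, so the timestamp is rounded down.
function clipUrl(videoId: string, startSeconds: number): string {
  const t = Math.max(0, Math.floor(startSeconds));
  return `https://www.youtube.com/watch?v=${encodeURIComponent(videoId)}&t=${t}s`;
}

// Example with made-up data, for illustration only.
const example: TranscriptFile = {
  videoId: "abc123",
  sentences: [{ text: "Hello there.", start: 12.7, end: 14.1 }],
};
const exampleLink = clipUrl(example.videoId, example.sentences[0].start);
```

Rounding down rather than to the nearest second keeps the first word of the sentence from being clipped.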
What about voice consent?
The cloned voice is Michael's own, cloned through ElevenLabs on his own account using his own audio. No other voices, no impersonation.
Why EXPERIMENTAL?
The infrastructure is built; the index is still small while transcription catches up. Expect rough edges until the archive is fully indexed and the threshold is tuned per category.

