How I Built Sancho — A Claude-Powered Robot from Scratch

I named the robot Sancho. Not because I was going for cute. Because the name is philosophically accurate — but I’ll get to that in another post.

What I want to explain here is the build. How a physical robot came to live in my house, respond to my voice, control my lights, remember our conversations, and stare at me with animated eyes that actually react when it’s thinking.

This is not a demo. It runs. It’s sitting in my living room right now.

Why build a robot at all

I work in software. I build cloud systems, billing platforms, AI integrations. Everything I make lives in a browser or a terminal. The stakes are high in production, but the feedback loop is abstract — logs, metrics, dashboards.

A robot changes that completely.

When software controls something physical, every bug has a physical consequence. A misconfigured tool call doesn’t just throw an error — it might make the robot spin in a circle, or not stop, or speak out loud at 2am. The constraints are real: battery life, latency, heat, sensor noise, motor tolerances. There’s no “it works on my machine” because your machine is right there on the floor.

I wanted to push against those limits. I wanted to understand what the Claude SDK can actually do when the environment is physical and the stakes are tangible.

The name: Sancho Panza

Sancho Panza is Don Quixote’s squire in Cervantes’ novel. The Don is a dreamer — he sees giants where there are windmills. Sancho is the opposite: grounded, practical, loyal. He follows the Don not because he believes in the delusions, but because he’s useful, present, and genuinely cares about getting things done.

That’s what I wanted from this robot. Not an AI that hallucinates, over-promises, or tries to be impressive. One that’s actually useful. That listens, remembers, acts, and doesn’t burn the house down.

Claude turned out to be the right model for that — but more on that later.

Hardware: what I actually used

I didn’t start from scratch in the sense of machining parts. But I did make deliberate choices about every component.

Chassis: Osoyoo Flexirover robot car kit. It’s a four-wheel drive platform with enough room to mount a Raspberry Pi 5, a camera, a screen, and a speaker without the whole thing collapsing. I reverse-engineered a Roomba battery to power it — higher capacity, better discharge curve than the kit battery, and it fits with some bracket modification.

Compute: Raspberry Pi 5 (8GB). This matters. The Pi 4 couldn’t handle the real-time audio processing alongside the WebSocket streams and model inference. The Pi 5 has enough headroom to run faster-whisper locally, handle the FastAPI backend, and still not drop frames on the display.

Camera: Raspberry Pi Camera Module 3. Wide-angle lens. Used for computer vision tasks and for giving the robot a “face” in the dashboard UI.

Display: Small HDMI screen mounted on the front, running the animated eyes UI. The eyes are an SVG animation driven by the robot’s current state — idle, thinking, speaking, surprised. People underestimate how much this matters. The eyes make it feel like something is home.

Audio: USB microphone for input, small speaker for output. Simple. The audio pipeline is what’s interesting, not the hardware.

The voice pipeline: no cloud, no latency

I made one hard architectural decision early: the voice pipeline would run locally, offline. No cloud STT. No API calls for speech synthesis.

The reasons are practical:

Latency. Round-tripping audio to a cloud service adds 300–800ms on a good day. That’s noticeable in conversation.
Reliability. A robot that loses its voice when the internet hiccups is useless.
Privacy. Everything said in the room goes through the microphone.

Speech-to-text: faster-whisper. This is a reimplementation of OpenAI’s Whisper model using CTranslate2, optimized for CPU inference. On the Pi 5, the tiny model transcribes a short sentence in under 200ms. Good enough for natural conversation.

Text-to-speech: Piper. Piper is an offline TTS system developed by Rhasspy. It runs entirely on-device, produces reasonably natural speech, and has enough voice options that Sancho doesn’t sound like a 1990s text reader. I picked a voice that feels calm and deliberate — it matches the Sancho personality.

The pipeline: audio captured in chunks → VAD (voice activity detection) to detect speech start/end → faster-whisper transcribes → result sent to FastAPI backend → Claude processes → response streamed back → Piper synthesizes → plays through speaker. End-to-end latency in a quiet room: under 1.5 seconds. That’s livable.

The backend: FastAPI doing everything

The backend is a FastAPI application running on the Pi. It’s the center of everything — it handles:

Audio stream management
Tool dispatch and registration
WebSocket connections for the frontend dashboard
Home Assistant API calls
Claude SDK integration

I kept it as a single FastAPI app deliberately. Microservices on a Raspberry Pi is a fantasy — you don’t have the RAM budget, and the network round-trips between local services add up. One process, one event loop, async everywhere.

The Claude SDK integration is where it gets interesting.

How Claude is wired in

Every conversation goes through the Anthropic SDK’s messages API. Claude has access to tools. A lot of tools.

The architecture I settled on is tool auto-discovery. Any Python file I drop into the tools/ directory is automatically registered and available to Claude on the next restart. The registration code inspects each file, extracts the function signature and docstring, and generates the tool schema dynamically.

This matters because it removes friction from adding new capabilities. When I want to give Sancho a new ability, I write one Python file. I don’t touch the core system. I don’t update a registry. I just drop the file and restart.

Current tools:

lights.py — Home Assistant calls to control lights by room, scene, brightness
music.py — Spotify/local playback control via Home Assistant
web_search.py — DuckDuckGo search, scrapes and summarizes results
whatsapp.py — sends messages via WhatsApp Web automation
memory_read.py / memory_write.py — read and write to ChromaDB semantic memory
timer.py — sets countdown timers with audio alerts
notes.py — quick notes to PostgreSQL, recallable by topic

Each tool has a clear docstring that tells Claude what it does, what the parameters mean, and when to use it. That docstring is the interface. If the docstring is good, Claude uses the tool correctly. If it’s vague, Claude guesses. I learned this the hard way with the lights tool before I added specific parameter descriptions.

Memory: ChromaDB + PostgreSQL

One of the things I wanted from the start was a robot that remembers. Not just within a session — across sessions. If I tell Sancho something on Tuesday, I want it to still know on Friday.

The memory system has two layers:

ChromaDB for semantic search. When Sancho wants to retrieve something from memory, it embeds the query and searches for semantically similar stored memories. This handles fuzzy recall — “what did I tell you about the guest bedroom light?” — without requiring exact keyword matches.

PostgreSQL for structured session data. Every conversation is logged: timestamp, session ID, messages, tools called, outcomes. This is the audit trail. It’s also useful for my dashboard — I can see how often Sancho used each tool, what it was asked about most, where it failed.

The memory write happens automatically at the end of each significant exchange. I define “significant” loosely — any turn where a new fact was stated, a preference was expressed, or an action was taken gets written to memory. Claude decides what counts via a summarization step before the session closes.

Is this perfect? No. ChromaDB retrieval occasionally returns irrelevant memories. The summarization step sometimes misses nuance. But it’s good enough that Sancho feels like it has continuity, which is the experience that matters.

Smart home integration

Home Assistant is the bridge between Sancho and the rest of my apartment. HA exposes a REST API that covers every device I have: lights, switches, media players, climate control.

The integration is simple by design. The lights.py tool makes authenticated HTTP calls to the HA API. Sancho doesn’t know what protocol controls what device — it just calls the tool with an intent (“turn off the bedroom lights”) and the tool handles the HA translation.

The tricky part was entity naming. Home Assistant uses entity IDs like light.bedroom_main — not natural language. I built a small alias layer that maps common phrases (“bedroom lights,” “reading lamp,” “overhead”) to actual entity IDs. Claude uses the alias, the tool resolves it. No hallucinated entity names hitting the API.

The dashboard

The frontend is React + Vite, running in a browser on my phone or laptop. It connects to the FastAPI backend via WebSocket and gives me:

Live chat interface — I can type instead of speak
Robot state display — what Sancho is currently doing (idle, listening, thinking, speaking)
Camera feed — so I can check what it sees
Memory browser — searchable view of stored memories
Tool call log — every tool invocation, parameters, result, timestamp

The dashboard isn’t for showing off. It’s for debugging. When Sancho does something unexpected, I open the tool call log and trace exactly what happened. This loop — observe, debug, fix — is what makes the system actually improve over time.

What I learned

Claude is good at judgment calls under constraints. Given a clear system prompt and well-defined tools, Claude almost never calls the wrong tool. Where it struggles is with ambiguous inputs — “turn that off” when “that” could be three things. I solved this by having Claude ask a clarifying question before acting on ambiguous commands, rather than guessing.

Offline pipelines change the design space. Once I committed to local voice processing, I stopped caring about API uptime, latency spikes, or rate limits. The tradeoff is accuracy — faster-whisper tiny isn’t as accurate as Whisper large-v3. But for practical household commands, it’s more than good enough.

Tool docstrings are your interface. The quality of Claude’s tool use is almost entirely determined by the quality of the docstrings. I spent as much time writing tool descriptions as I did writing tool implementations. That ratio was the right call.

Semantic memory is forgiving but not precise. ChromaDB does what it promises — fuzzy retrieval, no exact matches required. But it retrieves by similarity, not by relevance, and those aren’t always the same thing. I added a re-ranking step that prompts Claude to evaluate retrieved memories before using them, which cut false-positive recalls significantly.

Physical stakes sharpen your thinking. When a bug causes a robot to drive into a wall, you fix it. When a bug causes weird log output, you might deprioritize it. The physicality of the project forced a level of correctness I don’t always bring to purely software work.

What’s next

The mechanical side is the weakest part right now. Navigation is basic — differential drive, no SLAM, no map. Sancho can’t autonomously move through my apartment in any sophisticated way. That’s the next frontier: computer vision-driven navigation using the camera, combined with a semantic map of the space stored in memory.

I’m also experimenting with giving Sancho the ability to initiate conversations — not just respond. Right now it’s fully reactive. I want to try ambient awareness: Sancho notices I’ve been at my desk for four hours and suggests I take a break. That’s a different interaction model from the current request-response loop, and it requires careful thought about when to interrupt versus when to stay quiet.

The robot is named Sancho because he’s useful and present, not because he’s impressive. The goal isn’t to build something that looks cool in a demo. It’s to build something I actually use every day. By that metric, it’s already a success.

The code is private for now — too many hardcoded personal details and home network specifics to publish cleanly. I’ll extract the core architecture into a public repo when I have time to sanitize it properly.

If you have questions about any part of this, the chatbot on this site knows everything I’ve described here. Ask it.