How Voice Recognition Works on Raspberry Pi (and Why It’s Easy to Break) -- Episode 700

Lorraine builds a voice-locked prize box using a Raspberry Pi, a servo, a microphone, and speaker recognition. A hands-on project exploring voice authentication, hardware design, and how easily such systems can be broken.

Reviewing the Fort Vox Voice Box Project

The Fort Vox Voice Box is a practical exploration of voice biometrics framed as a hands-on outreach project. Lorraine’s aim is clear from the outset: build a physical box that can only be unlocked using a spoken voice password, and use it as a tool for linguistics and computing outreach with children. The result sits deliberately at the intersection of hardware hacking, applied machine learning, and human behaviour.

Lorraine introduces the concept succinctly: a locked box, a treat inside, and a spoken phrase as the key. As she puts it, it is “a box that’s locked with a password but it’s a voice password,” designed so children can actively try to defeat it during outreach events. The emphasis is not on security in the commercial sense, but on transparency and learning. The electronics are intentionally visible, and the mechanics are simple enough to provoke curiosity rather than mystique.

Concept and System Breakdown

Early in the project, Lorraine breaks the system down into its core components: microphone, speaker, screen, button, servo, and a Raspberry Pi as the controller. This early sketching phase is important, because it reveals one of the recurring themes of the build: reducing complexity by choosing parts that can serve multiple roles.

The decision to use a Seeed ReSpeaker HAT is a good example. It combines microphone input, speaker output, onboard buttons, and LEDs in a single board, reducing wiring and setup effort. Lorraine notes that it “looks perfect actually for what I need,” largely because it avoids juggling separate audio components. This choice also shapes later software decisions, including how audio devices are discovered and selected in code.
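How the script picks its audio device is not shown in this write-up. As a minimal sketch, assuming the audio library reports devices as dictionaries with `name` and `max_input_channels` keys (the shape returned by, for example, `sounddevice.query_devices()`), selection by name could look like this. The function name, the `seeed` name hint, and the sample device list are all hypothetical:

```python
def find_input_device(devices, name_hint="seeed"):
    """Return the index of the first capture-capable device whose name
    contains name_hint (case-insensitive), or None if nothing matches."""
    hint = name_hint.lower()
    for index, dev in enumerate(devices):
        has_inputs = dev.get("max_input_channels", 0) > 0
        if hint in dev["name"].lower() and has_inputs:
            return index
    return None


# Hypothetical device list; real names depend on the OS image and driver.
devices = [
    {"name": "bcm2835 Headphones", "max_input_channels": 0},
    {"name": "seeed-2mic-voicecard", "max_input_channels": 2},
]

mic_index = find_input_device(devices)
print(mic_index)
```

Matching on the name rather than hard-coding an index is one way to cope with device numbering that changes between boots.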

From a physical design perspective, the box itself is as important as the electronics. The linguistics team’s requirement that the box be transparent drives several later decisions: how the Raspberry Pi is mounted, how the servo-driven locking mechanism is designed, and how the prize is separated from the electronics. Lorraine repeatedly returns to the idea that children should be able to see how the system works, not just interact with it.

Hardware Decisions and Constraints

As the project moves from concept to assembly, Lorraine reflects on practical constraints. Acrylic thickness, cutting methods, and durability under repeated handling all influence the design. A key concern is robustness: the box must survive being shaken, poked, and tested by children.

This leads to a change in the original locking idea. A freely rotating disc would be visually clear, but too easy to defeat physically. Lorraine recognises this risk: if the plastic is flexible, “you can just stick your hand in it and grab the sweet”. The revised design uses a servo-driven hook that physically blocks the lid, combined with a partial hinge so that only the prize compartment opens. This separation of electronics and reward is both a safety measure and a teaching tool.

Button placement and power switching are also considered carefully. The system needs to reset quickly between users, and adults need a way to intervene without dismantling the box. These considerations feed directly into the software flow later on.

Software Setup and Audio Handling

On the software side, Lorraine is explicit about the importance of environment choices. She deliberately avoids the newest Raspberry Pi OS release, noting that “we do not want bookworm… we need to go older, the bullseye”. This is a practical compatibility decision driven by audio libraries and HAT support, and it is a detail that anyone reproducing the project should pay close attention to.

Audio setup is validated early using low-level tools before being wrapped into Python. Lorraine accepts imperfections here, commenting that the microphones are “not amazing” and that some crackling is acceptable for the use case. This is an important reflection: the project is about relative similarity between voices, not studio-quality recordings.

The OLED screen is brought up next, using standard I²C detection and the Luma library. Lorraine demonstrates example animations not because they are part of the final system, but to confirm that the display pipeline works. This incremental validation approach—test each subsystem in isolation—is consistent throughout the build.

Voice Comparison Logic

The most technically dense part of the project is the voice comparison itself. Lorraine uses the Vosk speech and speaker recognition libraries to extract a speaker “signature” from audio recordings. She is candid about this part of the code, describing much of the maths as “gobbledegook” to her, but the implementation works.

Looking at the Python script, the process is clear. A reference recording (password.wav) is captured and processed to extract a speaker embedding. Each attempt is recorded in the same way, and the two embeddings are compared using cosine distance:

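import numpy as np
def cosine_dist(x, y):
    # 1 minus cosine similarity: 0 for vectors pointing the same way, 1 for orthogonal ones
    return 1 - np.dot(np.array(x), np.array(y)) / np.linalg.norm(x) / np.linalg.norm(y)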

The threshold itself is deliberately loose. In the final system, a score below 0.5 triggers the servo to open the box. Lorraine leaves this choice open-ended, noting that it will ultimately be up to the linguistics team to decide what is “close enough” for their experiments.
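To make the threshold concrete, here is a small worked sketch of the unlock decision. It uses plain Python instead of NumPy so it runs anywhere, and the three-element vectors are made-up stand-ins for real speaker embeddings, which are much longer. `should_unlock` is a hypothetical helper name; only the 0.5 threshold comes from the project:

```python
import math

THRESHOLD = 0.5  # value quoted in the write-up; meant to be tuned


def cosine_dist(x, y):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def should_unlock(reference, attempt, threshold=THRESHOLD):
    """True when the attempt is close enough to the reference signature."""
    return cosine_dist(reference, attempt) < threshold


# Made-up vectors for illustration only.
reference = [1.0, 0.0, 0.0]
similar = [0.9, 0.1, 0.0]    # nearly the same direction
different = [0.0, 1.0, 0.0]  # orthogonal to the reference

print(should_unlock(reference, similar))
print(should_unlock(reference, different))
```

Because cosine distance ignores vector length, only the direction of the embedding matters. Lowering the threshold makes the box stricter; raising it makes it easier to open.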

User Interaction and Feedback

The finished interaction loop is simple and effective. The screen displays short prompts such as “Ready,” “Speak,” and “Calculating,” while audio playback reinforces what the correct phrase should sound like. In the Python code, this is handled with a small OLED helper that redraws the display for each state change.
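The helper itself is not reproduced here, but its shape can be sketched. This version is deliberately hardware-free: `draw_text` is an assumed callable that, on the real box, would wrap whatever Luma drawing code clears the screen and writes one line of text. The state names and the `show_state` function are hypothetical; the three prompt strings are the ones mentioned above:

```python
# Map each internal state to the prompt shown on the screen.
PROMPTS = {
    "ready": "Ready",
    "listening": "Speak",
    "scoring": "Calculating",
}


def show_state(state, draw_text):
    """Look up the prompt for a state and hand it to the display callback."""
    message = PROMPTS[state]
    draw_text(message)
    return message


# Stand-in for the OLED: collect the messages in a list instead of drawing.
shown = []
for state in ("ready", "listening", "scoring"):
    show_state(state, shown.append)

print(shown)
```

Keeping the prompts in one table means the wording can be changed for a different audience without touching the control flow.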

When a match succeeds, the servo opens the lock, pauses, and then closes again, ready for the next user. Lorraine notes a timing issue here during testing: the box can be slow to close if the servo movement overlaps with manual handling. This is flagged as something to refine, rather than ignored.
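The open, pause, close sequence can be sketched as follows. The angles, the hold time, and the `set_angle` callable are assumptions for illustration; on the box, `set_angle` would be whatever function drives the servo. Passing the servo and sleep functions in as arguments keeps the sketch runnable without hardware:

```python
import time


def unlock_cycle(set_angle, open_angle=90, closed_angle=0,
                 hold_seconds=3.0, sleep=time.sleep):
    """Open the lock, wait so the prize can be removed, then lock again."""
    set_angle(open_angle)     # swing the hook clear of the lid
    sleep(hold_seconds)       # give the user time to lift the lid
    set_angle(closed_angle)   # return the hook to the locked position


# Stand-in for the servo: record the requested angles in a list.
moves = []
unlock_cycle(moves.append, hold_seconds=0.0)
print(moves)
```

Making the hold time a parameter is one way to experiment with the closing-speed issue noted during testing.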

The live testing section of the project is particularly revealing. Lorraine and colleagues quickly discover that replay attacks work: recording the correct voice and playing it back can defeat the system. Rather than treating this as a failure, Lorraine treats it as a success for outreach. One participant summarises the moment simply: “broke the system”. For a project designed to provoke discussion about security, this is exactly the outcome she wants.

Reflections and Future Directions

By the end of the build, Lorraine reflects openly on the difficulty of working with audio and the time spent debugging. She admits she “doesn’t like speakers,” but is glad she pushed through because the result is repeatable and scalable. She plans to build multiple units for linguistics outreach events.

Future possibilities are hinted at rather than fully specified. Background noise, accents, and mimicry are all areas of interest. Lorraine is particularly interested in how children adapt their behaviour once they realise that silence and consistency matter, and how different accents affect matching scores.

What stands out is that the Fort Vox Voice Box is not positioned as a finished product, but as a platform for experimentation. The hardware is robust enough for repeated use, the software is readable and modifiable, and the limitations are visible by design. Anyone recreating the project is encouraged, implicitly, to tweak thresholds, refine audio handling, or even deliberately exploit weaknesses as part of the learning experience.

In that sense, the project succeeds not because it is secure, but because it makes the trade-offs of voice authentication tangible.

Supporting Files and Links

- Episode 700 Resources

- Raspberry Pi OS image used in project

- Vosk Alphacephei Audio Model

Bill of Materials

Product Name	Manufacturer	Quantity
Raspberry Pi 3	RASPBERRY-PI	1
Official Raspberry Pi PSU with UK and Euro Plugs	RASPBERRY-PI	1
Expansion Board, Respeaker Dual Microphone HAT, Raspberry Pi, AI And Voice Applications	SEEED STUDIO	1
Loudspeaker, Stereo Enclosed, 3W, 8ohm, 16 mm x 30 mm x 70 mm	DFROBOT	1
Buckled Cable, Universal, 4 Pin, Grove Module, 50 mm Cable	SEEED STUDIO	1
Cable, Female Jumper to Conversion, 4pin, Grove Modules, 5 PCs Per Pack	SEEED STUDIO	1
Grove 4 pin Male Jumper to Grove 4 pin Conversion Cable (5Pk)	SEEED STUDIO	1
Nano HAT Hacker for Raspberry Pi	Pimoroni	1

Additional Parts

Product Name	Manufacturer	Quantity
A suitable enclosure
A prize (maybe, a chocolate bar!)
OLED

Top Comments

beacon_dave 1 month ago

It could be interesting to see what confidence score an audio synthesis model like one from WaveNet could achieve.

The model could be trained on easily available sources of audio such as some of the e14 presents episodes before this one.

3 years ago, it looks like about 10mins of high quality audio with transcripts is all that is required to create a viable model.

Deepmind: The Podcast Me, myself and AI
DAB 1 month ago

Great project Lorraine.
beacon_dave 1 month ago

If children are to be involved, it might be worth considering using a more graphical display of the result. A gauge type indicator that shows the current threshold setting required to be reached along with the actual score achieved as a percentage bar.

Might also want an alternative way to get into the box for changing settings and releasing the servo lock without having to resort to the screwdriver every time. If the box locks before it is reloaded, then the researcher will need to supply the correct vox passphrase to open it or resort to the screwdriver.

Perhaps consider adding a mode selector inside to allow a researcher to be able to quickly change between different settings.
beacon_dave 1 month ago in reply to robogary

When Lorraine mentioned children, the first thing that crossed my mind was Roy Scheider saying "...you are going to need a bigger candy bar...".
robogary 1 month ago in reply to kmikemoo

you have to say Open Sesame in a Roy Scheider voice
robogary 1 month ago in reply to beacon_dave

and dont run with scissors
kmikemoo 1 month ago in reply to robogary

SO... the duck (or an onlooker) has to scream first? Very Hollywood.
robogary 1 month ago in reply to beacon_dave

I'll put a VOX lock on the Shark Chase Tank so it cant eat rubber ducks when no one is watching.
beacon_dave 1 month ago

I noticed that when you made the initial passphrase recording you had the top cover open, so there is nothing obstructing the microphones on the ReSpeaker hat. However when you are trying to unlock the box, you now have a plastic cover over the microphones.

You might get better results if you raise the hat up and cut a hole in the cover above each microphone. Pop an acoustic overcover windshield over it to reduce the effects of breath noise.

May also help if you isolate the box from the tabletop with some foam pads.
beacon_dave 1 month ago in reply to lorrainbow

I was getting concerned, your book highlights problems with Zombies but nothing about Werewolves...