I wanted a few small units for speech-to-text (voice recognition) and text-to-speech (voice output) for my home automation environment, which includes OpenHAB and Rhasspy.
Previously, I have used Rhasspy satellites built with a Raspberry Pi and a small speaker (described here), but now I wanted something simpler and more compact, with the voice recognition happening in the unit itself, without streaming audio to any server, not even to an in-house one.
Espressif offers a ready-made speech recognition component for their ESP32-S3 processors called esp-sr, which can also be used in an Arduino project … sort of, see below for the nasty details.
The advantages of using esp-sr over a Raspberry Pi based satellite are:
- the speech recognition happens on the local processor, so there is no need to stream raw audio over WiFi to a central Rhasspy server.
- it includes some advanced signal processing, like noise reduction and blind source separation when using two microphones.
The disadvantages of using esp-sr versus Rhasspy are:
- it can only detect fixed text, like “turn on the lamp“, but not text with variable numbers, like “dim the lamp to 50 percent” — not an issue for me, in most use cases.
- it only does speech-to-text (voice recognition) and not text-to-speech (voice synthesis), unless you understand Chinese; for voice output you have to use a separate service, like Rhasspy or a standalone TTS engine.
- it only supports English and Chinese, whereas Rhasspy can be configured for many different languages; I’m ok with English, but YMMV.
The plan was to package the ESP32-S3 module with a small speaker, and write software that integrates with my existing home automation environment.
Key features
- low cost: a small device based on an ESP32-S3, total cost of materials ca. €20
- privacy: self-contained speech recognition in the device, no audio streaming to a server
- dynamic: automatic configuration for voice-controlled items defined in OpenHAB
Bird’s eye view of the solution
- a Python script occasionally running on a server
(a) queries OpenHAB for all voice-controlled items,
(b) builds phrases such as “turn on kitchen lamp” and “turn off kitchen lamp”,
(c) converts them to phonemes to be fed to the esp-sr library, and
(d) stores all information in a file on a server.
- a small ESP32-S3 module running my software and the Espressif esp-sr speech recognition library retrieves the phrases file from the server, initializes the library, and then listens for voice commands.
- when the ESP32-S3 module detects a voice command, it publishes a message to MQTT, in essentially the same format that a Rhasspy satellite would produce. Therefore, I only need one OpenHAB rule to deal with commands from either the new ESP32-S3 module or an existing Rhasspy satellite.
- voice synthesis uses the Rhasspy text-to-speech engine, the ESP32-S3 module just receives a WAV file from Rhasspy (via MQTT) and plays that on its speaker.
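The server-side script can be sketched roughly as follows. This is a hedged sketch, not the actual script: the OpenHAB hostname and the “VoiceControl” tag name are assumptions, only on/off phrases are generated, and the phoneme conversion step (done with the tooling that ships with esp-sr) is omitted.

```python
import json
import urllib.request

# Assumed OpenHAB host; the real script's address will differ.
OPENHAB_ITEMS_URL = "http://openhab.local:8080/rest/items"

def voice_controlled_items(url=OPENHAB_ITEMS_URL):
    """Query OpenHAB's REST API for all items and keep those marked
    for voice control (the tag name "VoiceControl" is an assumption)."""
    with urllib.request.urlopen(url) as resp:
        items = json.load(resp)
    return [i for i in items if "VoiceControl" in i.get("tags", [])]

def build_phrases(items):
    """Build the fixed command phrases the device will listen for,
    e.g. "turn on kitchen lamp" / "turn off kitchen lamp"."""
    phrases = []
    for item in items:
        label = item["label"].lower()
        phrases.append(f"turn on {label}")
        phrases.append(f"turn off {label}")
    return phrases

def write_phrases_file(phrases, path):
    """Store the phrase list where the ESP32-S3 module can fetch it.
    The real file also carries the phoneme transcriptions."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(phrases))
```

Running this from cron (or a systemd timer) keeps the phrase file in sync whenever voice-controlled items are added to OpenHAB.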
Hardware

I developed this project using a cheap ESP32-S3 dev module from Aliexpress, a clone of the Espressif ESP32-S3-DevKitC-1 N16R8. This has 16 MB flash and 8 MB PSRAM.
A MAX98357A I2S digital amplifier module is connected to I2S interface #0 and drives the speaker.
Two INMP441 I2S digital microphone modules are connected to I2S interface #1.
An analog RGB LED serves as a status indicator: waiting for wakeword, wakeword recognized and waiting for command, command recognized, or timeout.
Build instructions
You will need
- a working installation of OpenHAB, to receive and interpret the commands recognized by MyVoiceBox. I currently use OpenHAB 4.1.1, running under Debian on a virtual x86 machine.
- a working installation of Rhasspy, configured for a TTS engine of your choice. I use Rhasspy 2.5.11, running under Debian on a virtual x86 machine, with the Larynx text-to-speech engine and the blizzard_lessac voice.
- an HTTP server that can serve files to MyVoiceBox. I use Apache 2.4.65 running under Debian on a virtual x86 machine.
For detailed instructions on how to build the hardware, build and flash the firmware, and configure your OpenHAB and Rhasspy installations, see the Github page.
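For reference, the MQTT message the single OpenHAB rule has to interpret follows the Hermes convention used by Rhasspy satellites, where a recognized intent is published to `hermes/intent/<intentName>`. A hedged sketch of such a payload, with made-up site and intent names, limited to the fields a rule typically reads:

```python
import json

def intent_message(site_id, text, intent_name):
    """Build a Rhasspy/Hermes-style intent payload; the field names
    follow the Hermes protocol, the values here are illustrative."""
    return {
        "siteId": site_id,                 # which satellite heard the command
        "input": text,                     # the recognized phrase
        "intent": {
            "intentName": intent_name,
            "confidenceScore": 1.0,
        },
    }

# Topic and names below are hypothetical examples.
topic = "hermes/intent/TurnOnKitchenLamp"
payload = json.dumps(intent_message("kitchen", "turn on kitchen lamp",
                                    "TurnOnKitchenLamp"))
```

Because the ESP32-S3 module emits the same shape, the rule cannot tell it apart from a Rhasspy satellite, which is the point.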
Insights
Using Espressif speech recognition in an Arduino project
I thought I could just write an Arduino project, using the Espressif esp-sr component for speech recognition, wrapped in the ESP_SR Arduino library.
However … in an Arduino project, the wakeword is fixed as “Hi, ESP” and cannot be changed, because the ESP libraries are precompiled for this particular configuration. I tried the real Arduino IDE as well as the Arduino CLI, VSCode with the “Arduino Community Edition” extension, and VSCode with PlatformIO, all with the same result.
Therefore, this became an ESP-IDF project with “Arduino as a component”. I use VSCode with the ESP-IDF extension.
In the ESP-IDF project, one of several wakewords can be selected through the project configuration tool. The original ESP_SR Arduino library (at least for Core 3.3.0) also has the wakeword “Hi, ESP” hardcoded, so I had to clone that library and make my changes. The cloned library is called ESP_SRx and can be found in the components/ESP_SRx folder in the Github repository.
Acoustic “debugging”
I found it really helpful to be able to listen to the audio signal picked up by the microphones and sent to the speech recognition engine. For this, there are two features, which can be enabled via the web interface:
- record the audio signal while a command is being spoken, and save it to a WAV file on an external server
- record the audio signal while a command is being spoken, and replay it through the speaker immediately afterwards
These features help to address questions like: is there too much echo from the room? is the signal contaminated with electrical noise?
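For the first feature, the server end needs to wrap the received raw audio in a WAV container before it can be inspected or played back. A minimal sketch of what that receiver might do, assuming 16 kHz mono 16-bit PCM (the format esp-sr operates on); the function name is mine:

```python
import wave

def save_pcm_as_wav(pcm_bytes, path, sample_rate=16000):
    """Wrap raw 16-bit mono PCM samples in a WAV container so the
    recording can be opened in any audio player or editor."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono after the front-end processing
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
```

Opening the resulting file in an editor like Audacity makes echo or electrical noise easy to spot in the waveform.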
Alternatives
I am aware of the following possible alternatives to this project, with similar objectives and features:
- use a Raspberry Pi based Rhasspy satellite, as described in this blog post. It works, but requires more power, and audio is streamed over WiFi for speech recognition at a central in-home server.
- the ESP32 Rhasspy Satellite project. It uses a “regular” ESP32; I built one, see this blog post. I used it for output only, because for voice recognition it requires constant streaming of the audio signal to a central server. Even the wakeword detection is done on the server.
- the Willow project, which also runs on an ESP32-S3 and can use the speech recognition library provided by Espressif. It is a sophisticated open-source project, but its focus appears to be on working with a central in-home server, which they call the “inference server”. Also, it doesn’t have voice output, as far as I can tell. The required hardware is from a small list of devices sold by Espressif, which they call “cheap” at $50 … still more than the ~€20 hardware cost of the solution described here.