My first step toward implementing a tinyML keyword spotting model is collecting audio data from the PDM microphone on the EPD shield. I am going to start from an example project that was posted on Electromaker by knaveen: https://www.electromaker.io/project/view/voice-recognition-at-the-edge . This project uses a PSoC6 BLE Pioneer Kit with EInk Display, which is a configuration similar to mine. His project is split into two applications, one for data collection and the other for inferencing. He also used Edge Impulse for his model development and created a "port" of the SDK for inferencing since this is not an officially supported board.
I am going to keep his configuration for testing, but in actual use I will create an inferencing task to run in my application. Or, if all goes well, I will try creating the equivalent functionality using MTB-ML.
Knaveen has a GitHub repository with all of the MTB files: https://github.com/metanav/Voice_Recognition_PSoC6 . I don't want to deal with potential library version issues, so I'm going to create a new application and just copy and modify his code as required. I'll start with data collection.
Here is the structure of the Keyword_Data_Collection application with the BSP and required libraries shown.
The MCU that Knaveen used had only 288 KB of SRAM, which he determined would hold only 7 seconds of PDM/PCM audio sampled at 16 kHz. The PSoC 62S2 has 1 MB of SRAM, but since I'm capturing short phrases or names I'm going to use his setup - at least initially. His program uses the EInk display and the user LEDs to indicate status. The SW2 button initiates the recording. The PDM data from the microphone on the display shield is converted to PCM and transmitted via USB serial to the host PC. A Python script running on the host PC converts and saves the PCM data into a WAV file.
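As a quick sanity check on that sizing (my arithmetic, not knaveen's, and it assumes mono 16-bit PCM), 7 seconds of audio at 16 kHz works out like this:

```python
# Rough capture-buffer sizing, assuming mono 16-bit PCM at 16 kHz
SAMPLE_RATE = 16000       # samples per second
SAMPLE_WIDTH = 2          # bytes per sample (16-bit)
RECORD_SECONDS = 7

buffer_bytes = SAMPLE_RATE * SAMPLE_WIDTH * RECORD_SECONDS
print(buffer_bytes)                   # 224000 bytes for the audio buffer
print(288 * 1024 - buffer_bytes)      # roughly 70 KB of a 288 KB SRAM left for everything else
```

That fits in the original 288 KB of SRAM with roughly 70 KB to spare, and it is also where the 224 KB transfer size mentioned below comes from.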
Here's a quick demo of the data capture program in operation. My test is to have it try to differentiate between a set of names ("Ralph", "Joe", "Sam"). Apologies for two unsynchronized videos (I'm having some issues with my webcam). The first video shows the operation of the dev board and the second shows the Python program running in a command window.
The board resets to a "Start" message on the display. Pressing the user button SW2 starts the 7-second capture; the "Recording" message is displayed and the red LED turns on. When the capture is finished, the "Writing" message is displayed, the amber LED turns on, and the PCM data is uploaded via USB serial to the PC, where the Python script converts it to WAV format (it takes about 20 seconds for the 224 KB file). When the process is complete, the "Press SW2" message is displayed to indicate that the next capture cycle is ready.
The Python script requires two arguments - the serial port and the output file label. The script checks for existing files with the same label and appends an incrementing number to keep the recordings unique. The script needs to be restarted for each label. The file counter starts at 10 because I had already captured 0-9.
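For reference, here is a minimal sketch of what that host-side script does. This is my own simplified version, not knaveen's actual code - the baud rate, the framing (just reading a fixed 224,000 bytes per capture), and the filename format are all assumptions:

```python
# host_capture.py - minimal sketch of the host-side capture script
# (not knaveen's actual code; baud rate, framing, and filename format are assumptions)
import argparse
import os
import wave

import serial  # pyserial

SAMPLE_RATE = 16000                                      # 16 kHz PCM from the PDM/PCM block
SAMPLE_WIDTH = 2                                         # assuming 16-bit mono samples
RECORD_SECONDS = 7
RAW_BYTES = SAMPLE_RATE * SAMPLE_WIDTH * RECORD_SECONDS  # 224,000 bytes per capture


def next_filename(label):
    """Append an incrementing number so recordings with the same label stay unique."""
    n = 0
    while os.path.exists(f"{label}.{n}.wav"):
        n += 1
    return f"{label}.{n}.wav"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("port", help="serial port, e.g. COM5 or /dev/ttyACM0")
    parser.add_argument("label", help="output file label, e.g. ralph")
    args = parser.parse_args()

    with serial.Serial(args.port, 115200) as ser:        # baud rate is an assumption
        while True:
            print("Waiting for a capture (press SW2 on the board)...")
            pcm = ser.read(RAW_BYTES)                    # block until one full capture arrives
            out = next_filename(args.label)
            with wave.open(out, "wb") as wav:
                wav.setnchannels(1)
                wav.setsampwidth(SAMPLE_WIDTH)
                wav.setframerate(SAMPLE_RATE)
                wav.writeframes(pcm)
            print(f"Wrote {out} ({len(pcm)} bytes of PCM)")


if __name__ == "__main__":
    main()
```

As a side note, if the link effectively runs at 115200 baud (again, an assumption on my part), 224,000 bytes at roughly 10 bits per byte on the wire works out to about 19 seconds, which would explain the roughly 20 second upload time.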
The result is a set of labeled WAV files that are uploaded to my Edge Impulse Project.
Log in to my Project Dashboard.
The Edge Impulse Data Acquisition Panel on the project dashboard has a button to invoke the uploader for existing data.
I can directly upload WAV files, have the labels inferred from the filenames, and have the data automatically split into Training and Testing sets - very well implemented!
I am uploading 7-second sample files with multiple instances of the keyword. For keyword spotting it works well to split each file into 1-second segments.
That's easily accomplished by clicking the 3 vertical dots on the right and selecting "Split sample".
A graphical window pops up to allow you to adjust the segments if required. It is good to select "Shift samples" to randomize the sample position within the 1-second window. You can also deselect "bad" samples.
When all of the 7-second files are split, we end up with our dataset of 1-second sample files. It is recommended to balance the number of files per label, as indicated by the color wheel at the top of the panel.
With the data uploaded, it's time to create an impulse (model). Since this is a simple keyword spotting example, I'm going to use the default audio settings.
A quick check of the feature extraction using the MFCC processing block.
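If you want to poke at the same kind of features offline, this is roughly what MFCC extraction looks like on one of the 1-second clips. Note that this uses librosa rather than Edge Impulse's own DSP block, so the framing parameters and exact coefficient values won't match what the MFCC processing block produces, and the filename is just a placeholder from my capture runs:

```python
# Offline look at MFCC features for a 1-second clip
# (librosa, not Edge Impulse's DSP block - parameters and outputs will differ)
import librosa

y, sr = librosa.load("ralph.10.wav", sr=16000)   # placeholder filename from the capture script
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)
print(mfcc.shape)   # (13, n_frames): one column of cepstral coefficients per frame
```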
And then training the Neural Network.
Since this is a fairly constrained case I was hoping for better accuracy, but the model is having a problem with some of the samples. One thing that I noticed when I was capturing data was that I was not getting much volume from the PDM microphone. I tried adjusting the PDM/PCM converter settings but didn't get much improvement. This is a problem that I need to fix. I'll post a question on the Infineon PSoC product forum and see if I can get some help.
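Until I sort that out, a quick way to quantify the low level is to check how much of the 16-bit range a capture actually uses. A few lines of Python (my own quick check, not part of the project code) will do it:

```python
# Report the peak amplitude of a captured WAV as a fraction of 16-bit full scale
import sys
import wave

import numpy as np

with wave.open(sys.argv[1], "rb") as wav:
    pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16).astype(np.int32)

peak = int(np.abs(pcm).max())
print(f"peak {peak} of 32767 ({peak / 32767:.1%} of full scale)")
```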
Here is the result of classifying the Test Data set. The results are similar to the training data, just with a different label being misclassified.
At this point things are good enough to test deploying the model to the PSoC 62S2 board. I'll cover that in my next post.