RoadTest: Review of a Raspberry Pi 3 Model B
Author: alanmcdonley
Creation date:
Evaluation Type: Independent Products
Did you receive all parts the manufacturer stated would be included in the package?: True
What other parts do you consider comparable to this product?: Raspberry Pi B+
What were the biggest problems encountered?: Version of pocketsphinx used does not log performance data when performing grammar-based recognition.
Detailed Review:
Raspberry Pi 3 Model B Roadtest:
Pocket Sphinx Speech Recognition Performance Comparison
Author: Alan McDonley
Sponsor: Element14 (element14.com)
Hardware (existing personal resources unless otherwise noted):
Raspberry Pi 3 Model B (sponsor provided)
Raspberry Pi Model B+ 512MB
USB microphone
Software:
OS: Raspbian Jessie-lite v8
ASR: PocketSphinx 5prealpha branch 4e4e607
TTS: Festival 2.1 Nov 2010
Test Program: mymain.py
(derived from Neil Davenport's makevoicedemo: git clone https://github.com/bynds/makevoicedemo)
Python: v 2.7
Configuration:
Unconstrained Large Vocabulary LM: ~20k uni-grams, 1.4M bi-grams, 3M tri-grams
Small LM: ARPA-format uni-, bi-, and tri-gram language model built from 136 words in 106 phrases
JSGF Medium Grammar: supports all 136 words in the 106 phrases used to create the Small LM
Pi3: 883052KiB Total Memory, 0 swap used
PiB+: 380416KiB Total Memory, 0 swap used
Overview:
This roadtest compares the computational performance of the Raspberry Pi 3 Model B with the Raspberry Pi Model B+ running an automatic speech recognition engine, pocketsphinx (Carnegie Mellon University's open-source, large-vocabulary, speaker-independent continuous speech recognition engine).
The performance measurements of interest in this test are "xRT" - the ratio of CPU time used to the duration of the audio processed (a value below 1.0 means faster than real time) - and the word error rate "WER" - (substitutions + insertions + deletions) / total words spoken.
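For illustration, both metrics reduce to simple arithmetic (a minimal sketch in Python; the example numbers are taken from the large-LM file test reported below):

    # Word Error Rate: fraction of spoken words recognized incorrectly
    def wer(substitutions, insertions, deletions, total_words):
        return float(substitutions + insertions + deletions) / total_words

    # Real-time factor: CPU seconds used per second of audio
    # (below 1.0 means faster than real time)
    def xrt(cpu_seconds, audio_seconds):
        return cpu_seconds / audio_seconds

    # Example from the large-LM file test below: 5 errors in 67 words
    print("WER = %.0f%%" % (100 * wer(2, 2, 1, 67)))   # prints: WER = 7%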
Three test modes:
1) Large Language Model with file input - 10 simple to complex, unconstrained phrases
2) Small Language Model with microphone input - 10 in-model phrases
3) Grammar-based with microphone input - 10 in-grammar phrases
As detailed under Impact of Findings below, the results show that the Pi 3 (using only one of its four cores) can keep up with small language-model speech, and can even handle large-vocabulary, unconstrained, continuous speech with substantial processing resources in reserve.
Test Bench:
The photo shows the Raspberry Pi B+ at the bottom of the current robot, with the sponsored Raspberry Pi 3 Model B at the front right. A Drok USB power meter (blue) sits above a small externally powered speaker. The microphone used (center) has a USB interface. Tests were run via remote ssh over WiFi into the device under test, from terminal windows on a Mac Mini.
Summary Test Result:
Speech recognition using the large language model with unconstrained audio file input on the Pi 3 is 2.4 times faster than on the Pi B+, with an identical error rate.
Speech recognition using the small language model with in-model microphone input on the Pi 3 is 3.72 times faster than on the Pi B+. The Pi 3 achieved a zero (in repeated runs, near-zero) word error rate, while the Pi B+ showed a 37% word error rate.
Speech recognition using a medium-size grammar with in-grammar microphone input on the Pi 3 had zero errors, while the Pi B+ showed a 3% WER. (The version of pocketsphinx used does not report performance data with grammar-based recognition.) Both processors appeared to keep up with the commands in real time.
Short video of each processor running is at: https://vimeo.com/169445418
Programs, grammar, language model, corpus text, test wav file, log files from every run, and performance for each run are at: https://github.com/slowrunner/Pi3RoadTest
This report is located at: https://goo.gl/RrGgCm
Test Procedure:
1) Record test audio of 10 phrases of various lengths (input_file.txt); a format check of the recording follows the phrase list:
arecord -f s16_LE -r 16000 test16k.wav
Speak:
Hello
What Time is it
Drive Forward Slowly
Who do you think will win the election
What is the weather forecast
How long have you been running
Turn forty five degrees left
one two three four five six seven eight nine ten
a b c d e f g h i j k l m n o p q r s t u v w x y z
Goodbye
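Pocketsphinx's default acoustic model expects 16 kHz, 16-bit, mono audio, so it is worth confirming the recording header before decoding. A quick check with Python's standard wave module (a sketch, using the file name recorded above):

    import wave

    w = wave.open('test16k.wav', 'rb')
    print("channels:     %d" % w.getnchannels())        # expect 1 (mono)
    print("sample width: %d bytes" % w.getsampwidth())  # expect 2 (16-bit)
    print("sample rate:  %d Hz" % w.getframerate())     # expect 16000
    print("duration:     %.1f s" % (w.getnframes() / float(w.getframerate())))
    w.close()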
2) Run large-LM PocketSphinx on the recording in one remote ssh session; run top in another (%CPU, %Mem)
pocketsphinx_continuous -infile test16k.wav 2>&1 | tee ./psphinx.log
(note power consumption A during and after reco)
3) Extract Performance Data (xRT)
./perf.sh >result_Pi<model>_file.log
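perf.sh itself is in the GitHub repository above. As an illustration of what it extracts, the same numbers could be pulled from a pocketsphinx log in Python (a sketch, assuming the usual "TOTAL <pass> ... CPU ... xRT" lines pocketsphinx writes for the fwdtree, fwdflat, and bestpath passes):

    import re
    import sys

    # Sum the per-pass CPU xRT figures from a pocketsphinx log
    total_xrt = 0.0
    for line in open(sys.argv[1]):
        m = re.search(r'TOTAL (fwdtree|fwdflat|bestpath) [\d.]+ CPU ([\d.]+) xRT', line)
        if m:
            print(line.strip())
            total_xrt += float(m.group(2))
    print("Total CPU xRT: %.3f" % total_xrt)

Invoked as, for example: python perf.py psphinx.log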
4) Extract recognized phrases
tail -14 ./psphinx.log >reco_Pi<model>_file.log
5) Run PocketSphinx from the microphone with the small LM and speak 10 in-model phrases (a minimal sketch of the recognition loop follows the phrase list)
python mymain.py
(note power consumption A during and after program)
(note %CPU %Mem from top during program execution)
Speak:
Hello
What Time is it
Drive Forward Slowly
How long have you been running
Turn forty five degrees left
Go backward quickly
Is it going to rain
Spin
Stop now
sudo shutdown minus H now
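mymain.py is in the test repository linked above; its core recognition loop follows the common pocketsphinx pattern sketched below (not the actual test program; the model and dictionary paths are placeholders, and the pocketsphinx and pyaudio Python packages are assumed):

    import pyaudio
    from pocketsphinx.pocketsphinx import Decoder

    # Placeholder model paths -- substitute your own acoustic model,
    # small ARPA language model, and pronunciation dictionary
    config = Decoder.default_config()
    config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
    config.set_string('-lm', 'small.lm')
    config.set_string('-dict', 'small.dic')
    decoder = Decoder(config)

    # 16 kHz, 16-bit mono microphone stream, matching the acoustic model
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=1024)
    stream.start_stream()

    decoder.start_utt()
    in_speech = False
    while True:
        buf = stream.read(1024)
        decoder.process_raw(buf, False, False)
        if decoder.get_in_speech() != in_speech:
            in_speech = decoder.get_in_speech()
            if not in_speech:            # silence after speech = end of utterance
                decoder.end_utt()
                hyp = decoder.hyp()
                if hyp is not None:
                    print("Recognized: %s" % hyp.hypstr)
                decoder.start_utt()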
6) Copy term output to LMsmall_Pi<model>.log
7) Extract Performance Data (xRT)
./perf.sh >result_Pi<model>_10.txt
8) Run PocketSphinx from the microphone with the medium JSGF grammar and speak 10 in-grammar phrases (a configuration sketch follows the phrase list).
(note power consumption A during and after program)
(note %CPU %Mem from top during program execution)
Speak:
Hello
What Time is it
Drive Forward Slowly
How long have you been running
Turn forty five degrees left
Go backward quickly
Is it going to rain
Spin
Stop
sudo shutdown minus H now
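Switching from the small LM to grammar-based recognition is only a decoder configuration change (a sketch; 'commands.gram' is a placeholder name for the medium JSGF grammar described under Configuration):

    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
    config.set_string('-jsgf', 'commands.gram')   # JSGF grammar replaces the -lm option
    config.set_string('-dict', 'small.dic')
    decoder = Decoder(config)
    # ...the microphone loop is identical to the small-LM sketch above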
9) Copy term output to jsgf_Pi<model>.log
Detailed Test Results:
1) pocketsphinx_continuous -infile test16k.wav
Pi B+: top 92-98% CPU 17% memory, 0.40A at 5.02V (+0.08A)
2 word substitutions, 1 deletion, 2 insertions = 5 errors / 67 words
= 7% Word Error Rate (WER)
5.234 total CPU xRT reported (sum of fwdtree, fwdflat, and bestpath)
Pi 3: top 100% CPU 7% memory, 0.49A (+0.18A)
2 word substitutions, 1 deletion, 2 insertions = 5 errors / 67 words
= 7% Word Error Rate (WER)
2.168 total CPU xRT reported (sum of fwdtree, fwdflat, and bestpath)
2) pocketsphinx python using microphone and small LM
Pi B+: top 90% CPU 5% memory, 0.45A (+0.13A)
4 word substitutions, 9 deletions, 0 insertions = 13 errors / 35 words
= 37% Word Error Rate (WER)
3.075 total CPU xRT reported (sum of fwdtree, fwdflat, and bestpath)
Pi 3: top 100% CPU 3% memory, 0.49A (+0.18A)
0 word substitutions, 0 deletions, 0 insertions = 0 errors / 35 words
= 0% Word Error Rate (WER)
0.826 total CPU xRT reported (sum of fwdtree, fwdflat, and bestpath)
3) pocketsphinx python using microphone input and medium grammar:
Pi B+: 1 word substitution, 0 deletions, 0 insertions = 1 error / 34 words
= 3% Word Error Rate (WER)
Pi 3: 0 word substitutions, 0 deletions, 0 insertions = 0 errors / 34 words
= 0% Word Error Rate (WER)
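The speedup figures quoted in the summary follow directly from these xRT measurements:

    # Speedup = Pi B+ total CPU xRT / Pi 3 total CPU xRT
    print("Large LM, file input:  %.2fx" % (5.234 / 2.168))   # ~2.41x
    print("Small LM, microphone:  %.2fx" % (3.075 / 0.826))   # ~3.72x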
Impact of Findings:
There has been a long-standing debate between product developers (pragmatists) and speech interface researchers over the role of grammars in speech interfaces. When processing resources (cycles and memory) are scarce or slow, agreeing on a limited set of words and constraining phrase complexity (a speech grammar) can enable a successful speech interface.
Grammar-based speech interfaces for complex human-machine interaction become arduous to develop and tune. They also tend to be fragile, with wide disparity in user success. From a software-coupling standpoint (a drawback), grammar-based speech interfaces require the developer to duplicate effort to keep the grammar and the result interpretation tightly in sync.
Unconstrained, continuous speech interfaces using language-model-based recognition demand much higher performance from processing resources, but enable more robust interfaces with much greater utility.
For a simple product (or, in my case, a personal robot) of limited functionality, a grammar-based speech command interface is possible on the Raspberry Pi B+.
Since Pi B+ language-model recognition is three to five times slower than real time, small language-model speech cannot be used as a control interface, and large-vocabulary, unconstrained (language-model-based) recognition is totally out of the question.
The release of the Raspberry Pi 3 Model B enables versatile speech interfaces in autonomous personal robots. The results of this test show that the Pi 3 (using only one of four cores) can keep up with small language-model speech, freeing the developer from the arduous work of grammar development, and expanding the speech interface capability beyond commands to enable the beginnings of dialog.
The Pi 3 can even enable surprisingly comfortable human-robot interaction using large vocabulary, unconstrained, continuous speech, with tremendous reserve processing resources available.
Sixteen years ago my single-board-computer-based robot ran programs in 32K of memory at 1MHz, had only a one-way interface (Morse code to human), and had a situational awareness dimension of 12 inches.
Today, with the Raspberry Pi 3, there is processing power for two-way communication in human languages, local situational awareness through vision, and global situational awareness - all in the same 1 cubic foot robot.
About The Author:
Alan is a Sr. Software Quality Assurance engineer for Broadsoft Hospitality Group, which provides cloud-based communications for hotels and resorts worldwide. Formerly, Alan was a Sr. Development Engineer for IBM's Telephony Speech and Contact Center Services, working with speech recognition, speaker identification, and interactive voice response (IVR) technologies.
Acknowledgements:
element14.com: sponsored the Raspberry Pi 3 RoadTest and provided the Pi 3 for this test
Festival Speech Synthesis System: Copyright Univ. of Edinburgh, and Carnegie Mellon Univ.
PocketSphinx 5prealpha:
Authors: Alan W Black, Evandro Gouvea, David Huggins-Daines,
Alexander Solovets, Vyacheslav Klimov
Assistance: Nickolay V. Shmyrev - you rock guy!
pocket_sphinx_listener.py, main.py: Neil Davenport
http://makezine.com/projects/use-raspberry-pi-for-voice-control/
git clone https://github.com/bynds/makevoicedemo