RoadTest Review a Raspberry Pi 3 Model B! - Review


RoadTest: RoadTest Review a Raspberry Pi 3 Model B!

Author: alanmcdonley


Evaluation Type: Independent Products

Did you receive all parts the manufacturer stated would be included in the package?: True

What other parts do you consider comparable to this product?: Raspberry Pi B+

What were the biggest problems encountered?: Version of pocketsphinx used does not log performance data when performing grammar-based recognition.

Detailed Review:

Raspberry Pi 3 Model B Roadtest:

PocketSphinx Speech Recognition Performance Comparison

 

Author: Alan McDonley

Sponsor: element14 (element14.com)

Hardware (existing personal resources unless otherwise noted):

Raspberry Pi 3 Model B (sponsor provided)

Raspberry Pi Model B+ 512MB

USB microphone

 

Software:

OS: Raspbian Jessie Lite v8

ASR: PocketSphinx 5prealpha branch 4e4e607

TTS: Festival 2.1 Nov 2010

Test Program: mymain.py

(derived from Neil Davenport's makevoicedemo: git clone https://github.com/bynds/makevoicedemo)

Python: v2.7

 

Configuration:

    Unconstrained Large Vocabulary LM: ~20k uni-grams, 1.4M bi-grams, 3M tri-grams

    Small LM: 136 words in 106 phrases; ARPA-format uni-, bi-, and tri-gram language model

    JSGF Medium Grammar: supports all 136 words in the 106 phrases used to create the Small LM (a decoder-configuration sketch follows this list)

    Pi3: 883052KiB Total Memory, 0 swap used

    PiB+: 380416KiB Total Memory, 0 swap used
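For illustration, the sketch below shows how the pocketsphinx Python bindings can be pointed at either the small ARPA LM or a JSGF grammar. This is a minimal sketch, not the author's actual files: the grammar fragment, the file names small.lm, small.dic, and commands.gram, and the acoustic-model path are all hypothetical placeholders (the real grammar and language model are in the GitHub repo linked below).

    from pocketsphinx import Decoder

    # Hypothetical fragment in the spirit of the medium grammar (JSGF format)
    open('commands.gram', 'w').write('''#JSGF V1.0;
    grammar robot;
    public <command> = hello | goodbye | drive forward slowly | stop now;
    ''')

    config = Decoder.default_config()
    config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')  # acoustic model (assumed install path)
    config.set_string('-dict', 'small.dic')        # pronunciation dictionary
    config.set_string('-lm', 'small.lm')           # ARPA language-model mode...
    # config.set_string('-jsgf', 'commands.gram')  # ...or JSGF grammar mode instead of -lm
    decoder = Decoder(config)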

 

Overview:

 

This roadtest compares the computational performance of the Raspberry Pi 3 Model B with the Raspberry Pi Model B+ running an automatic speech recognition engine, pocketsphinx (Carnegie Mellon University's open-source, large-vocabulary, speaker-independent continuous speech recognition engine).

 

The performance measurements of interest in this test are "xRT" - the ratio of CPU time consumed to the duration of the audio processed - and the word error rate "WER" - (substitutions + insertions + deletions) / total words spoken.
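To make both metrics concrete, here is a minimal sketch in plain Python (the example numbers are the file-input figures reported under Detailed Test Results):

    def wer(subs, ins, dels, total_words):
        # Word error rate: (substitutions + insertions + deletions) / total words spoken
        return float(subs + ins + dels) / total_words

    def xrt(cpu_seconds, audio_seconds):
        # xRT: CPU seconds consumed per second of audio; below 1.0 is faster than real time
        return float(cpu_seconds) / audio_seconds

    print('WER = %.0f%%' % (100 * wer(2, 2, 1, 67)))   # 5 errors / 67 words -> WER = 7%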

 

Three test modes:

1) Large Language Model with file input - 10 simple to complex, unconstrained phrases

2) Small Language Model with microphone input - 10 in-model phrases

3) Grammar-based with microphone input - 10 in-grammar phrases

 

As detailed under Impact of Findings below, the results show that the Pi 3 (using only one of four cores) can keep up with small language-model speech, and can even sustain surprisingly comfortable large-vocabulary, unconstrained, continuous recognition with substantial processing resources in reserve.

 

 

Test Bench:

 

Photo shows the Raspberry Pi B+ at the bottom of the current robot, with the sponsored Raspberry Pi 3 Model B at the right front. A Drok USB power meter (blue) sits above a small externally powered speaker. The microphone used (center) has a USB interface. Tests were run from terminal windows on a Mac Mini via remote ssh, over WiFi, into the device under test.

 

[image: test bench]

 

Summary Test Results:

 

Speech recognition using the large language model with unconstrained audio-file input on the Pi 3 is 2.4 times faster than on the Pi B+, with an identical error rate.

 

Speech recognition using the small language model with in-model microphone input on the Pi 3 is 3.72 times faster than on the Pi B+. The Pi 3 had a zero (or near-zero) word error rate, while the Pi B+ showed a 37% word error rate.

 

Speech recognition using a medium-size grammar with in-grammar microphone input on the Pi 3 had zero errors, while the Pi B+ showed a 3% WER. (The version of pocketsphinx used does not report performance data with grammar-based recognition.) Both processors appeared to keep up with the commands in real time.

 

Short video of each processor running is at: https://vimeo.com/169445418

 

Programs, grammar, language model, corpus text, test wav file, log files from every run, and performance for each run are at: https://github.com/slowrunner/Pi3RoadTest

 

This report is located at: https://goo.gl/RrGgCm

 

Test Procedure:

 

1) Record test audio of 10 phrases of various lengths (input_file.txt):

arecord -f S16_LE -r 16000 test16k.wav

Speak:

Hello

What Time is it

Drive Forward Slowly

Who do you think will win the election

What is the weather forecast

How long have you been running

Turn forty five degrees left

one two three four five six seven eight nine ten

a b c d e f g h i j k l m n o p q r s t u v w x y z

Goodbye

 

2) Run large-LM PocketSphinx on the recording in one remote ssh session; run top in another (%CPU, %Mem)

pocketsphinx_continuous -infile test16k.wav 2>&1 | tee ./psphinx.log

(note power consumption in amps during and after recognition)

 

3) Extract Performance Data (xRT)

./perf.sh >result_Pi<model>_file.log
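perf.sh itself is in the GitHub repo linked above. Purely as a hedged illustration, a total-CPU-xRT extraction could look like the sketch below, assuming the decoder's summary log lines name the pass (fwdtree, fwdflat, bestpath) and carry a "CPU ... xRT" figure; verify the exact line format against your own psphinx.log and the repo's script.

    import re, sys

    # Sum the CPU xRT figures pocketsphinx logs for its three decoding passes.
    # Assumed line shape: "... TOTAL fwdtree 0.47 CPU 0.054 xRT" (verify locally).
    total = 0.0
    for line in open(sys.argv[1]):
        if 'CPU' in line and any(p in line for p in ('fwdtree', 'fwdflat', 'bestpath')):
            m = re.search(r'([\d.]+)\s*xRT', line)
            if m:
                total += float(m.group(1))
    print('Total CPU xRT: %.3f' % total)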

 

4) Extract recognized phrases

tail -14 ./psphinx.log >reco_Pi<model>_file.log

 

5) Run PocketSphinx from the microphone with the small LM and speak 10 in-model phrases (a sketch of the decode loop follows the phrase list below)

  python mymain.py

(note power consumption in amps during and after the program)

(note %CPU %Mem from top during program execution)

Speak:

Hello

What Time is it

Drive Forward Slowly

How long have you been running

Turn forty five degrees left

Go backward quickly

Is it going to rain

Spin

Stop now

sudo shutdown minus H now
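The author's actual mymain.py is in the GitHub repo linked above. As a rough sketch only, a makevoicedemo-style microphone decode loop looks roughly like the following; the model paths and file names are assumptions, not the author's configuration:

    import pyaudio
    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')  # assumed path
    config.set_string('-lm', 'small.lm')    # small-LM mode; use '-jsgf' for grammar mode
    config.set_string('-dict', 'small.dic')
    decoder = Decoder(config)

    # 16 kHz, 16-bit mono audio from the USB microphone
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=1024)

    decoder.start_utt()
    in_speech = False
    while True:
        buf = stream.read(1024)
        decoder.process_raw(buf, False, False)
        if decoder.get_in_speech() != in_speech:
            in_speech = decoder.get_in_speech()
            if not in_speech:               # silence after speech: utterance ended
                decoder.end_utt()
                hyp = decoder.hyp()
                if hyp is not None:
                    print('Heard: %s' % hyp.hypstr)
                decoder.start_utt()         # rearm for the next utterance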

 

6) Copy terminal output to LMsmall_Pi<model>.log

 

7) Extract Performance Data (xRT)

./perf.sh >result_Pi<model>_10.txt

 

8) Run PocketSphinx from the microphone with the medium JSGF grammar and speak 10 in-grammar phrases.

(note power consumption in amps during and after the program)

(note %CPU %Mem from top during program execution)

Speak:

Hello

What Time is it

Drive Forward Slowly

How long have you been running

Turn forty five degrees left

Go backward quickly

Is it going to rain

Spin

Stop

sudo shutdown minus H now

 

9) Copy terminal output to jsgf_Pi<model>.log

 

 

Detailed Test Results:

 

1) pocketsphinx_continuous -infile test16k.wav

Pi B+: top 92-98% CPU 17% memory, 0.40A at 5.02V (+0.08A)

2 word substitutions, 1 deletion, 2 insertions = 5 errors / 67 words

= 7% Word Error Rate (WER)

Total CPU xRT: 5.234 (sum of fwdtree, fwdflat, and bestpath)

 

Pi 3: top 100% CPU 7% memory, 0.49A (+0.18A)

2 word substitutions, 1 deletion, 2 insertions = 5 errors / 67 words

= 7% Word Error Rate (WER)

Total CPU xRT: 2.168 (sum of fwdtree, fwdflat, and bestpath)

 

2) pocketsphinx python using microphone and small LM

Pi B+: top 90% CPU 5% memory, 0.45A (+0.13A)

4 word substitutions, 9 deletions, 0 insertions = 13 errors / 35 words

= 37% Word Error Rate (WER)

Total CPU xRT: 3.075 (sum of fwdtree, fwdflat, and bestpath)

Pi 3: top 100% CPU 3% memory, 0.49A (+0.18A)

0 word substitutions, 0 deletions, 0 insertions = 0 errors / 35 words

= 0% Word Error Rate (WER)

Total CPU xRT: 0.826 (sum of fwdtree, fwdflat, and bestpath)
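The headline speedups in the summary above follow directly from these xRT figures (lower xRT means faster decoding):

    print(round(5.234 / 2.168, 2))   # file input, large LM:  2.41x, reported as 2.4x
    print(round(3.075 / 0.826, 2))   # microphone, small LM:  3.72x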

 

3) pocketsphinx python using microphone input and medium grammar:

Pi B+: 1 word substitution, 0 deletions, 0 insertions = 1 error / 34 words

= 3% Word Error Rate (WER)

 

Pi 3: 0 word substitutions, 0 deletions, 0 insertions = 0 errors / 34 words

= 0% Word Error Rate (WER)

 

 

Impact of Findings:

 

There has been a long-standing debate between product developers (pragmatists) and speech interface researchers over the role of grammars in speech interfaces. When processing resources (cycles and memory) are scarce or slow, agreeing on a limited set of words and constraining phrase complexity (a speech grammar) can enable a successful speech interface.

 

Grammar-based speech interfaces for complex human-machine interaction are arduous to develop and tune. They also tend to be fragile, with wide disparity in user success. From a software-coupling standpoint (a liability), grammar-based speech interfaces force the developer to duplicate effort to keep the grammar and the result interpretation tightly in sync.

 

Unconstrained, continuous speech interfaces using language-model-based recognition require much higher performance from processing resources, but enable more robust interfaces with much greater utility.

 

For a simple product of limited functionality (or a personal robot, in my case), a grammar-based speech command interface is feasible on the Raspberry Pi B+.

 

Since Pi B+ language-model recognition is three to five times slower than real time, small language-model speech cannot be used as a control interface, and large-vocabulary, unconstrained (language-model-based) recognition is totally out of the question.

 

The release of the Raspberry Pi 3 Model B enables versatile speech interfaces in autonomous personal robots. The results of this test show that the Pi 3 (using only one of four cores) can keep up with small language-model speech, freeing the developer from the drudgery of grammar development, and expanding the speech interface capability beyond commands to enable the beginnings of dialog.

 

The Pi 3 can even enable surprisingly comfortable human-robot interaction using large vocabulary, unconstrained, continuous speech, with tremendous reserve processing resources available.

 

Sixteen years ago, my single-board-computer robot ran programs in 32K of memory at 1 MHz, had only a one-way interface (Morse code to human), and had a situational-awareness range of 12 inches.

 

Today, with the Raspberry Pi 3, there is processing power for two-way communication in human languages, local situational awareness through vision, and global situational awareness - all in the same one-cubic-foot robot.

 

 

About The Author:

 

Alan is a Sr. Software Quality Assurance Engineer for the BroadSoft Hospitality Group, which provides cloud-based communications for hotels and resorts worldwide. Formerly, Alan was a Sr. Development Engineer for IBM's Telephony Speech and Contact Center Services, working with speech recognition, speaker identification, and interactive voice response (IVR) technologies.

 

 

 

Acknowledgements:

 

element14.com: sponsor of the Raspberry Pi 3 RoadTest; provided the Pi 3 for this test

Festival Speech Synthesis System: Copyright Univ. of Edinburgh, and Carnegie Mellon Univ.

PocketSphinx 5prealpha:

Authors: Alan W Black, Evandro Gouvea, David Huggins-Daines,

Alexander Solovets, Vyacheslav Klimov

Assistance: Nickolay V. Shmyrev - you rock guy!

pocket_sphinx_listener.py, main.py: Neil Davenport

http://makezine.com/projects/use-raspberry-pi-for-voice-control/

git clone https://github.com/bynds/makevoicedemo
