Optimizing Machine Learning on MaaXBoard Part 1: Delegates

zebular13
27 May 2021

I've been playing around with the eIQ tools on the Yocto image for MaaXBoard recently. Until now, I've mostly focused on Debian, but the eIQ layer for MaaXBoard Yocto was released earlier this year and it makes a number of machine learning tools available on the Yocto image.

What is eIQ? It's NXP's custom collection of neural network compilers, libraries, and hardware abstraction layers. The following five inference engines are supported by eIQ:

  • ArmNN
  • TensorFlow Lite
  • ONNX Runtime
  • PyTorch
  • OpenCV


eIQ supports all of the i.MX processors based on Arm Cortex-A and Arm Cortex-M cores, as well as GPUs, DSPs, and NPUs when available. It's easy to build into your Yocto image for MaaXBoard, and I recently wrote the following tutorials so you can get started:

  • Getting Started with Yocto on MaaXBoard
  • Building Your own Yocto for MaaXBoard
  • Running Machine Learning on MaaXBoard's Yocto Image


NXP just released the most recent version of the eIQ tools, as well as a new i.MX Machine Learning User's Guide, in April 2021 (the old guide from May 2020 is here).

To get a sense of the eIQ machine learning tools, I decided to compare the speed of TensorFlow Lite on MaaXBoard running Yocto vs. MaaXBoard running Debian.

 

The first thing I did was rerun the original Python benchmark from last year, this time on Yocto. I was pleased to see that it was noticeably faster on Yocto than on Debian.

  • Debian: 364.3
  • Yocto: 262.8
  • Raspberry Pi 3: 1108.5
  • Raspberry Pi 4 (8GB): 178.9
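To make the comparison concrete, here's a quick back-of-the-envelope calculation of the relative speeds from the benchmark numbers above (a throwaway sketch, not part of the eIQ tooling):

```python
# Benchmark times from the table above (lower is faster).
times = {
    "Debian": 364.3,
    "Yocto": 262.8,
    "Raspberry Pi 3": 1108.5,
    "Raspberry Pi 4 (8GB)": 178.9,
}

baseline = times["Debian"]
for platform, t in times.items():
    # Speedup relative to MaaXBoard Debian (values > 1 are faster).
    print(f"{platform}: {baseline / t:.2f}x vs. Debian")
```

So the same board comes out about 1.39x faster on Yocto than on Debian.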

Why is it faster? The answer is delegates.

 

Delegates.

What are delegates? Delegates do exactly what they say they do: they delegate different ops to different parts of the hardware. When you build Yocto with eIQ, you specify which NXP architecture you're building for, be it the i.MX 8M Plus with an NPU, the i.MX 8M Mini with quad CPUs, or even the i.MX RT with its Cortex-M7 and Cortex-M4 microcontrollers. Since it knows which hardware you're building for, it can automatically apply delegates when machine learning inference is run. For instance, if your model has a lot of matrix multiplies and your hardware has a GPU, it will automatically know to send those ops to the GPU.

 

One of the great features of delegates in eIQ is that even if you don't have specialized hardware like a GPU or NPU, the delegate acts essentially as an optimizing compiler. It's also applied on a per-op basis, so it still helps even when not all of the operations can be delegated. The delegate gets a list of nodes that are going to be executed in sequence, and it looks at the input tensor shapes as well as the ops for each node.

The delegate then chooses which ops to delegate. For instance, maybe the delegate only supports add operations and not multiplies.

The runtime will partition the nodes into delegated vs. non-delegated ops. The delegate can fuse as many ops as possible to optimize inference speed. The one downside of fusing ops is that the model's accuracy may suffer slightly.
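The per-op partitioning described above can be sketched in a few lines of Python (an illustrative toy, not the actual TensorFlow Lite or eIQ implementation): given the runtime's ordered node list and the set of ops a delegate claims to support, the nodes split into contiguous delegated and non-delegated runs.

```python
# Toy sketch of per-op delegate partitioning (illustrative only;
# not the real TensorFlow Lite / eIQ code).

def partition_nodes(nodes, supported_ops):
    """Split an ordered op list into contiguous runs, each marked
    delegated (True) or left on the CPU (False)."""
    partitions = []
    for op in nodes:
        delegated = op in supported_ops
        if partitions and partitions[-1][0] == delegated:
            partitions[-1][1].append(op)  # extend the current run
        else:
            partitions.append((delegated, [op]))  # start a new run
    return partitions

# Example: a delegate that supports adds but not multiplies, as in the text.
graph = ["ADD", "ADD", "MUL", "ADD", "MUL", "MUL"]
for delegated, ops in partition_nodes(graph, {"ADD"}):
    print(("delegated" if delegated else "cpu      "), ops)
```

Each delegated run can then be handed to the accelerator as one fused unit, which is where the speedup (and the slight accuracy risk) comes from.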

 

In the example below, I'm running the same label_image.py file on both Debian and Yocto, as well as on a Raspberry Pi running the 64-bit Raspberry Pi OS. Yocto has a far longer warm-up time as it sets up the NNAPI delegate, but once it's set up, inference is roughly 40% faster (127.1 ms vs. 208.8 ms):

Debian:

(tf) ebv@maaxboard:~$ python label_image.py
Warm-up time: 210.4 ms
Inference time: 208.8 ms
0.874510: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.011765: bulletproof vest
0.007843: bow tie

Yocto:

root@maaxboard:/usr/bin/tensorflow-lite-2.1.0/examples# python3 label_image.py
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate.
Warm-up time: 6157.1 ms
Inference time: 127.1 ms
0.670588: military uniform
0.125490: Windsor tie
0.039216: bow tie
0.027451: mortarboard
0.019608: bulletproof vest

Raspberry Pi 4 (8GB):

pi@raspberrypi:~ $ python3 label_image.py
Warm-up time: 101.4 ms
Inference time: 96.0 ms
0.658824: military uniform
0.149020: Windsor tie
0.039216: bow tie
0.027451: mortarboard
0.019608: bulletproof vest

(The Raspberry Pi 4 outperforms both of them in spite of not making use of delegates, because its quad Cortex-A72s are simply faster cores: 3-way superscalar out-of-order execution vs. 2-way superscalar in-order execution on the MaaXBoard's Cortex-A53s.)

 

Since most real-world image recognition use cases run on streaming images or video, the warm-up time is quickly amortized, and the NNAPI delegate provides a significant improvement.
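As a rough illustration using the label_image numbers above, you can work out when the NNAPI warm-up cost pays for itself (simple arithmetic, no eIQ code involved):

```python
# When does Yocto's big NNAPI warm-up cost pay for itself?
# Times (ms) from the label_image runs above.
debian_warmup, debian_frame = 210.4, 208.8
yocto_warmup, yocto_frame = 6157.1, 127.1

# Break-even when:
#   yocto_warmup + n * yocto_frame = debian_warmup + n * debian_frame
n = (yocto_warmup - debian_warmup) / (debian_frame - yocto_frame)
print(f"break-even after ~{n:.0f} frames")  # ~73 frames
```

At 25 fps that's under three seconds of video, after which Yocto pulls ahead for good.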

 

But why does it delegate only on Yocto and not on MaaXBoard Debian or Raspberry Pi? The TensorFlow Lite Python API that is built into eIQ is set up to automatically delegate to NNAPI. Without eIQ, you would have to manually build the Arm NN delegate (or another delegate library) and then explicitly load it from within your Python application with code like the following:

import tflite_runtime.interpreter as tflite

# Load the Arm NN delegate library (built separately)
armnn_delegate = tflite.load_delegate(
    library="/usr/lib/libarmnnDelegate.so",
    options={"backends": "VsiNpu, CpuAcc, CpuRef", "logging-severity": "info"})

# Delegate all ArmNN-supported operations in the model to ArmNN
interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[armnn_delegate])

# Allocate tensors before running inference
interpreter.allocate_tensors()

 

This presentation has more information about what's involved in using delegates with the Python TensorFlow Lite API if you don't have eIQ. If you do have eIQ, though, good news: eIQ actually includes several delegates, so it's possible to compare them. Below is a block diagram of the delegates that are available to TensorFlow Lite and how they fit together:

 

In my next blog, I'll cover the XNNPACK delegate and how it compares to NNAPI, as well as benchmark the C++ TensorFlow Lite API on MaaXBoard Yocto vs. Debian.
