The world of machine learning, and more specifically deep learning, is a rapidly growing field, and the market is expanding quickly as deep learning moves to the edge. In my microcosm at NXP, I see the deep learning customer base increasing dramatically as more and more engineers build applications that include some form of vision- or voice-based machine learning. The number of deep learning frameworks, tools and other capabilities that allow developers to build and deploy neural network models is also expanding.
One example of such a tool is the Glow neural network (NN) model compiler. Alongside the proliferation of deep learning frameworks such as PyTorch, NN compilers provide optimizations that accelerate inferencing on a range of hardware platforms. In May 2018, Facebook introduced Glow (the graph lowering compiler) as an open source community project, and it has evolved significantly over the last two years thanks to the efforts of more than 130 contributors worldwide.
Recently, we rolled out our official support for the Glow compiler, and we're very excited about the performance and memory benefits it's delivering on our devices. We have tightly integrated Glow into our MCUXpresso SDK, which packages the Glow compiler and quantization tools into an easy-to-use installer, along with detailed documentation and labs to get up and running quickly with your own models.
Glow's Flexible Functionality
As an NN compiler, Glow takes in a computation graph and generates optimized machine code in two phases. In the first phase, it optimizes the operators and layers of the model using standard compiler techniques such as kernel fusion, lowering of complex operations to simple kernels, and transpose elimination. In the second, or backend, phase of model compilation, Glow uses LLVM modules to enable target-specific optimizations. Glow supports ahead-of-time (AOT) compilation, where compilation is performed offline to generate an object file (called a Glow bundle) that is later linked with the user's application code. Because all unnecessary overhead is eliminated when this object file is generated, both the number of computations and the memory overhead are reduced, which is ideal for deploying on memory-constrained, low-cost microcontrollers.
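To make the bundle workflow concrete, here is a minimal sketch of how application code might call into a Glow bundle compiled ahead of time from a hypothetical model named `model` (for example, with something like `model-compiler -backend=CPU -model=model.onnx -emit-bundle=bundle`). The macro and symbol names below are illustrative; the actual names, including the input and output placeholder offsets, come from the header file that Glow generates for your specific model.

```c
#include <stdint.h>
#include <string.h>
#include "model.h"  /* header generated by the Glow AOT compiler (name is illustrative) */

/* Memory regions whose sizes and alignment are defined in the generated header.
 * On an MCU these would typically be placed in fast on-chip SRAM. */
__attribute__((aligned(MODEL_MEM_ALIGN)))
static uint8_t constantWeight[MODEL_CONSTANT_MEM_SIZE]; /* trained weights, typically loaded from the generated weights file */
__attribute__((aligned(MODEL_MEM_ALIGN)))
static uint8_t mutableWeight[MODEL_MUTABLE_MEM_SIZE];   /* holds the model's inputs and outputs */
__attribute__((aligned(MODEL_MEM_ALIGN)))
static uint8_t activations[MODEL_ACTIVATIONS_MEM_SIZE]; /* scratch memory for intermediate tensors */

int run_inference(const float *input, size_t input_bytes, float *output, size_t output_bytes)
{
    /* Copy the input tensor to its placeholder offset inside the mutable region. */
    memcpy(mutableWeight + MODEL_input, input, input_bytes);

    /* Run the compiled graph; this entry point is emitted by Glow for the bundle. */
    int status = model(constantWeight, mutableWeight, activations);

    /* Copy the result out of the mutable region. */
    memcpy(output, mutableWeight + MODEL_output, output_bytes);
    return status;
}
```

Because the code, weights and tensor offsets are all resolved at compile time, there is no runtime interpreter to link in, which is where the memory savings on small MCUs come from.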
Target-Specific Optimizations
While any device in our i.MX RT series will run a Glow-compiled model, we started our testing on the i.MX RT1060 MCU because we also have TensorFlow running on this device, which allowed a direct performance comparison. We also chose the i.MX RT685 MCU because it is a new device and the only one in our i.MX RT series with a DSP optimized for processing neural network operators. The i.MX RT1060 MCU contains a 600 MHz Arm Cortex-M7 core and 1 MB of SRAM. The i.MX RT685 MCU contains a 600 MHz Cadence Tensilica HiFi 4 DSP core paired with a 300 MHz Cortex-M33 core and 4.5 MB of on-chip SRAM.
The standard version of Glow from GitHub is device agnostic: thanks to its LLVM backend, it can cross-compile bundles for the basic architectures of interest. For example, to cross-compile a bundle for the Arm Cortex-M7 core, pass -target=arm -mcpu=cortex-m7 on the command line. NXP has taken this further by using Arm CMSIS-NN to leverage the full capability of the Cortex-M7 as well as the memory subsystem of the i.MX RT1060 device. CMSIS-NN is an Arm-developed library for Arm Cortex-M0, -M3, -M4, -M7 and -M33 cores that implements standard NN operations such as convolution, fully connected layers, pooling and activation. Simply add the compilation flag -use-cmsis when building quantized bundles, and performance increases significantly over the standard compilation. For example, as measured by NXP on a CIFAR-10 model, performance increases by almost 2x when the CMSIS-NN library is used to accelerate NN operations, as sketched below.
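For reference, here is a sketch of what the two invocations might look like when building a bundle from an ONNX model with Glow's model-compiler tool. The model and profile file names are illustrative, and exact flag spellings can vary between Glow releases and NXP's eIQ packaging; -use-cmsis applies to the NXP-enabled quantized build described above.

```
# Floating-point bundle cross-compiled for the Cortex-M7 (illustrative file names)
model-compiler -backend=CPU -model=cifar10.onnx -emit-bundle=bundle \
    -target=arm -mcpu=cortex-m7

# Quantized bundle using a previously captured quantization profile, accelerated with CMSIS-NN
model-compiler -backend=CPU -model=cifar10.onnx -emit-bundle=bundle \
    -target=arm -mcpu=cortex-m7 -load-profile=profile.yml -use-cmsis
```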
The HiFi 4 DSP core on the i.MX RT685 device can also accelerate a wide range of NN operators when Cadence's NN library (NNLib) is used as another LLVM backend for Glow. NNLib is similar to CMSIS-NN, except that it provides a much more comprehensive set of hand-tuned operators optimized for the HiFi 4 DSP. For the same CIFAR-10 example, this DSP delivers a 25x performance increase compared to the standard Glow implementation.
PyTorch for Embedded Systems
PyTorch can export models directly into the ONNX format for use by Glow. Alternatively, since many well-known models were created in other formats (e.g., TensorFlow), there are open source tools to convert them to ONNX. The most commonly used are MMdnn, a set of tools supported by Microsoft to help users inter-operate among different deep learning frameworks, and tf2onnx, which converts TensorFlow models to ONNX. Furthermore, NXP has upstreamed to the Glow community support for importing TensorFlow Lite models directly into Glow. More recently, Glow can be accessed directly through PyTorch, allowing users to build and compile their models in the same development environment, thereby eliminating steps and simplifying the compilation process.
However, because of its broad use in datacenters by companies such as Facebook, people have questioned PyTorch's ability to serve as a framework for embedded MCUs. With Glow becoming directly accessible from PyTorch, is there reason to be concerned that PyTorch, and hence Glow, is not targeted at MCUs? The short answer is no, especially given the AOT implementation of Glow.
To explain further: it is a fair impression that PyTorch itself isn't targeted at MCUs. It's a community project, and no one has yet stepped up to develop and maintain that support. For obvious reasons, Facebook does not use PyTorch on MCUs, but the embedded community is welcome to contribute and add end-to-end support for MCUs and embedded platforms in general. I suspect this is only a matter of time given PyTorch's growing popularity, especially among academic and research users. According to statistics [1], PyTorch's dominance is strongest at vision and language conferences (outnumbering TensorFlow by 2:1 and 3:1 respectively), and PyTorch has also become more popular than TensorFlow at general machine learning conferences like ICLR and ICML. Eventually, some of these researchers will migrate into the industrial space and adapt PyTorch for the edge computing environment.
To specifically address whether PyTorch is a good choice for MCUs: since it can generate ONNX models that Glow can compile, there are few restrictions on the target processing platform. And with Glow as an extension of PyTorch, it becomes even easier to generate bundles; users can generate them directly from a Python script, without having to first export ONNX models. Soon, NXP will release an application note that provides guidance on creating and deploying a model using PyTorch and Glow.
As mentioned above, we've recently rolled out our official support for Glow, tightly integrated into our MCUXpresso SDK with several project examples. We've also packaged the Glow compiler and quantization tools into an easy-to-use installer with detailed documentation and labs to help you get up and running quickly with your own models. With such significant performance and memory benefits, this compiler will be a great boon for embedded system developers deploying machine learning with the i.MX RT series of crossover MCUs from NXP.
Want to Learn More?
Read the press release
Explore our eIQ Machine Learning software development environment
Visit Medium.com/pytorch
Register for our upcoming Glow webinar