Hi all! Hope everyone is well and safe.
In this post, I'm going to explain how we run a TensorFlow model on a microcontroller. This will be a bit boring, not many pictures, if any, but bear with me. There are a lot of examples!
TensorFlow Lite for Microcontrollers
TF Lite for Microcontrollers is designed to run machine learning models on microcontrollers and other devices with only a few kilobytes of memory. The core runtime fits in 16 KB on an Arm Cortex-M3 and can run basic models.
Why microcontrollers are important
They are typically small, low-powered computing devices that are embedded within hardware that requires basic computation. By bringing ML to tiny devices, we can boost their intelligence.
This also preserves privacy since no data gets out of the device.
Supported platforms
TF Lite for Microcontrollers is written in C++ 11 and requires a 32-bit platform. It has been extensively tested with many processors based on the Arm Cortex-M series and has been ported to other architectures, including the ESP32.
The framework is available as an Arduino library, and it can also generate projects for development environments such as Mbed.
Since it is open source, it can be included in any C++ 11 project.
The TensorFlow site lists many platforms, but not the RP2040. It can run there, though. We have several options, like using the Mbed platform on the Arduino IDE or just plain C++ using the Pico SDK.
Workflow
To deploy and run a TensorFlow model on a microcontroller, we need a few steps:
- Train a model
- Generate a small TensorFlow model that can fit the target device and contains the supported operations
- Convert to a TensorFlow Lite model using the TensorFlow Lite converter
- Convert to a C byte array using standard tools to store it in a read-only program memory on device
- Run inference on the device using the C++ library and process the results
Limitations
TensorFlow Lite for Microcontrollers is designed for the specific constraints of microcontroller development. On a more powerful board like a Raspberry Pi, the standard TensorFlow Lite framework might be easier to integrate.
We should consider the following limitations:
- Support for a limited subset of TensorFlow operations
- Support for a limited set of devices
- Low-level C++ API requiring manual memory management
- On-device training is not supported
Model optimization and deployment
As I've already explained in a previous post - TinyML - when deploying a model to a microcontroller, because of its constraints, we need to optimize the model so it can run on the Arm Cortex-M0+. This process is called model quantization.
Model quantization converts the model's weights and biases from 32-bit floating-point values to 8-bit integer values. The pico-tflmicro library, a port of TensorFlow Lite for Microcontrollers to the RP2040's Pico SDK, includes Arm's CMSIS-NN library, which provides optimized kernels for operations on quantized 8-bit weights on Arm Cortex-M.
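Just to give an intuition of what this means in practice, here is a tiny sketch of the affine mapping TFLite uses between float and int8 values. This is plain NumPy with made-up numbers, not the converter's actual algorithm, purely for illustration:

import numpy as np

# Made-up float32 "weights", just to illustrate the mapping
weights = np.array([-0.42, 0.0, 0.17, 0.91], dtype=np.float32)

# Affine quantization: real_value ~= (int8_value - zero_point) * scale
scale = float(weights.max() - weights.min()) / 255.0   # spread the float range over 256 int8 steps
zero_point = int(round(-128 - weights.min() / scale))  # int8 value that represents 0.0

q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print(q)            # e.g. [-128  -47  -14  127] -> what actually gets stored on the device
print(dequantized)  # close to the original weights, but not exact

So the device only stores and computes with small integers, at the cost of a little precision.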
Quantization aware training
To quantize the model, we can use TensorFlow's Quantization Aware Training (QAT) feature, which makes it easy to go from a floating-point model to a quantized one.
Just so you know, there are two forms of quantization:
- post-training quantization
- quantization aware training
I'm going to demonstrate quantization aware training because it offers better model accuracy and it's the one I've been using.
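For contrast, post-training quantization doesn't touch the training code at all; it's just a converter setting applied to an already trained model. A rough sketch, assuming the baseline_model and a representative_data_gen calibration generator like the ones used later in this post:

import tensorflow as tf

# Post-training quantization: quantize an already-trained float model
converter = tf.lite.TFLiteConverter.from_keras_model(baseline_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen  # a few calibration samples

ptq_tflite_model = converter.convert()

with open("ptq_model.tflite", "wb") as f:
    f.write(ptq_tflite_model)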
Quantization
When using QAT, the following steps are taken:
- Train a model
- Fine-tune the pre-trained model with quantization aware training
- Train and evaluate the quantized model
- Create the quantized tflite model
- Compare the accuracy of the tflite model against the TensorFlow model
- And finally, deploy to the device
Train a model
I'm not going to explain here how to train a model and all the specifics; I'll just place some code so you can understand the steps. Just imagine we've done all the necessary Python coding. This is from an audio classifier for several sound classes.
# Normalization layer adapted to the training data
norm_layer = tf.keras.layers.experimental.preprocessing.Normalization()
norm_layer.adapt(cached_ds.map(lambda x, y, z: tf.reshape(x, input_shape)))

# define a sequential 8 layer model
baseline_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.experimental.preprocessing.Resizing(32, 32, interpolation="nearest"),
    norm_layer,
    tf.keras.layers.Conv2D(8, kernel_size=(8, 8), strides=(2, 2), activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(50, activation='softmax')
])

print(baseline_model.summary())

METRICS = [
    "accuracy",
]

baseline_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=METRICS,
)

# keep the learning rate constant for 100 epochs, then decay it
def scheduler(epoch, lr):
    if epoch < 100:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

callbacks = [
    tf.keras.callbacks.EarlyStopping(verbose=1, patience=25),
    tf.keras.callbacks.LearningRateScheduler(scheduler)
]

# train the model
EPOCHS = 250
history = baseline_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks,
)
and the result will be something like this:
Model: "sequential"
_________________________________________________________________
Layer (type)                  Output Shape            Param #
=================================================================
resizing (Resizing)           (None, 32, 32, 1)       0
normalization (Normalization) (None, 32, 32, 1)       3
conv2d (Conv2D)               (None, 13, 13, 8)       520
max_pooling2d (MaxPooling2D)  (None, 6, 6, 8)         0
flatten (Flatten)             (None, 288)              0
dropout (Dropout)             (None, 288)              0
dense (Dense)                 (None, 50)               14450
=================================================================
Total params: 14,973
Trainable params: 14,970
Non-trainable params: 3
_________________________________________________________________
None
Epoch 1/250
2022-07-31 22:37:29.104384: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2022-07-31 22:37:29.880350: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
557/557 [==============================] - 3s 3ms/step - loss: 3.4007 - accuracy: 0.1320 - val_loss: 3.4881 - val_accuracy: 0.0980 - lr: 0.0010
Fine-tune the pre-trained model with quantization aware training
Now that we've trained the model, we will apply quantization aware training to it and check the result in the model summary.
We could do this in another script by loading the trained model, since TF allows us to save a model and load it back later.
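For example, at the end of the training script we could save the trained model to disk. The "fine_tuned_model" folder name here is just the one being loaded in the next snippet:

# Save the trained Keras model (SavedModel format) so another script can pick it up later
baseline_model.save("fine_tuned_model")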
Load the previous model and quantize it:
import tensorflow_model_optimization as tfmot

# Load the previously saved model
fine_tune_model = tf.keras.models.load_model("fine_tuned_model")

# Annotate only the Dense and Conv2D layers for quantization
def apply_qat_to_dense_and_cnn(layer):
    if isinstance(layer, (tf.keras.layers.Dense, tf.keras.layers.Conv2D)):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

annotated_model = tf.keras.models.clone_model(
    fine_tune_model,
    clone_function=apply_qat_to_dense_and_cnn,
)

quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)
quant_aware_model.summary()
In the model summary, we can see that this is the quantized model because the quantized layers have "quant" prefixed to their names. For optimization purposes, we can quantize just some layers instead of all of them, as done here with the Dense and Conv2D layers.
Model: "model_1"
_________________________________________________________________
Layer (type)                       Output Shape            Param #
=================================================================
input_1 (InputLayer)               [(None, 124, 129, 1)]   0
resizing (Resizing)                (None, 32, 32, 1)       0
normalization (Normalization)      (None, 32, 32, 1)       3
quant_conv2d (QuantizeWrapperV2)   (None, 13, 13, 8)       539
max_pooling2d (MaxPooling2D)       (None, 6, 6, 8)         0
flatten (Flatten)                  (None, 288)              0
quant_dropout (QuantizeWrapperV2)  (None, 288)              1
quant_dense_1 (QuantizeWrapperV2)  (None, 1)                294
=================================================================
Total params: 837
Trainable params: 809
Non-trainable params: 28
Train the quantized model
quant_aware_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=METRICS,
)

EPOCHS = 1
quant_aware_history = quant_aware_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS
)
59/59 [==============================] - 1s 4ms/step - loss: 2.1875 - accuracy: 0.5592 - val_loss: 3.3309 - val_accuracy: 0.5111
Create the quantized tflite model
For this, we will use the TFLiteConverter to create and save a tflite model:
converter = tf.lite.TFLiteConverter.from_keras_model(quant_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data_gen():
    for input_value, output_value in train_ds.unbatch().batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to int8 (APIs added in r2.3)
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model_quant = converter.convert()

with open("tflite_model.tflite", "wb") as f:
    f.write(tflite_model_quant)
Compare the accuracy of the tflite model against the TensorFlow model
Because TF also supports loading TFLite models, we can verify the functionality of the quantized model and compare the accuracy with the normal model.
import numpy as np
import tensorflow as tf
# with full TensorFlow installed this alias gives us tflite.Interpreter
# (the standalone tflite_runtime package would work too)
from tensorflow import lite as tflite

# Load the interpreter and allocate tensors
interpreter = tflite.Interpreter("tflite_model.tflite")
interpreter.allocate_tensors()

# Load input and output details
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Set quantization values
input_scale, input_zero_point = input_details["quantization"]
output_scale, output_zero_point = output_details["quantization"]

# Calculate the number of correct predictions
correct = 0
test_ds_len = 0

# Loop through the entire test set
for x, y in test_ds.unbatch():
    # original shape is [124, 129, 1], expand to [1, 124, 129, 1]
    x = tf.expand_dims(x, 0).numpy()

    # quantize the input value
    if (input_scale, input_zero_point) != (0, 0):
        x = x / input_scale + input_zero_point
    x = x.astype(input_details['dtype'])

    # add the input tensor to the interpreter
    interpreter.set_tensor(input_details["index"], x)

    # run the model
    interpreter.invoke()

    # Get output data from the model and convert to fp32
    output_data = interpreter.get_tensor(output_details["index"])
    output_data = output_data.astype(np.float32)

    # Dequantize the output
    if (output_scale, output_zero_point) != (0.0, 0):
        output_data = (output_data - output_zero_point) * output_scale

    # convert output to category
    if output_data[0][0] >= 0.5:
        category = 1
    else:
        category = 0

    # add 1 if category == y
    correct += 1 if category == y.numpy() else 0
    test_ds_len += 1

accuracy = correct / test_ds_len
print(f"Accuracy for quantized model is {accuracy*100:.2f}% (to 2 D.P) on test set.")
and the accuracy for this model is:
Accuracy for quantized model is 50.63% (to 2 D.P) on test set.
It's bad. This was on my computer; in Google Colab, the accuracy is 94.92%... Strange...
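To actually compare against the original TensorFlow model, we can evaluate the float Keras model on the same test set. A minimal sketch, assuming the fine_tune_model loaded earlier (still compiled with an accuracy metric) and the same test_ds:

# Evaluate the original float Keras model for comparison
loss, acc = fine_tune_model.evaluate(test_ds, verbose=0)
print(f"Accuracy for the float model is {acc*100:.2f}% on test set.")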
Deploy to the device
Because the RP2040 MCU on the Pico does not have a built-in file system, we cannot use the .tflite file directly on the board.
Using Linux's xxd command, we can convert the .tflite file to a .h file, which can then be compiled into the inference application.
echo "alignas(8) const unsigned char tflite_model[] = {" > tflite_model.h cat tflite_model.tflite | xxd -i >> tflite_model.h echo "};" >> tflite_model.h
and the .h file will look something like this:
alignas(8) const unsigned char tflite_model[] = {
0x20, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
0x14, 0x00, 0x20, 0x00, 0x1c, 0x00, 0x18, 0x00, 0x14, 0x00, 0x10, 0x00,
0x0c, 0x00, 0x00, 0x00, 0x08, 0x00, 0x04, 0x00, 0x14, 0x00, 0x00, 0x00,
0x1c, 0x00, 0x00, 0x00, 0x88, 0x00, 0x00, 0x00, 0xe0, 0x00, 0x00, 0x00,
0x68, 0x05, 0x00, 0x00, 0x78, 0x05, 0x00, 0x00, 0xa0, 0x0f, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00,
0xee, 0xf9, 0xff, 0xff, 0x0c, 0x00, 0x00, 0x00, 0x1c, 0x00, 0x00, 0x00,
0x40, 0x00, 0x00, 0x00, 0x0f, 0x00, 0x00, 0x00, 0x73, 0x65, 0x72, 0x76,
0x69, 0x6e, 0x67, 0x5f, 0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x00,
0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x94, 0xff, 0xff, 0xff,
0x0c, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x0d, 0x00, 0x00, 0x00,
0x71, 0x75, 0x61, 0x6e, 0x74, 0x5f, 0x64, 0x65, 0x6e, 0x73, 0x65, 0x5f,
And that's it.
Now, by using Arduino Mbed or the Pico SDK, we can create an application that will run inference on the board.
FYI, because I'm going to use an already pre-trained model from BirdNET, and they already provide a .tflite model, I will only have to convert it to a .h file and program the Pico to run inference on it.
But that will not be an easy task - record audio, split the audio, and run the inference.
That will be my next topic.
Resources
https://www.tensorflow.org/model_optimization/guide/quantization/training_example
https://www.tensorflow.org/lite/microcontrollers
https://blog.tensorflow.org/2021/09/TinyML-Audio-for-everyone.html