Hi all! Hope everyone is well and safe.
In this post, I'm going to explain how we run a TensorFlow model on a microcontroller. This will be a bit boring, not many pictures, if any, but bear with me. There are a lot of examples!
TensorFlow Lite for Microcontrollers
TF Lite for Microcontrollers is designed to run machine learning models on microcontrollers and other devices with only a few kilobytes of memory. The core runtime fits in 16 KB on an Arm Cortex-M3 and can run basic models.
Why microcontrollers are important
They are typically small, low-powered computing devices that are embedded within hardware that requires basic computation. By bringing ML to tiny devices, we can boost their intelligence.
This also preserves privacy since no data gets out of the device.
Supported platforms
TF Lite for Microcontrollers is written in C++ 11 and requires a 32-bit platform. It has been extensively tested with many processors based on the Arm Cortex-M series and has been ported to other architectures, including the ESP32.
The framework is available as an Arduino library, and it can also generate projects for development environments such as Mbed.
Since it is open source, it can be included in any C++ 11 project.
The TensorFlow site lists many platforms, but not the RP2040. It can run there, though. We have several options, like using the Mbed platform on the Arduino IDE or just plain C++ using the Pico SDK.
Workflow
To deploy and run a TensorFlow model on a microcontroller, we need a few steps:
- Train a model
- Generate a small TensorFlow model that can fit the target device and contains the supported operations
- Convert to a TensorFlow Lite model using the TensorFlow Lite converter
- Convert to a C byte array using standard tools to store it in a read-only program memory on device
- Run inference on the device using the C++ library and process the results
Limitations
TensorFlow Lite for Microcontrollers is designed for the specific constraints of microcontroller development. On a more powerful board like a Raspberry Pi, the standard TensorFlow Lite framework might be easier to integrate.
We should consider the following limitations:
- Support for a limited subset of TensorFlow operations
- Support for a limited set of devices
- Low-level C++ API requiring manual memory management
- On-device training is not supported
Model optimization and deployment
As I've already explained in a previous post - TinyML - when deploying a model to a microcontroller, because of its constraints, we need to optimize the model so it can run on the Arm Cortex-M0+. This process is called model quantization.
Model quantization converts the model's weights and biases from 32-bit floating-point values to 8-bit integer values. The pico-tflmicro library, a port of TensorFlow Lite for Microcontrollers to the RP2040's Pico SDK, includes Arm's CMSIS-NN library, which provides optimized kernels for operations on quantized 8-bit weights on Arm Cortex-M.
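Just to give an intuition of what this means in practice, here is a tiny sketch of the affine mapping TFLite uses between float and int8 values. This is plain NumPy with made-up numbers, not the converter's actual algorithm, purely for illustration:

import numpy as np

# Made-up float32 "weights", just to illustrate the mapping
weights = np.array([-0.42, 0.0, 0.17, 0.91], dtype=np.float32)

# Affine quantization: real_value ~= (int8_value - zero_point) * scale
scale = float(weights.max() - weights.min()) / 255.0   # spread the float range over 256 int8 steps
zero_point = int(round(-128 - weights.min() / scale))  # int8 value that represents 0.0

q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print(q)            # e.g. [-128  -47  -14  127] -> what actually gets stored on the device
print(dequantized)  # close to the original weights, but not exact

So the device only stores and computes with small integers, at the cost of a little precision.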
Quantization aware training
To quantize the model, we can use TensorFlow's Quantization Aware Training (QAT) feature, which makes it easy to go from a floating-point model to a quantized one.
Just so you know, there are two forms of quantization:
- post-training quantization
- quantization aware training
I'm going to demonstrate quantization aware training because it offers better model accuracy and it's the one I've been using.
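For contrast, post-training quantization doesn't touch the training code at all; it's just a converter setting applied to an already trained model. A rough sketch, assuming the baseline_model and a representative_data_gen calibration generator like the ones used later in this post:

import tensorflow as tf

# Post-training quantization: quantize an already-trained float model
converter = tf.lite.TFLiteConverter.from_keras_model(baseline_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen  # a few calibration samples

ptq_tflite_model = converter.convert()

with open("ptq_model.tflite", "wb") as f:
    f.write(ptq_tflite_model)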
Quantization
When using QAT, the following steps are taken:
- Train a model
- Fine-tune the pre-trained model with quantization aware training
- Train and evaluate the quantized model
- Create the quantized tflite model
- Compare the accuracy of the tflite model against the TensorFlow model
- And finally, deploy to the device
Train a model
I'm not going to explain here how to train a model and all the specifics; I'll just place some code so you can understand the steps. Just imagine we've done all the necessary Python coding. This is from an audio classifier for several sound classes.
# Normalization layer adapted to the training data
norm_layer = tf.keras.layers.experimental.preprocessing.Normalization()
norm_layer.adapt(cached_ds.map(lambda x, y, z: tf.reshape(x, input_shape)))

# define a sequential 8 layer model
baseline_model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.experimental.preprocessing.Resizing(32, 32, interpolation="nearest"),
    norm_layer,
    tf.keras.layers.Conv2D(8, kernel_size=(8, 8), strides=(2, 2), activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(50, activation='softmax')
])

print(baseline_model.summary())

METRICS = [
    "accuracy",
]

baseline_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=METRICS,
)

# keep the learning rate constant for 100 epochs, then decay it
def scheduler(epoch, lr):
    if epoch < 100:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

callbacks = [
    tf.keras.callbacks.EarlyStopping(verbose=1, patience=25),
    tf.keras.callbacks.LearningRateScheduler(scheduler)
]

# train the model
EPOCHS = 250
history = baseline_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks,
)
and the result will be something like this:
Model: "sequential"
_________________________________________________________________
Layer (type)                  Output Shape            Param #
=================================================================
resizing (Resizing)           (None, 32, 32, 1)       0
normalization (Normalization) (None, 32, 32, 1)       3
conv2d (Conv2D)               (None, 13, 13, 8)       520
max_pooling2d (MaxPooling2D)  (None, 6, 6, 8)         0
flatten (Flatten)             (None, 288)              0
dropout (Dropout)             (None, 288)              0
dense (Dense)                 (None, 50)               14450
=================================================================
Total params: 14,973
Trainable params: 14,970
Non-trainable params: 3
_________________________________________________________________
None
Epoch 1/250
2022-07-31 22:37:29.104384: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2022-07-31 22:37:29.880350: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
557/557 [==============================] - 3s 3ms/step - loss: 3.4007 - accuracy: 0.1320 - val_loss: 3.4881 - val_accuracy: 0.0980 - lr: 0.0010
Fine-tune the pre-trained model with quantization aware training
Now that we've trained the model, we will apply quantization aware training to it and check the result in the model summary.
We could do this in another script by loading the trained model, since TF allows us to save a model and load it back later.
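For example, at the end of the training script we could save the trained model to disk. The "fine_tuned_model" folder name here is just the one being loaded in the next snippet:

# Save the trained Keras model (SavedModel format) so another script can pick it up later
baseline_model.save("fine_tuned_model")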
Load the previous model and quantize it:
import tensorflow_model_optimization as tfmot

# Load the previously saved model
fine_tune_model = tf.keras.models.load_model("fine_tuned_model")

# Annotate only the Dense and Conv2D layers for quantization
def apply_qat_to_dense_and_cnn(layer):
    if isinstance(layer, (tf.keras.layers.Dense, tf.keras.layers.Conv2D)):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

annotated_model = tf.keras.models.clone_model(
    fine_tune_model,
    clone_function=apply_qat_to_dense_and_cnn,
)

quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)
quant_aware_model.summary()
In the model summary, we can see that this is the quantized model because the quantized layers have "quant" prefixed to their names. For optimization purposes, we can quantize just some layers instead of all of them, as done here with the Dense and Conv2D layers.
Model: "model_1"
_________________________________________________________________
Layer (type)                       Output Shape            Param #
=================================================================
input_1 (InputLayer)               [(None, 124, 129, 1)]   0
resizing (Resizing)                (None, 32, 32, 1)       0
normalization (Normalization)      (None, 32, 32, 1)       3
quant_conv2d (QuantizeWrapperV2)   (None, 13, 13, 8)       539
max_pooling2d (MaxPooling2D)       (None, 6, 6, 8)         0
flatten (Flatten)                  (None, 288)              0
quant_dropout (QuantizeWrapperV2)  (None, 288)              1
quant_dense_1 (QuantizeWrapperV2)  (None, 1)                294
=================================================================
Total params: 837
Trainable params: 809
Non-trainable params: 28
Train the quantized model
quant_aware_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=METRICS,
)

EPOCHS = 1
quant_aware_history = quant_aware_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS
)
59/59 [==============================] - 1s 4ms/step - loss: 2.1875 - accuracy: 0.5592 - val_loss: 3.3309 - val_accuracy: 0.5111
Create the quantized tflite model
For this, we will use the TFLiteConverter to create and save a tflite model:
converter = tf.lite.TFLiteConverter.from_keras_model(quant_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data_gen():
    for input_value, output_value in train_ds.unbatch().batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to int8 (APIs added in r2.3)
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model_quant = converter.convert()

with open("tflite_model.tflite", "wb") as f:
    f.write(tflite_model_quant)
Compare the accuracy of the tflite model against the TensorFlow model
Because TF also supports loading TFLite models, we can verify the functionality of the quantized model and compare the accuracy with the normal model.
import numpy as np
import tensorflow as tf
# with full TensorFlow installed this alias gives us tflite.Interpreter
# (the standalone tflite_runtime package would work too)
from tensorflow import lite as tflite

# Load the interpreter and allocate tensors
interpreter = tflite.Interpreter("tflite_model.tflite")
interpreter.allocate_tensors()

# Load input and output details
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Set quantization values
input_scale, input_zero_point = input_details["quantization"]
output_scale, output_zero_point = output_details["quantization"]

# Calculate the number of correct predictions
correct = 0
test_ds_len = 0

# Loop through the entire test set
for x, y in test_ds.unbatch():
    # original shape is [124, 129, 1], expand to [1, 124, 129, 1]
    x = tf.expand_dims(x, 0).numpy()

    # quantize the input value
    if (input_scale, input_zero_point) != (0, 0):
        x = x / input_scale + input_zero_point
    x = x.astype(input_details['dtype'])

    # add the input tensor to the interpreter
    interpreter.set_tensor(input_details["index"], x)

    # run the model
    interpreter.invoke()

    # Get output data from the model and convert to fp32
    output_data = interpreter.get_tensor(output_details["index"])
    output_data = output_data.astype(np.float32)

    # Dequantize the output
    if (output_scale, output_zero_point) != (0.0, 0):
        output_data = (output_data - output_zero_point) * output_scale

    # convert output to category
    if output_data[0][0] >= 0.5:
        category = 1
    else:
        category = 0

    # add 1 if category == y
    correct += 1 if category == y.numpy() else 0
    test_ds_len += 1

accuracy = correct / test_ds_len
print(f"Accuracy for quantized model is {accuracy*100:.2f}% (to 2 D.P) on test set.")
and the accuracy for this model is:
Accuracy for quantized model is 50.63% (to 2 D.P) on test set.
It's bad. This was on my computer; in Google Colab, the accuracy is 94.92%... Strange...
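To actually compare against the original TensorFlow model, we can evaluate the float Keras model on the same test set. A minimal sketch, assuming the fine_tune_model loaded earlier (still compiled with an accuracy metric) and the same test_ds:

# Evaluate the original float Keras model for comparison
loss, acc = fine_tune_model.evaluate(test_ds, verbose=0)
print(f"Accuracy for the float model is {acc*100:.2f}% on test set.")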
Deploy to the device
Because the RP2040 MCU on the Pico does not have a built-in file system, we cannot use the .tflite file directly on the board.
Using Linux's xxd command, we can convert the .tflite file to a .h file, which can then be compiled into the inference application.
echo "alignas(8) const unsigned char tflite_model[] = {" > tflite_model.h cat tflite_model.tflite | xxd -i >> tflite_model.h echo "};" >> tflite_model.h
and the .h file will look something like this:
alignas(8) const unsigned char tflite_model[] = {
0x20, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
0x14, 0x00, 0x20, 0x00, 0x1c, 0x00, 0x18, 0x00, 0x14, 0x00, 0x10, 0x00,
0x0c, 0x00, 0x00, 0x00, 0x08, 0x00, 0x04, 0x00, 0x14, 0x00, 0x00, 0x00,
0x1c, 0x00, 0x00, 0x00, 0x88, 0x00, 0x00, 0x00, 0xe0, 0x00, 0x00, 0x00,
0x68, 0x05, 0x00, 0x00, 0x78, 0x05, 0x00, 0x00, 0xa0, 0x0f, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00,
0xee, 0xf9, 0xff, 0xff, 0x0c, 0x00, 0x00, 0x00, 0x1c, 0x00, 0x00, 0x00,
0x40, 0x00, 0x00, 0x00, 0x0f, 0x00, 0x00, 0x00, 0x73, 0x65, 0x72, 0x76,
0x69, 0x6e, 0x67, 0x5f, 0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x00,
0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x94, 0xff, 0xff, 0xff,
0x0c, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x0d, 0x00, 0x00, 0x00,
0x71, 0x75, 0x61, 0x6e, 0x74, 0x5f, 0x64, 0x65, 0x6e, 0x73, 0x65, 0x5f,
And that's it.
Now, by using Arduino Mbed or the Pico SDK, we can create an application that will run inference on the board.
FYI, because I'm going to use an already pre-trained model from BirdNET, and they already provide a .tflite model, I will only have to convert it to a .h file and program the Pico to run inference on it.
But that will not be an easy task - record audio, split the audio, and run the inference.
That will be my next topic.
Resources
https://www.tensorflow.org/model_optimization/guide/quantization/training_example
https://www.tensorflow.org/lite/microcontrollers
https://blog.tensorflow.org/2021/09/TinyML-Audio-for-everyone.html