I've recently been looking at machine learning techniques for analysing data, in particular text classification and sentiment analysis. For this I'm using TensorFlow Lite, which can run on edge devices such as microcontrollers. Thanks to Simone Salerno, I've got this running on an Arduino MKR Zero using his EloquentTinyML library.
But text brings some challenges of its own. To get the best results, several preparation steps are recommended:
- The raw text is filtered and cleansed; things like emojis and punctuation are removed.
- The text is then normalised, for example runs of whitespace are collapsed to single spaces and upper-case letters are made lower case.
- Then comes tokenization. This is a two-step process: first the data is split into individual words, then each word is converted to a number. This is needed because TensorFlow processes numbers, not text. Typically this step uses a big lookup dictionary that is generated when the model is trained.
- Another requirement for TensorFlow is that the input array is a consistent size; again, this is fixed at training time. To achieve this, the data is padded out to a consistent length.
- Finally the resulting data is either fed to the model for training or, once on the device, passed to the model for prediction to get the results. A sketch of the whole pipeline follows this list.
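To make those steps concrete, here's a minimal sketch of the pipeline in Python. The regular expressions, the tiny five-word dictionary and the `MAX_LENGTH` value are my own illustrative choices, not the values used in the actual model:

```python
import re

# Illustrative dictionary; the real one is generated from the
# training corpus and is far larger.
word_index = {"the": 1, "battery": 2, "life": 3, "is": 4, "great": 5}

MAX_LENGTH = 10  # input size, fixed when the model is trained
PAD_VALUE = 0    # value used for padding and unknown words

def preprocess(text):
    # Filter and cleanse: keep letters and spaces only, dropping
    # emojis, punctuation and other symbols.
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # Normalise: collapse whitespace, make everything lower case.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Tokenize: split into words, then map each word to its number.
    tokens = [word_index.get(word, PAD_VALUE) for word in text.split()]
    # Pad (or truncate) to the consistent length the model expects.
    tokens = tokens[:MAX_LENGTH]
    return tokens + [PAD_VALUE] * (MAX_LENGTH - len(tokens))

print(preprocess("The battery   life is GREAT!!! 👍"))
# [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
```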
Most of these steps are no problem for the board, which has a SAMD21 Cortex-M0+ 32-bit processor running at 48 MHz. However, the tokenization step would need a big lookup dictionary to cover all the different words, something that would not fit into the memory of the microcontroller. So that got me thinking about code that turns strings of letters into numbers: a hash function.
I found Paul Hsieh's "SuperFastHash", which has versions in Python and in C for the Arduino. I managed to fit that into the pipeline via a custom TextEncoder. However, it was producing very large values, which in turn meant that the Embedding layer of the neural network was also very large. Clipping the output of the hash so it produced a smaller range of values allowed the embedding layer to be reduced too.
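The clipping itself can be as simple as a modulo. Here's a minimal sketch of the idea, using a 32-bit FNV-1a hash as a stand-in for SuperFastHash; the `VOCAB_SIZE` value is my own illustrative choice:

```python
VOCAB_SIZE = 1024  # illustrative: the clipped range of token values

def fnv1a_32(word):
    # Stand-in for SuperFastHash; any 32-bit string hash works here.
    h = 0x811C9DC5
    for byte in word.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

def encode(word):
    # Clip the full 32-bit hash into a small fixed range so the
    # Embedding layer only needs VOCAB_SIZE entries.
    return fnv1a_32(word) % VOCAB_SIZE
```

Whichever hash is used, the same function has to run identically in the training pipeline and on the microcontroller, otherwise the token numbers on the device won't match the ones the model was trained on.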
After some experimentation with different network parameters, and some unsuccessful attempts at model optimisation, I did manage to produce models that would compile and hence fit into the flash of the device.
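For context on why the clipped range matters for flash: in Keras the Embedding layer stores one weight vector per possible token value, so its size scales directly with the hash range. A minimal sketch, with layer sizes that are my own illustrative choices rather than the ones I trained with:

```python
import tensorflow as tf

VOCAB_SIZE = 1024  # must match the clipped hash range
MAX_LENGTH = 10    # must match the padded input length

model = tf.keras.Sequential([
    # VOCAB_SIZE rows of 8 weights each: shrinking the hash range
    # shrinks this table, and with it the converted model.
    tf.keras.layers.Embedding(VOCAB_SIZE, 8, input_length=MAX_LENGTH),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```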
You can see my work in progress at https://github.com/Workshopshed/TinyMLTextClassification
References
https://github.com/SukkoPera/Arduino-Rokkit-Hash
https://github.com/JRBANCEL/PySuperFastHash