Andy Clark's Blog
  • Author: Workshopshed
  • Date Created: 13 May 2020 8:28 PM
  • Views: 1245
  • Likes: 5
  • Comments: 0
  • Tags: tensor flow lite, machine learning

Using a hash function for tokenization during text classification on memory-constrained devices

Workshopshed
13 May 2020

I've recently been looking at machine learning techniques for analysing data, in particular text classification and sentiment analysis. For this I'm using TensorFlow Lite, which can run on edge devices such as microcontrollers. Thanks to Simone Salerno, I've got this running on an Arduino MKR Zero using his EloquentTinyML library.

 

But there are some challenges with processing text. To get the best results, some preparation of the text is recommended before it reaches the model.


  • The raw text is filtered and cleansed: things like emojis and punctuation are removed.
  • The text is then pre-processed; for example, runs of white space are collapsed to single spaces and upper-case letters are converted to lower case.
  • Then comes tokenization, a two-step process: first the text is split into individual words, then each word is converted to a number. This is needed because TensorFlow processes numbers, not text. Typically this step uses a large lookup dictionary that is generated when the model is trained.
  • TensorFlow also requires the input array to be a consistent size, which is again set at training time. To achieve this, the data is padded to a fixed length.
  • Finally, the resulting data is either passed to the model for training or, once on the device, passed to the model for prediction to get the result.
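
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the project: the function names, the tiny `VOCAB` dictionary, the `MAX_LEN` value and the choice of 0 for padding and 1 for unknown words are all hypothetical.

```python
import re

# Hypothetical lookup dictionary; a real one is generated at training time
# and is far larger. 0 is used here for padding and 1 for unknown words.
VOCAB = {"the": 2, "board": 3, "is": 4, "great": 5}
MAX_LEN = 8  # assumed fixed input length for the model


def clean(text):
    # Keep ASCII letters, digits and white space; punctuation and emojis go.
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


def normalise(text):
    # Collapse runs of white space to single spaces and lower-case the text.
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenise(text, vocab, unknown=1):
    # Step 1: split into words. Step 2: convert each word to a number.
    return [vocab.get(word, unknown) for word in text.split(" ") if word]


def pad(tokens, length, pad_value=0):
    # Pad with pad_value, or truncate, so the result is exactly `length` long.
    return (tokens + [pad_value] * length)[:length]


def encode(text):
    return pad(tokenise(normalise(clean(text)), VOCAB), MAX_LEN)


print(encode("The   board is GREAT!! 👍"))  # [2, 3, 4, 5, 0, 0, 0, 0]
```

Note that the `VOCAB` dictionary is the part that grows with the training data, while everything else is a handful of string operations.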

 

Most of these steps are no problem for the board, which has a SAMD21 Cortex-M0+ 32-bit processor running at 48 MHz. However, the tokenization step would need a large lookup table to map every word to its value, something that would not fit into the memory of the microcontroller. So that got me thinking about code that turns strings of letters into numbers: a hash function.

 

I found Paul Hsieh's "SuperFastHash", which has versions in Python and in C for the Arduino. I managed to fit that into the pipeline via a custom TextEncoder. However, it produced very large values, which in turn meant that the Embedding layer of the neural network was also very large. Clipping the output of the hash to a smaller range of values allowed the embedding layer to be reduced as well.
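
As a rough sketch of the idea, the snippet below hashes each word and reduces the result to a small range of ids. Several things here are assumptions rather than the project's actual code: it uses 32-bit FNV-1a as a simple stand-in for SuperFastHash, it reduces the range with a modulo (the exact clipping method is not described above), and `BUCKETS`, `EMBED_DIM` and the function names are hypothetical.

```python
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime

BUCKETS = 1000  # hypothetical number of ids after range reduction
EMBED_DIM = 8   # hypothetical embedding width


def fnv1a_32(data: bytes) -> int:
    # Stand-in hash (not SuperFastHash): XOR in each byte, then multiply.
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value to 32 bits
    return h


def hash_token(word: str, buckets: int) -> int:
    # Reserve id 0 for padding, so words map to 1..buckets-1.
    return 1 + fnv1a_32(word.encode("utf-8")) % (buckets - 1)


def embedding_weights(num_ids: int, dim: int) -> int:
    # An embedding table holds one row of `dim` weights per possible id.
    return num_ids * dim


ids = [hash_token(w, BUCKETS) for w in "the board is great".split()]
print(ids)
print(embedding_weights(2**32, EMBED_DIM), "weights for raw 32-bit hashes")
print(embedding_weights(BUCKETS, EMBED_DIM), "weights after range reduction")
```

The trade-off is that different words can now collide on the same id, and a smaller range means more collisions, so the range becomes one more parameter to experiment with.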

 

After some experimentation with different network parameters, and an unsuccessful attempt at model optimisation, I did manage to produce some models that would compile and fit into the flash of the device.

 

You can see my work in progress at https://github.com/Workshopshed/TinyMLTextClassification

 

References

https://github.com/SukkoPera/Arduino-Rokkit-Hash

https://github.com/JRBANCEL/PySuperFastHash

https://blog.tensorflow.org/2019/06/introducing-tftext.html

https://gist.github.com/Mageswaran1989/70fd26af52ca4afb86e611f84ac83e97#file-text_preprocessing-ipynb
