One of the technical areas that I am interested in is object detection in video streams, specifically the real-time identification of objects in video from IP surveillance cameras. As part of the PYNQ-Z2 roadtest I want to see how well an FPGA implementation of a neural network works for this task. A key to an efficient implementation (in both power and area) in programmable logic is quantization. While researching the PYNQ-Z2 I came across an interesting paper on the subject, FINN: A Framework for Fast, Scalable Binarized Neural Network Inference (https://arxiv.org/abs/1612.07119). The paper describes a framework for building fast and flexible FPGA accelerators using Binarized Neural Networks (BNNs).
The good news for me is that example overlays have already been built for PYNQ for both BNNs and QNNs (Quantized Neural Networks, which here use single-bit weights and multi-bit (2 or 3 bit) activations). In the time remaining for this roadtest I don't think that I can learn the Vivado flow well enough to train and build my own network model, but hopefully I can test and tweak the example overlays to run with my own video streams.
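To make it clearer why binarization suits programmable logic, here is a minimal Python sketch of the idea: once weights and activations are restricted to +1/-1, a dot product reduces to an XNOR followed by a bit count, with no multipliers needed. The function names and the sample values are my own, chosen for illustration. This is not code from the FINN framework or from the PYNQ overlays.

```python
def binarize(values):
    """Map real values to {-1, +1} by sign (0 is mapped to +1 here by choice)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: bit i is 1 for +1, 0 for -1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, w_word, n):
    """Dot product of two n-element {-1, +1} vectors stored as bit words."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ w_word) & mask      # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")        # popcount
    return 2 * matches - n                 # matches minus mismatches


# Illustrative values only (not taken from any trained network).
activations = [0.7, -1.2, 0.1, -0.4, 2.0, -0.3, 0.0, 1.5]
weights = [0.2, 0.9, -0.5, -0.1, 1.1, 0.4, -0.8, 0.6]

a = binarize(activations)
w = binarize(weights)

# Reference result using ordinary multiply-accumulate on the +1/-1 values.
reference = sum(x * y for x, y in zip(a, w))
# Same result using only bitwise operations and a bit count.
fast = xnor_popcount_dot(pack_bits(a), pack_bits(w), len(a))
```

Both `reference` and `fast` hold the same value. The second form uses only bitwise logic and a counter, which map naturally onto FPGA lookup tables.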
The first step is to try running through the available examples and I'll do that in this post. The examples are from the following GitHub repositories:
The second repository on the list has examples of both BNN and QNN with two different hardware architectures.
- Feed-forward Dataflow: all layers of the network are implemented in hardware. The output of one layer feeds the input of the next, which starts processing as soon as data is available. The network parameters for all layers are cached in on-chip memory. For each network topology, a customized hardware implementation is generated that provides low latency and high throughput.
- Multi-layer offload: a fixed hardware architecture is implemented that can compute multiple layers in a single call. The complete network is executed in multiple calls, which are scheduled on the same hardware architecture. Changing the network topology means changing the runtime scheduling, but not the hardware architecture. This provides a flexible implementation at the cost of slightly higher latency.
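The difference between the two architectures can be sketched in software. The toy Python below is purely illustrative and all the names in it are hypothetical. The "layers" are simple arithmetic functions rather than real network layers, and nothing here reflects how the actual overlays are built. The dataflow version chains one stage per layer so each item flows through as soon as it is produced. The offload version reuses a single fixed engine and schedules the network as several calls.

```python
def make_layers():
    """Three toy 'layers'; each is just a function applied to one number."""
    return [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]


def dataflow_stage(layer, upstream):
    """One dedicated stage per layer: emits a result as soon as an input arrives."""
    for item in upstream:
        yield layer(item)


def run_dataflow(layers, inputs):
    """Chain one stage per layer, like a pipeline generated for this topology."""
    stream = iter(inputs)
    for layer in layers:
        stream = dataflow_stage(layer, stream)
    return list(stream)


def offload_engine(layers_chunk, batch):
    """Fixed engine: runs whatever group of layers it is handed in one call."""
    for layer in layers_chunk:
        batch = [layer(x) for x in batch]
    return batch


def run_offload(layers, inputs, layers_per_call=2):
    """Runtime scheduler: splits the network into calls on the same engine."""
    data = list(inputs)
    calls = 0
    for start in range(0, len(layers), layers_per_call):
        data = offload_engine(layers[start:start + layers_per_call], data)
        calls += 1
    return data, calls


layers = make_layers()
dataflow_result = run_dataflow(layers, [1, 2, 3])
offload_result, call_count = run_offload(layers, [1, 2, 3])
```

Both paths compute the same outputs. In the offload version a different network only changes how `run_offload` slices the layer list, while the dataflow version would need a new chain of stages.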
Available examples:
The BNN based notebooks with dataflow are:
- Cifar10: shows a convolutional neural network, composed of 6 convolutional, 3 max pool and 3 fully connected layers trained on the Cifar10 dataset
- SVHN: shows a convolutional neural network, composed of 6 convolutional, 3 max pool and 3 fully connected layers trained on the Street View House Number dataset
- GTSRB: shows a convolutional neural network, composed of 6 convolutional, 3 max pool and 3 fully connected layers trained on the German Traffic Sign Recognition Benchmark dataset
- MNIST: shows a multi layer perceptron with 3 fully connected layers trained on the MNIST dataset for digit recognition
The QNN based notebooks with multi-layer offload are:
- ImageNet Classification: shows how to classify a non-labelled image (e.g., downloaded from the web, your phone etc) into one of the 1000 classes available in the ImageNet dataset.
- ImageNet - Dataset validation: shows an example of classifying a labelled image (i.e., extracted from the dataset) into one of the 1000 classes available in the ImageNet dataset.
- ImageNet - Dataset validation in a loop: shows an example of classifying labelled images (i.e., extracted from the dataset) into one of the 1000 classes available in the ImageNet dataset, in a loop.
- Object Detection - from image: shows object detection in an image (e.g., downloaded from the web, your phone etc), identifying objects in a scene and drawing bounding boxes around them. Each object can belong to one of the 20 classes available in the PASCAL VOC dataset.
- Object Detection - from image in a loop: shows object detection in an image, drawing bounding boxes around identified objects (20 available classes from the PASCAL VOC dataset), in a loop.
I'll run through all the example notebooks in this repository, but the one most applicable to what I want to do is the last notebook, which uses a variant of Tiny-Yolo to do object detection within an image. I'll need to adapt that to do object detection within a video stream. The examples are fairly detailed, so I'll pick a couple of notebooks from each category and do a separate post for each one.
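As a rough idea of what adapting a per-image detector to a video stream might look like, here is a minimal Python sketch. Everything in it is an assumption on my part. `detect_on_stream`, `fake_detect` and the `stride` parameter are hypothetical names, the detector is a placeholder rather than the Tiny-Yolo overlay, and the frames are plain nested lists. On real hardware the frames would more likely come from a camera or stream reader.

```python
import time


def detect_on_stream(frames, detect, stride=1, max_frames=None):
    """Run a per-image detector over an iterable of frames.

    frames     -- any iterable yielding one frame at a time
    detect     -- callable taking a frame and returning a list of boxes
    stride     -- process every stride-th frame and skip the rest
    max_frames -- stop after this many frames have been read (None = no limit)
    """
    results = []
    for index, frame in enumerate(frames):
        if max_frames is not None and index >= max_frames:
            break
        if index % stride != 0:
            continue  # skip this frame so a slow detector can keep up
        start = time.perf_counter()
        boxes = detect(frame)
        elapsed = time.perf_counter() - start
        results.append({"frame": index, "boxes": boxes, "seconds": elapsed})
    return results


def fake_detect(frame):
    """Placeholder detector: one (label, x, y, width, height) box covering the frame."""
    height = len(frame)
    width = len(frame[0])
    return [("person", 0, 0, width, height)]


# Ten dummy 4x6 'frames' standing in for decoded video frames.
dummy_frames = [[[0] * 6 for _ in range(4)] for _ in range(10)]
stream_results = detect_on_stream(dummy_frames, fake_detect, stride=3)
```

Skipping frames with `stride` is one simple way to cope when detection takes longer than the camera's frame interval. The per-frame timing in `seconds` gives a quick check of whether the loop keeps up.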