PYNQ-Z2 Dev Kit - Tiny-YOLO Object Detection

18 Aug 2019

The next neural network that I'm going to try is a variant of Tiny-YOLO. The You Only Look Once (YOLO) architecture was developed to create a one step process for detection and classification. The image is divided into a fixed grid of uniform cells and bounding boxes are predicted and classified within each cell. This architecture enables faster object detection and has been applied to streaming video.

The network topology is shown below. The pink colored layers have been quantized with 1 bit for weights and 3 bit for activations, and will be executed in the HW accelerator, while the other layers are executed in python.

The image processing is performed within Darknet by using python bindings.

The neural network has been trained on the PASCAL VOC (Visual Object Classes) and is able to identify 20 classes of objects in 4 categories

Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: airplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

The steps for detection and classification are similar to the previous network as this network also uses the Multi-layer offload architecture.

Initialize the network

Import libraries
Instantiate classifier
Perform other initializations in the Darknet framework

Code for initialization:

import sys
import os, platform
import json
import numpy as np
import cv2
import ctypes

from PIL import Image
from datetime import datetime

import qnn
from qnn import TinierYolo
from qnn import utils 
sys.path.append("/opt/darknet/python/")
from darknet import *

from matplotlib import pyplot as plt
%matplotlib inline

classifier = TinierYolo()
classifier.init_accelerator()
net = classifier.load_network(json_layer="/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-layers.json")

conv0_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv0-W.npy', encoding="latin1")
conv0_weights_correct = np.transpose(conv0_weights, axes=(3, 2, 1, 0))
conv8_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv8-W.npy', encoding="latin1")
conv8_weights_correct = np.transpose(conv8_weights, axes=(3, 2, 1, 0))
conv0_bias = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv0-bias.npy', encoding="latin1")
conv0_bias_broadcast = np.broadcast_to(conv0_bias[:,np.newaxis], (net['conv1']['input'][0],net['conv1']['input'][1]*net['conv1']['input'][1]))
conv8_bias = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv8-bias.npy', encoding="latin1")
conv8_bias_broadcast = np.broadcast_to(conv8_bias[:,np.newaxis], (125,13*13))
file_name_cfg = c_char_p("/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-bwn-3bit-relu-nomaxpool.cfg".encode())

net_darknet = lib.parse_network_cfg(file_name_cfg)

Classify image

Open image to be classified
Execute the first convolutional layer in Python
Compute HW Offload of the quantized layers
Normalize using fully connected layers in python

Code for classification:

img_folder = './yoloimages/'
img_file = os.path.join(img_folder, random.choice(os.listdir(img_folder)))
file_name = c_char_p(img_file.encode())

img = load_image(file_name,0,0)
img_letterbox = letterbox_image(img,416,416)
img_copy = np.copy(np.ctypeslib.as_array(img_letterbox.data, (3,416,416)))
img_copy = np.swapaxes(img_copy, 0,2)
free_image(img)
free_image(img_letterbox)

im = Image.open(img_file)
im

start = datetime.now()
img_copy = img_copy[np.newaxis, :, :, :]
    
conv0_ouput = utils.conv_layer(img_copy,conv0_weights_correct,b=conv0_bias_broadcast,stride=2,padding=1)
conv0_output_quant = conv0_ouput.clip(0.0,4.0)
conv0_output_quant = utils.quantize(conv0_output_quant/4,3)
end = datetime.now()
micros = int((end - start).total_seconds() * 1000000)
print("First layer SW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'w'))

out_dim = net['conv7']['output'][1]
out_ch = net['conv7']['output'][0]

conv_output = classifier.get_accel_buffer(out_ch, out_dim)
conv_input = classifier.prepare_buffer(conv0_output_quant*7);

start = datetime.now()
classifier.inference(conv_input, conv_output)
end = datetime.now()

conv7_out = classifier.postprocess_buffer(conv_output)

micros = int((end - start).total_seconds() * 1000000)
print("HW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'a'))

start = datetime.now()
conv7_out_reshaped = conv7_out.reshape(out_dim,out_dim,out_ch)
conv7_out_swapped = np.swapaxes(conv7_out_reshaped, 0, 1) # exp 1
conv7_out_swapped = conv7_out_swapped[np.newaxis, :, :, :] 

conv8_output = utils.conv_layer(conv7_out_swapped,conv8_weights_correct,b=conv8_bias_broadcast,stride=1)  
conv8_out = conv8_output.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

end = datetime.now()
micros = int((end - start).total_seconds() * 1000000)
print("Last layer SW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'a'))

Draw detection boxes using Darknet

The image postprocessing (drawing the bounding boxes) is performed in darknet using python bindings

Code for image postprocessing:

lib.forward_region_layer_pointer_nolayer(net_darknet,conv8_out)
tresh = c_float(0.3)
tresh_hier = c_float(0.5)
file_name_out = c_char_p("/home/xilinx/jupyter_notebooks/qnn/detection".encode())
file_name_probs = c_char_p("/home/xilinx/jupyter_notebooks/qnn/probabilities.txt".encode())
file_names_voc = c_char_p("/opt/darknet/data/voc.names".encode())
darknet_path = c_char_p("/opt/darknet/".encode())
lib.draw_detection_python(net_darknet, file_name, tresh, tresh_hier,file_names_voc, darknet_path, file_name_out, file_name_probs);

#Print probabilities
file_content = open(file_name_probs.value,"r").read().splitlines()
detections = []
for line in file_content[0:]:
    name, probability = line.split(": ")
    detections.append((probability, name))
for det in sorted(detections, key=lambda tup: tup[0], reverse=True):
    print("class: {}\tprobability: {}".format(det[1], det[0]))

Sample image (horses)

The first image that I going to use is a provided sample image of horses (773 x 512 pixels)

Execution time:

First layer SW implementation took 594523 microseconds
HW implementation took 593735 microseconds
Last layer SW implementation took 68420 microseconds

Classification:

Object detection bounding boxes:

The example shows the issues that occur with multiple overlapping objects.

IP camera images

The application that I would like to use neural networks for is object identification in video streams from surveillance cameras. As an example, I have an PTZ IP camera at the front of my house that is primarily used to alert me to deliveries (mail, Amazon, UPS, etc). It is normally pointed at the driveway and mailbox, but the pan/tilt capability allows me to look up and down the street and also at my front door (270 degrees of coverage). Currently, image motion detection and PIR sensing tell me when something is detected but I need to look at the camera video to determine if it is something of interest. And needless to say, there are a lot of false detections. I have 2 video sources that I'd like to analyze, the live fed from the camera and also stored video from a network video recorder (NVR). I have multiple cameras, but I think it would be okay to require that each camera have dedicated processing hardware.

The PYNQ notebook examples that I've found either use the HDMI input or a webcam as a streaming video source. For my application I need the ability to process an RTSP (Real Time Streaming Protocol) stream over ethernet. I had hoped that I could just use the VideoCapture function in OpenCV, but I can't seem to get that to work. I'm sure that I'll be able to get something to work, but for the purposes of this roadtest I'm just going to use static images from the camera (actually from the NVR). I currently stream 2 resolutions from this particular camera (1280x720 and 640x480). I'd like to use the lower resolution stream for processing if it doesn't degrade the accuracy too much. I'm going to test that with the image captures from the NVR (the lower resolution captures from the NVR are actually only 320x176 - to allow for faster searching). It turns out that because the detection grid is a fixed ratio to the image that the large and small images have about the same execution time.

Night image (1280x720)

The car to the right is not detected.

Day image (1280x720)

Multiple bounding boxes for the same image

Truck image (320x176)

Truck image (1280x720)

class: car probability: 96% -- improved classification with larger image size (better resolution?)

Different truck (320x176)

Multiple cars (320x176)

did okay with the shadows

Me (320x176)

Multiple objects (1280x720)

Seems to have a harder time with people

Amazon and Mail trucks (1280x720)

Conclusion

So, I've got a few challenges ahead of me after this roadtest.

Figure out how to capture the RTSP stream (BTW, I do this successfully with a Raspberry Pi)
Quantify usable frame rate (currently taking over a second to execute)
Figure out to train with something that allows me to differentiate vehicles

Top Comments

nctiglao over 3 years ago in reply to weiwei2

Can someone share their work around getting PYNQ-Z2 to access RTSP streams?
- Cancel
- Vote Up 0 Vote Down
- Sign in to reply
- More
- Cancel
weiwei2 over 4 years ago

RTSP is indeed challenging ... even currently i try to use RTSP over windows .net core software to do this is also challenging ... but it is good to know if someone has achieved it some day and provide an example.
- Cancel
- Vote Up +2 Vote Down
- Sign in to reply
- More
- Cancel
DAB over 5 years ago

Interesting results.

DAB
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
genebren over 5 years ago in reply to ralphjy

Ralph,

I am not sure how much information you get from your classifier, but some additional testing might be able to figure out how to sub-classify objects. If you have a container of the object, you might be able to run a color histogram over the region and use color to sub-classify (big brown car = UPS truck). Size of the container might allow you to distinguish truck from car. There are a lot of ways to go.

I wrote (from scratch) a program to count cells from a microscope image. There were a lot of steps to normalizing and correcting the image (poor illumination and lack of focus in outer regions of the image), but it eventually was successful. Object classification occurred after segmentation, and the objects were sorted based on size (number of pixels). A histogram of sizes was used to determine the most popular size range (to determine the estimated object size). Smaller objects were ignored and larger objects were subjected to various attempts to de-cluster groups of cells into individuals (circle fitting, re-thresholding and segmenting, etc.). The resulting code was pretty fast and fairly accurate. It would have been interesting to have been able to try this in hardware, but that was not one of the options, as the hardware was already designed when I jumped in to improve/re-write the processing.

Good luck,
Gene
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
ralphjy over 5 years ago in reply to genebren

Hi Gene,

The classifier was trained with the PASCAL Visual Object Classes which only has the following vehicle classes: airplane, bicycle, boat, bus, car, motorbike, train.

I also thought it was odd at first that it couldn't differentiate a truck. Of course, I want to differentiate the UPS truck vs Amazon vs the mail truck. I'm thinking because I don't have such a general purpose case that maybe I can figure out how to train with a small subset of vehicles.

Ralph
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel