Before I move on to object detection, I thought I would try one more example of object classification, this time using a more complex neural network based on the multi-layer offload architecture. The network is a variant of DoReFa-Net and is trained on the large ImageNet dataset (http://www.image-net.org/). DoReFa-Net (https://arxiv.org/pdf/1606.06160) is a low-bitwidth convolutional neural network that is trained with low-bitwidth gradients and is well suited to implementation on hardware like FPGAs.
ImageNet Classifier:
The network topology is shown below. The pink layers are executed in the Programmable Logic at reduced precision (1 bit for weights, 2 bits for activations), while the other layers are executed in Python.
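To make "1 bit for weights, 2 bits for activations" concrete, here is a minimal NumPy sketch of the quantizers described in the DoReFa-Net paper. It is an illustration only, not the code that runs in the Programmable Logic; the function names and the sample values are made up for this example.

```python
import numpy as np

def quantize_k(x, k):
    """Map values in [0, 1] onto 2**k evenly spaced levels."""
    n = float(2 ** k - 1)
    return np.round(x * n) / n

def binarize_weights(w):
    """1-bit weights: keep only the sign, scaled by the mean absolute value."""
    scale = np.mean(np.abs(w))
    return np.where(w >= 0, 1.0, -1.0) * scale

def quantize_activations(a, k=2):
    """k-bit activations: clip to [0, 1], then quantize to 2**k levels."""
    return quantize_k(np.clip(a, 0.0, 1.0), k)

# Hypothetical sample values, chosen only to show the effect.
weights = np.array([0.4, -0.2, 0.1, -0.7])
activations = np.array([-0.3, 0.1, 0.45, 0.8, 1.6])

w_q = binarize_weights(weights)                # every entry is +/- the same scale
a_q = quantize_activations(activations, k=2)   # only 0, 1/3, 2/3 or 1 can occur

print("binarized weights:    ", w_q)
print("quantized activations:", a_q)
```

With only two weight values and four activation values, the multiply-accumulate operations reduce to very cheap logic, which is why these layers map well onto an FPGA.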
Initialize the network
- Import libraries
- Instantiate classifier
- Load labels and synsets of the 1000 ImageNet classes into dictionaries
Code for initialization:
    import os, pickle, random
    from datetime import datetime
    from matplotlib import pyplot as plt
    from PIL import Image
    %matplotlib inline

    import numpy as np
    import cv2

    import qnn
    from qnn import Dorefanet
    from qnn import utils

    # Instantiate a classifier
    classifier = Dorefanet()
    classifier.init_accelerator()
    net = classifier.load_network(json_layer="/usr/local/lib/python3.6/dist-packages/qnn/params/dorefanet-layers.json")

    conv0_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/dorefanet-conv0.npy', encoding="latin1").item()
    fc_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/dorefanet-fc-normalized.npy', encoding='latin1').item()

    # Get ImageNet Classes information
    with open("/home/xilinx/jupyter_notebooks/qnn/imagenet-classes.pkl", 'rb') as f:
        classes = pickle.load(f)
        names = dict((k, classes[k][1].split(',')[0]) for k in classes.keys())
        synsets = dict((classes[k][0], classes[k][1].split(',')[0]) for k in classes.keys())
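The two dictionary expressions at the end are easier to follow on a small example. The sketch below assumes the pickle maps a class index to a (synset ID, comma-separated names) pair, which is what the indexing in the code implies. The two entries are hand-written stand-ins, not read from imagenet-classes.pkl.

```python
# Hand-written stand-in for the unpickled data (illustrative entries only):
# class index -> (synset ID, comma-separated names)
classes = {
    0: ("n01440764", "tench, Tinca tinca"),
    2: ("n01484850", "great white shark, white shark, man-eater"),
}

# Class index -> first (short) name, used to label the network output
names = {k: v[1].split(',')[0] for k, v in classes.items()}

# Synset ID -> first (short) name, useful when an image is labelled by synset
synsets = {v[0]: v[1].split(',')[0] for v in classes.values()}

print(names)
print(synsets)
```

Both dictionaries keep only the first of the comma-separated names, so each class ends up with a single short label.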
Classify image
- Open image to be classified
- Execute the first convolutional layer in Python
- Compute HW Offload of the quantized layers
- Normalize the results and run the fully connected layers in Python
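The step from the first convolutional layer to the hardware offload depends on reducing each value to a 2-bit code. A common way to do this is to compare each value against a set of trained per-channel thresholds and count how many it reaches. The sketch below shows that idea. It is an assumption about what the threshold step does, not a copy of the qnn implementation, and the function name, array shapes, and values are hypothetical.

```python
import numpy as np

def threshold_quantize(x, thresholds):
    """Count how many thresholds each value meets or exceeds.

    x:          array of shape (channels, height, width)
    thresholds: array of shape (channels, levels), ascending per channel
    returns:    integer array with values in [0, levels], same shape as x
    """
    # Broadcast to (channels, levels, height, width), then count along levels.
    reached = x[:, None, :, :] >= thresholds[:, :, None, None]
    return reached.sum(axis=1)

# One channel, 2x2 feature map, three thresholds -> four levels (2 bits).
x = np.array([[[-1.0, 0.2], [0.6, 2.5]]])
thresholds = np.array([[0.0, 0.5, 1.0]])

q = threshold_quantize(x, thresholds)
print(q)
```

Three thresholds per channel give four possible codes (0 to 3), which fit in 2 bits.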
Code for classification:
    # Open the most recently created image in the folder
    img_folder = "/home/xilinx/jupyter_notebooks/qnn/images/"
    img_file = os.path.join(img_folder, max(os.listdir(img_folder), key=lambda f: os.path.getctime(os.path.join(img_folder, f))))
    img, img_class = classifier.load_image(img_file)
    im = Image.open(img_file)
    im

    # Execute first layer
    conv0_W = conv0_weights['conv0/W']
    conv0_T = conv0_weights['conv0/T']

    start = datetime.now()
    # 1st convolutional layer execution, having as input the image and the trained parameters (weights)
    conv0 = utils.conv_layer(img, conv0_W, stride=4)
    # The result is then quantized to a 2-bit representation for the subsequent HW offload
    conv0 = utils.threshold(conv0, conv0_T)
    end = datetime.now()
    micros = int((end - start).total_seconds() * 1000000)
    print("First layer SW implementation took {} microseconds".format(micros))
    print(micros, file=open('timestamp.txt', 'w'))

    # Compute offloaded convolutional layers
    out_dim = net['merge4']['output_dim']
    out_ch = net['merge4']['output_channels']

    # Allocate accelerator output buffer
    conv_output = classifier.get_accel_buffer(out_ch, out_dim)
    conv_input = classifier.prepare_buffer(conv0)

    start = datetime.now()
    classifier.inference(conv_input, conv_output)
    end = datetime.now()
    micros = int((end - start).total_seconds() * 1000000)
    print("HW implementation took {} microseconds".format(micros))
    print(micros, file=open('timestamp.txt', 'a'))

    conv_output = classifier.postprocess_buffer(conv_output)

    # Normalize results
    fc_input = conv_output / np.max(conv_output)

    start = datetime.now()
    # FC Layer 0
    fc0_W = fc_weights['fc0/Wn']
    fc0_b = fc_weights['fc0/bn']
    fc0_out = utils.fully_connected(fc_input, fc0_W, fc0_b)
    fc0_out = utils.qrelu(fc0_out)
    fc0_out = utils.quantize(fc0_out, 2)

    # FC Layer 1
    fc1_W = fc_weights['fc1/Wn']
    fc1_b = fc_weights['fc1/bn']
    fc1_out = utils.fully_connected(fc0_out, fc1_W, fc1_b)
    fc1_out = utils.qrelu(fc1_out)

    # FC Layer 2 (no trained bias, so use zeros)
    fct_W = fc_weights['fct/W']
    fct_b = np.zeros((fct_W.shape[1], ))
    fct_out = utils.fully_connected(fc1_out, fct_W, fct_b)
    end = datetime.now()
    micros = int((end - start).total_seconds() * 1000000)
    print("Fully-connected layers took {} microseconds".format(micros))
    print(micros, file=open('timestamp.txt', 'a'))
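The listing stops at the raw output of the last fully connected layer (fct_out). One common way to turn such scores into a ranked list of labels is a softmax followed by a top-k selection. The following is a minimal sketch of that idea, not necessarily how the notebook produces its results. The helper names, the four demo labels, and the demo scores are made up, and in practice fct_out and the names dictionary would take their place.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def top_k(logits, names, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(np.asarray(logits, dtype=np.float64).ravel())
    order = np.argsort(probs)[::-1][:k]
    return [(names[int(i)], float(probs[i])) for i in order]

# Hypothetical stand-ins for the names dictionary and the network output.
demo_names = {0: "tench", 1: "goldfish", 2: "great white shark", 3: "tiger shark"}
demo_logits = np.array([0.5, 1.0, 4.0, 2.0])

ranking = top_k(demo_logits, demo_names, k=3)
for label, p in ranking:
    print("{:<20s} {:.3f}".format(label, p))
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow when the scores are large.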
I tested the network with five images. The shark image was included with the notebook; the other four are the two dog and two puppy images that I downloaded from the Internet and had used earlier to test the CIFAR-10 binary network. Here are the results:
Shark
Classification Result:
Execution time:
- First layer SW implementation took 654851 microseconds
- HW implementation took 79813 microseconds
- Fully-connected layers took 569449 microseconds
Total execution time: 1304113 microseconds
Full SW implementation execution time:
The network was also tested with the middle HW layer implemented in SW to determine the impact of the HW implementation.
Total execution time: 397517703 microseconds
With the middle layers offloaded to hardware, the whole network runs about 300x faster (1304113 vs. 397517703 microseconds)!
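The arithmetic behind that comparison can be checked with a few lines of Python using the timings reported above. The overall ratio follows directly from the two totals. The estimate for the middle layers alone rests on an assumption: that the first layer and the fully connected layers took about the same time in the full SW run as in the HW run, and that the full SW total is in microseconds like the other timings.

```python
# Timings reported above, in microseconds
first_layer_sw_us = 654851
hw_offload_us = 79813
fc_layers_us = 569449
full_sw_total_us = 397517703   # unit assumed to match the other timings

# Total for the run with the middle layers in hardware
hybrid_total_us = first_layer_sw_us + hw_offload_us + fc_layers_us

# Overall speedup of the whole network
overall_speedup = full_sw_total_us / hybrid_total_us

# Assumption: first layer and FC layers cost the same in both runs,
# so the remainder of the full SW time belongs to the middle layers.
middle_sw_us = full_sw_total_us - first_layer_sw_us - fc_layers_us
middle_speedup = middle_sw_us / hw_offload_us

# Share of the hybrid run that is actually spent in hardware
hw_share = hw_offload_us / hybrid_total_us

print("Hybrid total:   {} microseconds".format(hybrid_total_us))
print("Overall speedup: {:.1f}x".format(overall_speedup))
print("Middle layers:   {:.0f}x (estimate)".format(middle_speedup))
print("HW share of run: {:.1%}".format(hw_share))
```

Under that assumption the offloaded layers themselves come out several thousand times faster in hardware, and the hybrid run is then dominated by the two parts that still execute in Python.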
The execution time profile is approximately the same for all the images, so I'll only provide the classification results for the rest of the images.
Dog1
Classification Result:
Dog2
Classification Result:
Puppy1
Classification Result:
Puppy2
Classification Result:
The classifier struggled a bit with the puppies, but I'll admit that I'm not sure what breeds they are either (they could be mixed). I was impressed by how well it did with the other images.
Time to move on to what I really want to do: object detection and identification within an image.