RoadTest: PYNQ-Z2 Dev Board: Python Productivity for Zynq®
Author: sambit1991
Creation date:
Evaluation Type: Development Boards & Tools
Did you receive all parts the manufacturer stated would be included in the package?: True
What other parts do you consider comparable to this product?: Ultra96
What were the biggest problems encountered?: 1) Very unorganized and inadequate documentation. 2) Key concepts such as overlay design are not very well explained. 3) Third-party repositories such as BNN are not very well documented and mostly support only inference.
Detailed Review:
Element14 sent me a Pynq FPGA development kit to road test and review across several topics.
Here is what I did with it.
The shipment came in a nice element14 box. The actual board and other accessories were neatly and securely packed inside in smaller boxes, with the Pynq Z2 having nice logos of Xilinx and TuL.
It also contained a USB cable for connecting to a laptop, an ethernet cable and power adapters.
I also found a SD card preloaded with Pynq image and an adapter for SD card.
Everything was pretty neat, secured and nicely packed and arrived on time.
Element14 helped me out a little more by shipping it on a preferred date, since I was not at home on the first planned date.
My major area of interest in using FPGAs is machine learning, so the review is built around several machine learning applications.
The key aspects of the review follow.
CIFAR-10 is a dataset containing 32×32 images from ten categories:
print(hw_classifier.classes)
['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']
The detailed code can be found in the code section.
The application uses a quantized binary neural network with the following architecture:
6 convolutional layers
3 max pool layers
3 fully connected layers.
The network is pre-trained and the weights are stored.
Instantiating a hardware or software version of the classifier loads the trained weights and makes the network ready for inference.
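For reference, setting up both runtimes takes just two calls (the same ones used in the full listing further below):

import bnn

# Hardware runtime loads the BNN overlay onto the FPGA fabric;
# software runtime runs the same network on the ARM cores.
hw_classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, 'cifar10', bnn.RUNTIME_HW)
sw_classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, 'cifar10', bnn.RUNTIME_SW)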
Since my interest is in automobile detection, I have tested this network with automobile images downloaded from the internet.
The downloaded images were first transferred to the Pynq. This was super easy, thanks to the network drive that can be accessed at \\192.168.2.99\xilinx.
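For example, from a Windows PC on the same network, a small script can push the test images straight into the notebook area (the share maps to /home/xilinx on the board; the SAMBIT_DATA folder is just the name I chose):

import shutil

# Copy local car images to the board's Samba share.
dest = r"\\192.168.2.99\xilinx\jupyter_notebooks\SAMBIT_DATA"
for i in range(1, 10):
    shutil.copy("car{}.jpg".format(i), dest)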
The following code iteratively fetches 9 images of cars from the Pynq file system, stored at /home/xilinx/jupyter_notebooks/SAMBIT_DATA/.
It then passes each of the images to the classifiers instantiated in hardware and software respectively.
# Take the above pieces and put them together into a function
import bnn
from PIL import Image
import numpy as np
import time

hw_classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, 'cifar10', bnn.RUNTIME_HW)
sw_classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, 'cifar10', bnn.RUNTIME_SW)

# Quick sanity check with a single image
im_name = '/home/xilinx/jupyter_notebooks/SAMBIT_DATA/car' + str(4) + '.jpg'
print("Classifying image : {0}".format(im_name))
im = Image.open(im_name)
im  # displays the sample image when run as the last line of a Jupyter cell

def classify_images_cifar10():
    """
    Classifies a series of images using both classifiers.
    hw -> hardware, sw -> software
    """
    print("Available classes")
    print(hw_classifier.classes)
    print("========================== hardware classifications ==============================")
    for i in range(1, 10):
        im_name = '/home/xilinx/jupyter_notebooks/SAMBIT_DATA/car' + str(i) + '.jpg'
        print("Classifying image : {0}".format(im_name))
        im = Image.open(im_name)
        class_out = hw_classifier.classify_image(im)
        print("Class number: {0}".format(class_out))
        print("Class name: {0}".format(hw_classifier.class_name(class_out)))
    print("======================== software classifications ===============================")
    for i in range(1, 10):
        im_name = '/home/xilinx/jupyter_notebooks/SAMBIT_DATA/car' + str(i) + '.jpg'
        print("Classifying image : {0}".format(im_name))
        im = Image.open(im_name)
        class_out = sw_classifier.classify_image(im)
        print("Class number: {0}".format(class_out))
        print("Class name: {0}".format(sw_classifier.class_name(class_out)))
    return im

if __name__ == '__main__':
    im = classify_images_cifar10()
    im
Following is the output:
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car4.jpg
Available classes
['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']
========================== hardware classifications ==============================
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car1.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car2.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car3.jpg
Inference took 1583.00 microseconds
Classification rate: 631.71 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car4.jpg
Inference took 1581.00 microseconds
Classification rate: 632.51 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car5.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car6.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car7.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 0
Class name: Airplane
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car8.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car9.jpg
Inference took 1582.00 microseconds
Classification rate: 632.11 images per second
Class number: 3
Class name: Cat
======================== software classifications ===============================
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car1.jpg
Inference took 1587185.00 microseconds
Classification rate: 0.63 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car2.jpg
Inference took 1586030.00 microseconds
Classification rate: 0.63 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car3.jpg
Inference took 1586563.00 microseconds
Classification rate: 0.63 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car4.jpg
Inference took 1586526.00 microseconds
Classification rate: 0.63 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car5.jpg
Inference took 1586699.00 microseconds
Classification rate: 0.63 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car6.jpg
Inference took 1604946.00 microseconds
Classification rate: 0.62 images per second
Class number: 1
Class name: Automobile
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car7.jpg
Inference took 1586711.00 microseconds
Classification rate: 0.63 images per second
Class number: 0
Class name: Airplane
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car8.jpg
Inference took 1586855.00 microseconds
Classification rate: 0.63 images per second
Class number: 8
Class name: Ship
Classifying image : /home/xilinx/jupyter_notebooks/SAMBIT_DATA/car9.jpg
Inference took 1586411.00 microseconds
Classification rate: 0.63 images per second
Class number: 3
Class name: Cat
Some of the images used are as follows:
Classifier | Accuracy | Time per image |
---|---|---|
Hardware | 44.44% | 1.58 ms |
Software | 44.44% | 1586 ms |
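Both numbers in the table follow directly from the logs above:

# 4 of the 9 car photos (car2, car4, car5, car6) came back as "Automobile",
# and per-image times were ~1582 us in hardware vs ~1586526 us in software.
print("accuracy: {:.2%}".format(4 / 9))            # 44.44%
print("speedup : {:.0f}x".format(1586526 / 1582))  # ~1003x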
MNIST handwritten digit classification is widely regarded as the "hello world" of machine learning applications.
The task is to classify 28×28 images of handwritten digits into one of the ten digits 0 - 9.
The architecture of the tested network has just 3 fully connected layers.
Since I did not have a USB webcam, I simply saved some test images to the SD card on the Pynq.
The examples were then adapted into a single routine that collects the pictures from the SD card and runs the classifier on them, in both hardware and software.
# Run over several handwritten digits to test the network
import bnn
import cv2
import numpy as np
from array import array
from PIL import Image as PIL_Image
from PIL import ImageEnhance, ImageOps
from scipy import misc

hw_classifier = bnn.LfcClassifier(bnn.NETWORK_LFCW1A1, "mnist", bnn.RUNTIME_HW)
sw_classifier = bnn.LfcClassifier(bnn.NETWORK_LFCW1A1, "mnist", bnn.RUNTIME_SW)

for i in range(3, 10):
    im_path = '/home/xilinx/jupyter_notebooks/SAMBIT_DATA/' + str(i) + '.jpg'
    cv2_im = cv2.imread(im_path, 1)
    cv2_im = cv2.cvtColor(cv2_im, cv2.COLOR_BGR2RGB)
    img = PIL_Image.fromarray(cv2_im).convert("L")

    # Image enhancement - the contrast and brightness values depend on
    # background, external lights etc.
    img = ImageEnhance.Contrast(img).enhance(3)
    img = ImageEnhance.Brightness(img).enhance(4.0)
    # img = img.rotate(180)  # rotate the image (depending on camera orientation)

    # Add a border for future cropping
    img = ImageOps.expand(img, border=80, fill='white')
    display(img)

    # Find the bounding box of the digit and crop to it
    inverted = ImageOps.invert(img)
    box = inverted.getbbox()
    img_new = img.crop(box)
    width, height = img_new.size
    ratio = min((28. / height), (28. / width))

    # Scale onto a white 28x28 canvas, preserving aspect ratio
    background = PIL_Image.new('RGB', (28, 28), (255, 255, 255))
    if height == width:
        img_new = img_new.resize((28, 28))
        background.paste(img_new, (0, 0))
    elif height > width:
        img_new = img_new.resize((int(width * ratio), 28))
        background.paste(img_new, (int((28 - img_new.size[0]) / 2),
                                   int((28 - img_new.size[1]) / 2)))
    else:
        img_new = img_new.resize((28, int(height * ratio)))
        background.paste(img_new, (int((28 - img_new.size[0]) / 2),
                                   int((28 - img_new.size[1]) / 2)))

    # Save the 28x28 image and reload it for thresholding
    img_data = np.asarray(background)[:, :, 0]
    misc.imsave('/home/xilinx/img_webcam_mnist.png', img_data)
    img_load = PIL_Image.open('/home/xilinx/img_webcam_mnist.png').convert("L")

    # Invert the image (white digit on black) and binarise it
    smallimg = ImageOps.invert(img_load)
    smallimg = smallimg.rotate(0)
    data_image = array('B')
    pixel = smallimg.load()
    for x in range(0, 28):
        for y in range(0, 28):
            if pixel[y, x] == 255:
                data_image.append(255)
            else:
                data_image.append(1)

    # Setting up the header of the MNIST-format file - required as the
    # hardware is designed for the MNIST dataset
    hexval = "{0:#0{1}x}".format(1, 6)
    header = array('B')
    header.extend([0, 0, 8, 1, 0, 0])
    header.append(int('0x' + hexval[2:][:2], 16))
    header.append(int('0x' + hexval[2:][2:], 16))
    header.extend([0, 0, 0, 28, 0, 0, 0, 28])
    header[3] = 3  # changing MSB for image data (0x00000803)
    data_image = header + data_image
    with open('/home/xilinx/img_webcam_mnist_processed', 'wb') as output_file:
        data_image.tofile(output_file)
    display(smallimg)

    class_out = hw_classifier.classify_mnist("/home/xilinx/img_webcam_mnist_processed")
    print("Class number: {0}".format(class_out))
    print("Class name: {0}".format(hw_classifier.class_name(class_out)))
    print("============================= SOFTWARE ======================================")
    class_out = sw_classifier.classify_mnist("/home/xilinx/img_webcam_mnist_processed")
    print("Class number: {0}".format(class_out))
    print("Class name: {0}".format(sw_classifier.class_name(class_out)))
Now, this classified almost every image wrong! To save space, I am not reproducing the incorrect classification results here.
However, just for the sake of the speed comparison, consider the following plot:
As the plot shows, hardware inference is approximately 1000 times faster than software, which matters a great deal for deep neural network workloads.
This is a smaller implementation of the state-of-the-art You Only Look Once (YOLO) object detection algorithm.
It uses quantized weights for the convolution filters and the inference is accelerated in hardware.
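As a rough illustration of what activation quantization means here (a minimal sketch; the real preprocessing is the clip and utils.quantize(x/4, 3) calls in the listing below, whose exact rounding behaviour I have not verified):

import numpy as np

def quantize(x, bits):
    """Map values in [0, 1] onto a uniform grid of 2**bits steps."""
    scale = 2.0 ** bits
    return np.floor(x * scale) / scale

conv_out = np.random.rand(2, 4) * 5.0      # stand-in for real conv activations
act = np.clip(conv_out, 0.0, 4.0) / 4.0    # clip and normalise to [0, 1]
print(quantize(act, 3))                    # 3-bit quantized activations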
This application tests the network with some random images downloaded from the internet.
# Detection loop - net, classifier, utils, the conv0/conv8 weights and the
# darknet library handles are all set up in earlier cells of the QNN notebook.
out_dim = net['conv7']['output'][1]
out_ch = net['conv7']['output'][0]

# img_folder = './yoloimages/'  # uncomment to reset
img_folder = '/home/xilinx/jupyter_notebooks/SAMBIT_DATA/YOLO_TEST'
file_name_out = c_char_p("/home/xilinx/jupyter_notebooks/qnn/detection".encode())
file_name_probs = c_char_p("/home/xilinx/jupyter_notebooks/qnn/probabilities.txt".encode())
file_names_voc = c_char_p("/opt/darknet/data/voc.names".encode())
tresh = c_float(0.3)
tresh_hier = c_float(0.5)
darknet_path = c_char_p("/opt/darknet/".encode())

conv_output = classifier.get_accel_buffer(out_ch, out_dim)

while True:
    for image_name in os.listdir(img_folder):
        img_file = os.path.join(img_folder, image_name)
        file_name = c_char_p(img_file.encode())
        img = load_image(file_name, 0, 0)
        img_letterbox = letterbox_image(img, 416, 416)
        img_copy = np.copy(np.ctypeslib.as_array(img_letterbox.data, (3, 416, 416)))
        img_copy = np.swapaxes(img_copy, 0, 2)
        free_image(img)
        free_image(img_letterbox)

        # First convolution layer in software
        if len(img_copy.shape) < 4:
            img_copy = img_copy[np.newaxis, :, :, :]
        conv0_ouput = utils.conv_layer(img_copy, conv0_weights_correct,
                                       b=conv0_bias_broadcast, stride=2, padding=1)
        conv0_output_quant = conv0_ouput.clip(0.0, 4.0)
        conv0_output_quant = utils.quantize(conv0_output_quant / 4, 3)

        # Offload the middle layers to hardware
        conv_input = classifier.prepare_buffer(conv0_output_quant * 7)
        classifier.inference(conv_input, conv_output)
        conv7_out = classifier.postprocess_buffer(conv_output)

        # Last convolution layer in software
        conv7_out = conv7_out.reshape(out_dim, out_dim, out_ch)
        conv7_out = np.swapaxes(conv7_out, 0, 1)  # exp 1
        if len(conv7_out.shape) < 4:
            conv7_out = conv7_out[np.newaxis, :, :, :]
        conv8_output = utils.conv_layer(conv7_out, conv8_weights_correct,
                                        b=conv8_bias_broadcast, stride=1)
        conv8_out = conv8_output.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

        # Draw detection boxes
        lib.forward_region_layer_pointer_nolayer(net_darknet, conv8_out)
        lib.draw_detection_python(net_darknet, file_name, tresh, tresh_hier,
                                  file_names_voc, darknet_path,
                                  file_name_out, file_name_probs)

        # Display the result
        IPython.display.clear_output(1)
        file_content = open(file_name_probs.value, "r").read().splitlines()
        detections = []
        for line in file_content[0:]:
            name, probability = line.split(": ")
            detections.append((probability, name))
        for det in sorted(detections, key=lambda tup: tup[0], reverse=True):
            print("class: {}\tprobability: {}".format(det[1], det[0]))
        res = Image.open(file_name_out.value.decode() + ".png")
        display(res)
        # time.sleep(5)
As can be seen in the video, the network performs really well!
It classifies most objects in the scene correctly.
Only when several birds appeared very close together did the network misclassify them as "aeroplane".
Even then, the detection and bounding-box regression worked remarkably well.
This is the best part, the part that interests me the most.
In this application, I followed some online resources to build a custom overlay that takes in two integers and returns their sum and product.
The logic itself is written in C++, synthesized with Vivado HLS, and runs on the FPGA fabric.
PYNQ gets a handle to the overlay, which I then use to pass arguments and read back results.
There is a nice surprise in this, read along!!
The FPGA C++ code:
void addmul(int a, int b, int& sm, int& pr) {
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE s_axilite port=a
#pragma HLS INTERFACE s_axilite port=b
#pragma HLS INTERFACE s_axilite port=sm
#pragma HLS INTERFACE s_axilite port=pr
    sm = a + b;
    pr = a * b;
}
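The s_axilite pragmas make Vivado HLS generate an AXI-Lite register map for the block; the 0x10/0x18/0x20/0x28 offsets used throughout the Python below come from the HLS-generated driver header. A small wrapper class (my own convenience helper, not part of the PYNQ API) keeps them in one place:

from pynq import Overlay

class AddMul:
    """Thin wrapper around the addmul IP's AXI-Lite register map."""
    A, B, SM, PR = 0x10, 0x18, 0x20, 0x28  # offsets from the HLS driver header

    def __init__(self, ip):
        self.ip = ip

    def compute(self, a, b):
        self.ip.write(self.A, a)
        self.ip.write(self.B, b)
        return self.ip.read(self.SM), self.ip.read(self.PR)

olay = Overlay('/home/xilinx/pynq/overlays/addmul/addmul_block.bit')
addmul = AddMul(olay.addmul_0)
print(addmul.compute(3, 4))  # expect (7, 12)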
The Block design:
The tcl and bit files can be found attached to try out.
The Python code for hardware acceleration:
from pynq import Overlay
import time

olay = Overlay('/home/xilinx/pynq/overlays/addmul/addmul_block.bit')
ip = olay.addmul_0

test_list1 = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23]
test_list2 = [23, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 10]
print("Adding and multiplying {0} sets of numbers in hardware ".format(len(test_list1)))

# Mimic a 1-D convolution: every element of list1 against every element of list2.
# (A single-pair version of this benchmark appears further below.)
t0 = time.clock()  # start time; time.clock() exists on the Python 3.6 PYNQ image,
                   # use time.perf_counter() on Python 3.8+
for i in test_list1:
    for j in test_list2:
        ip.write(0x10, i)     # operand a
        ip.write(0x18, j)     # operand b
        var1 = ip.read(0x20)  # sum
        var2 = ip.read(0x28)  # product
        print("sum = {0}, prod = {1}".format(var1, var2))
t1 = time.clock()
print("HW acceleration took : {0} uS".format((t1 - t0) * 1000000))
Python code for software:
import time

test_list1 = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23]
test_list2 = [23, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 10]
print("Adding and multiplying {0} sets of numbers in software ".format(len(test_list1)))

# Mimic a 1-D convolution in pure Python.
# (A single-pair version of this benchmark appears further below.)
t0 = time.clock()  # start time
for i in test_list1:
    for j in test_list2:
        var1 = i + j
        var2 = i * j
        print("sum = {0}, prod = {1}".format(var1, var2))
t1 = time.clock()
print("SW took : {0} uS".format((t1 - t0) * 1000000))
For the code shown above, the timings are as follows:
Whoa!!! Did you expect that? I certainly did not. Hardware acceleration actually took longer than software.
Well, don't be disheartened. This is mainly due to the time lost in pushing data into the hardware: every ip.write() and ip.read() is a single-word transaction from the Python code space across the AXI-Lite interface to the IP, and that per-call overhead dwarfs the trivial add and multiply.
Let's try this:
test_list1 = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23]
test_list2 = [23, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 10]
print("Adding and multiplying {0} sets of numbers in hardware ".format(len(test_list1)))

t0 = time.clock()  # start time
for i in range(len(test_list1)):
    ip.write(0x10, test_list1[i])
    ip.write(0x18, test_list2[i])
    var1 = ip.read(0x20)
    var2 = ip.read(0x28)
    print("sum = {0}, prod = {1}".format(var1, var2))
t1 = time.clock()
print("HW acceleration took : {0} uS".format((t1 - t0) * 1000000))
test_list1 = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23]
test_list2 = [23, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 10]
print("Adding and multiplying {0} sets of numbers in software ".format(len(test_list1)))

t0 = time.clock()  # start time
for i in range(len(test_list1)):
    var1 = test_list1[i] + test_list2[i]
    var2 = test_list1[i] * test_list2[i]
    print("sum = {0}, prod = {1}".format(var1, var2))
t1 = time.clock()
print("SW took : {0} uS".format((t1 - t0) * 1000000))
And ....
Hardware acceleration is the winner!!!
Here, I simply reduced the number of register read/write operations by reducing the number of calls to the IP: 12 element-wise operations instead of 144 pairwise ones.
This is why we generally want to pass data to the hardware in large chunks, such as arrays or matrices, instead of one value at a time.
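As a sketch of what "large chunks" looks like in PYNQ (this assumes a hypothetical overlay that pairs an AXI DMA named axi_dma_0 with a streaming version of the kernel - the addmul overlay above has no DMA - and the pynq.allocate API available in newer PYNQ releases):

import numpy as np
from pynq import Overlay, allocate

olay = Overlay('/home/xilinx/pynq/overlays/stream/stream.bit')  # hypothetical
dma = olay.axi_dma_0

# Physically contiguous buffers that the DMA engine can access directly.
in_buf = allocate(shape=(1024,), dtype=np.uint32)
out_buf = allocate(shape=(1024,), dtype=np.uint32)
in_buf[:] = np.arange(1024, dtype=np.uint32)

# Two transfers move the whole array - instead of 1024 register writes.
dma.sendchannel.transfer(in_buf)
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()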
This was a wonderful experience for me, as my first road test review for a product.
The Pynq Z2 is truly an amazing low-cost, entry-level board for students, hobbyists and experimenters who wish to leverage the hardware acceleration offered by FPGAs.
However, to take advantage of all its features, one needs in-depth knowledge of Python, HDL design, the Xilinx toolchain and the Zynq architecture.
Since all of this is very hard to acquire at once, I feel it would be really useful if an exhaustive set of resources and tutorials were made available.
Until then, the most important resources a developer needs to bring are time and patience.
Thank you Element14 for all the help and support, and for bearing with my often annoying and stupid questions and requests.
I have learnt quite a lot from this first review, most essentially, lots of patience.
Though I would have loved to do more with the Pynq, I will stop here for now.
Looking forward to more such opportunities and great products to review.
Top Comments
Useful review - thank you.
I think your scoring was generous, but the text of the review tells a lot more.
MK
FYI there appears to be an error in the code in section 4.4 where you are adding and multiplying the values in the two lists.
There is a nested for loop:
which…