Path to Programmable III Training Blog #6 Final Blog-Build Machine Vision Project with Vitis-AI on IP-DPU

fyaocn
5 Sep 2023

Build Machine Vision Project with Vitis-AI on IP-DPU

Table of Contents

  • 1 Ultra96v2 and Vitis AI
  • 2 Create Vitis AI development docker environment
  • 3 Create hardware design with Vivado
  • 4 Create project with Vitis AI
  • 5 Create petalinux image with XSCT
  • 6 Steps to go

1 Ultra96v2 and Vitis AI

1.1 Through the training sessions, I got to know the Ultra96v2 and its complete set of peripherals for embedded design projects.

image

The core is the Zynq UltraScale+ MPSoC XCZU3EG, combining PS and PL for high-performance I/O.

image

This block diagram is very important for creating the device tree files (DTS) used in the PetaLinux build. There are not many differences between Ultra96 v1 and Ultra96 v2; the LPDDR configuration is the main one. However, I have still not been able to create a valid device tree for the Ultra96 v2, which is needed to build the PetaLinux image file with the Vitis toolset; this is where I got stuck. Only Ultra96 v1 is supported for now, and a repository with Ultra96 v2 support has not yet been released.

1.2 AMD Vitis™ AI is an integrated development environment that can be leveraged to accelerate AI inference on AMD platforms, providing optimized IP, tools, libraries and models. The structure of Vitis AI is complex, but it is really easy to use once you know it well.

image

For the Ultra96v2, the embedded Deep Learning Processing Unit (DPU) is used:

- First, create the hardware design with Vivado.

- Second, integrate the Vitis AI library with PetaLinux; in some community versions, Ubuntu is also supported for Vitis-AI.

- Third, train and import the AI model, convert it into xmodel format for the AI core to load, and code in C++ or Python as the developer prefers.

For machine vision designs, the toughest part is AI model training and importing; with Vitis-AI this becomes easy. Refer to the sample projects on Hackster.

2 Create Vitis AI development docker environment

Install Docker and pull the latest vitis-ai-tensorflow2-cpu image with

docker pull xilinx/vitis-ai-tensorflow2-cpu:latest

image

The image is over 40 GB, so the pull takes a long time to complete.

image

Then run it in a Docker sandbox

docker run -it xilinx/vitis-ai-tensorflow2-cpu:latest

image
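Alternatively, the Vitis-AI repository provides a docker_run.sh helper that starts the container with the current workspace mounted; a minimal sketch, assuming the standard repository layout:

git clone https://github.com/Xilinx/Vitis-AI.git
cd Vitis-AI
./docker_run.sh xilinx/vitis-ai-tensorflow2-cpu:latest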

Under this environment, a machine learning model can be trained, optimized, pruned, quantized and converted to the final xmodel file for AI acceleration on the Ultra96. Alternatively, one can use a model from the Model Zoo directly, without going through the whole process.

image
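As a rough sketch of the quantize-and-compile part of that flow inside the container (file names here are placeholders, and for a custom Ultra96v2 platform the arch.json must come from your own DPU build rather than a stock board): the floating-point Keras model is quantized with the vitis_quantize Python API to produce quantized_model.h5, which is then compiled to an xmodel with vai_c_tensorflow2.

conda activate vitis-ai-tensorflow2
# quantize the float Keras model with the vitis_quantize Python API -> quantized_model.h5
# then compile the quantized model for the target DPU described by arch.json
vai_c_tensorflow2 -m quantized_model.h5 -a ./arch.json -o ./compiled -n my_model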

3 Create hardware design with Vivado

3.1 Launch Vivado and create a new project named u96_custom_platform

image

Select the Ultra96-V2 board. The board files can be obtained from the Avnet GitHub repository, or just press Refresh for the latest available boards.

image

3.2 Create block design

image

Add the Zynq UltraScale+ MPSoC IP core.

image

Add the Clocking Wizard to the block diagram and configure it as follows (a Tcl equivalent of these settings is sketched after the list):

  • Enable clk_out1 through clk_out3 in the Output Clock column. Set the Requested Output Freq as follows:

    • clk_out1 to 100 MHz.
    • clk_out2 to 200 MHz.
    • clk_out3 to 400 MHz.
  • Set the Reset Type to Active Low.
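For reference, a minimal Tcl sketch of the same settings, assuming the Clocking Wizard instance is named clk_wiz_0 (the GUI flow above is what was actually used):

set_property -dict [list \
  CONFIG.CLKOUT1_REQUESTED_OUT_FREQ {100} \
  CONFIG.CLKOUT2_USED {true} CONFIG.CLKOUT2_REQUESTED_OUT_FREQ {200} \
  CONFIG.CLKOUT3_USED {true} CONFIG.CLKOUT3_REQUESTED_OUT_FREQ {400} \
  CONFIG.RESET_TYPE {ACTIVE_LOW} CONFIG.RESET_PORT {resetn} \
] [get_bd_cells clk_wiz_0]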

image

Add three Processor System Reset blocks, named proc_sys_reset_1, proc_sys_reset_2 and proc_sys_reset_3.

image

Run Connection Automation,

image

Enable the clocks for the platform (a Tcl equivalent is sketched after this list):

  • Enable all clocks under clk_wiz_0: clk_out1, clk_out2, clk_out3
  • Change their ID to 1, 2 and 3
  • Set a default clock: click Is Default for clk_out2
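The equivalent platform clock declaration in Tcl looks roughly like this (a sketch; it assumes each clock is paired with one of the Processor System Reset blocks created earlier):

set_property PFM.CLOCK { \
  clk_out1 {id "1" is_default "false" proc_sys_reset "/proc_sys_reset_1" status "fixed"} \
  clk_out2 {id "2" is_default "true" proc_sys_reset "/proc_sys_reset_2" status "fixed"} \
  clk_out3 {id "3" is_default "false" proc_sys_reset "/proc_sys_reset_3" status "fixed"} \
} [get_bd_cells /clk_wiz_0]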

image

Enable AXI HPM0 LPD to control the AXI Interrupt Controller

  • Enable the AXI HPM0 LPD option.
  • Check the AXI HPM0 LPD Data width settings and keep it as default 32.
  • Disable AXI HPM0 FPD and AXI HPM1 FPD

image

Add the AXI Interrupt Controller and change Interrupt Output Connection to Single so that it can be connected to PS IRQ interface.

image

The AXI Interrupt Controller is instantiated as axi_intc_0.

image

Run Connection Automation, and set both the Clock Source for Slave Interface and the Clock Source for Master Interface to /clk_wiz_0/clk_out2 (200 MHz).
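In Tcl, the interrupt hookup and its platform declaration would look roughly like this (a sketch assuming the instance names above):

connect_bd_net [get_bd_pins axi_intc_0/irq] [get_bd_pins zynq_ultra_ps_e_0/pl_ps_irq0]
set_property PFM.IRQ {intr {id 0 range 32}} [get_bd_cells /axi_intc_0]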

image

Here is the IP design connection:

image

3.3 Enable AXI Interfaces for the Platform

image

Enable the AXI slave interfaces from the PS to allow kernels to access DDR memory.

Under zynq_ultra_ps_e_0, multi-select all AXI slave interfaces: press Ctrl and click S_AXI_HPC0_FPD, S_AXI_HPC1_FPD, S_AXI_HP0_FPD, S_AXI_HP1_FPD, S_AXI_HP2_FPD, S_AXI_HP3_FPD. Right click the selections and select enable.

image

Change Memport of S_AXI_HPC0_FPD and S_AXI_HPC1_FPD to S_AXI_HP because we won't use any coherent features for these interfaces.

Type in simple sptag names for these interfaces so that they can be selected by the v++ configuration during the linking phase: HPC0, HPC1, HP0, HP1, HP2, HP3.
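In Tcl, the same AXI port declaration looks roughly like this (a sketch using the instance and tag names above):

set_property PFM.AXI_PORT { \
  S_AXI_HPC0_FPD {memport "S_AXI_HP" sptag "HPC0" memory ""} \
  S_AXI_HPC1_FPD {memport "S_AXI_HP" sptag "HPC1" memory ""} \
  S_AXI_HP0_FPD {memport "S_AXI_HP" sptag "HP0" memory ""} \
  S_AXI_HP1_FPD {memport "S_AXI_HP" sptag "HP1" memory ""} \
  S_AXI_HP2_FPD {memport "S_AXI_HP" sptag "HP2" memory ""} \
  S_AXI_HP3_FPD {memport "S_AXI_HP" sptag "HP3" memory ""} \
} [get_bd_cells /zynq_ultra_ps_e_0]

During v++ linking these tags are then referenced in a connectivity config, for example sp=DPUCZDX8G_1.M_AXI_HP0:HP0 (the DPU kernel instance name here is hypothetical and depends on the actual DPU configuration).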

image

3.4 Click the Validate Design button first

image

Create HDL Wrapper...

image

image

Generate pre-synth design

image

Launch synthesis

image

Generate Bitstream

image

Now, explore the design; the GUI shows the LUT utilization of the hardware design.

image

image

Export the platform

image

with the project name u96v2_custom_platform_hw, and choose the output location.

image

Now the hardware design is complete and the hardware design file (.xsa) is exported.

image

4   Create project with Vitis AI

Launch the Vitis IDE and create a new platform project.

image

Import the xsa file and select the psu_cortexa53 core.

image

Generate the boot components, and the project is created.

image

click the Build button to generate the platform.

image

Build finished

image
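The same platform can also be generated from the command line with XSCT instead of the IDE; a rough sketch (the platform and output names are placeholders):

xsct
platform create -name u96v2_vitis_platform -hw ./u96v2_custom_platform_hw.xsa -out ./platform_repo
domain create -name xrt -proc psu_cortexa53 -os linux
platform generate
exit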

5 Create petalinux image with XSCT

5.1 Download the ZynqMP common image and unzip it.

image

Type tree to explore the file structure; it is used as the basis for the boot files and rootfs.

image

Copy the source files to the destination directory.

image

Run sdk.sh to install the SDK (sysroot).

image

After all of the above preparation, the xsct command can generate all of the necessary files for the PetaLinux image; it downloads the required libraries and device tree sources.

xsct

image
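The key step inside xsct is createdts, which fetches the device tree sources and compiles them against the exported XSA; roughly like this (the platform name is a placeholder and the branch must match the tool version):

xsct
createdts -hw ./u96v2_custom_platform_hw.xsa -zocl -platform-name ultra96v2_dt -git-branch xlnx_rel_v2022.1 -overlay -compile -out ./dtg_output
exit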

There are some warnings about files during this process, until it finally reports an error in the device tree.

image

6 Steps to go

If everything goes well, the boot files such as BOOT.BIN should be generated. Copy the files to the destination directory, pack them into a PetaLinux image file, and burn it to an SD card with balenaEtcher. The Vitis AI library is integrated into the image file, so the camera can be started and images loaded into the AI engine for the project to build on.
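For reference, a typical way to lay the generated files onto the SD card manually (a sketch; exact file names vary with the build, and the mount points here are placeholders):

# FAT32 boot partition: boot binaries, kernel and device tree
cp BOOT.BIN boot.scr Image system.dtb /media/$USER/BOOT/
# ext4 partition: extract the PetaLinux root file system
sudo tar xzf rootfs.tar.gz -C /media/$USER/rootfs/
sync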

After the SD image is booted, select the display with

export DISPLAY=:0.0

Change the resolution of the DP monitor to a lower resolution, such as 640x480

$ xrandr --output DP-1 --mode 640x480

Launch the adas_detection application from the Vitis AI library:

$ cd ~/Vitis-AI/demo/VART/adas_detection
$ ./adas_detection ./video/adas.avi /usr/share/vitis_ai_library/models/yolov3_adas_pruned_0_9/yolov3_adas_pruned_0_9.xmodel

The above xmodel file was converted in the Vitis-AI docker environment. The coding is simple: load frames with cv::VideoCapture::read, then detect the objects with the function

    detect(boxes, result, channel, height, width, ii, sHeight, sWidth);

Here is the sample code for adas_detection:

#include <dirent.h>
#include <sys/stat.h>
#include <zconf.h>

#include <algorithm>
#include <atomic>
#include <chrono>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <mutex>
#include <opencv2/opencv.hpp>
#include <queue>
#include <string>
#include <thread>
#include <vector>

#include "common.h"
#include "utils.h"

using namespace std;
using namespace cv;
using namespace std::chrono;

#define NMS_THRESHOLD 0.3f

int idxInputImage = 0;  // frame index of input video
int idxShowImage = 0;   // next frame index to be displayed
bool bReading = true;   // flag of reading input frames
bool bExiting = false;
chrono::system_clock::time_point start_time;

typedef pair<int, Mat> imagePair;
class paircomp {
 public:
  // order frames so that the smallest frame index is at the top of the priority queue
  bool operator()(const imagePair& n1, const imagePair& n2) const {
    return n1.first > n2.first;
  }
};

// mutex for protection of input frames queue
mutex mtxQueueInput;
// mutex for protection of display frames queue
mutex mtxQueueShow;
// input frames queue
queue<pair<int, Mat>> queueInput;
// display frames queue
priority_queue<imagePair, vector<imagePair>, paircomp> queueShow;
GraphInfo shapes;
/**
 * @brief Feed input frame into DPU for process
 *
 * @param task - pointer to DPU Task for YOLO-v3 network
 * @param frame - pointer to input frame
 * @param mean - mean value for YOLO-v3 network
 * @param input_scale - input scale , used to convert float to fixed
 *
 * @return none
 */
void setInputImageForYOLO(vart::Runner* runner, int8_t* data, const Mat& frame,
                          float* mean, float input_scale) {
  Mat img_copy;
  int width = shapes.inTensorList[0].width;
  int height = shapes.inTensorList[0].height;
  int size = shapes.inTensorList[0].size;
  image img_new = load_image_cv(frame);
  image img_yolo = letterbox_image(img_new, width, height);

  vector<float> bb(size);
  for (int b = 0; b < height; ++b) {
    for (int c = 0; c < width; ++c) {
      for (int a = 0; a < 3; ++a) {
        bb[b * width * 3 + c * 3 + a] =
            img_yolo.data[a * height * width + b * width + c];
      }
    }
  }

  float scale = pow(2, 7);
  for (int i = 0; i < size; ++i) {
    data[i] = (int8_t)(bb.data()[i] * input_scale);
    if (data[i] < 0) data[i] = (int8_t)((float)(127 / scale) * input_scale);
  }
  free_image(img_new);
  free_image(img_yolo);
}

/**
 * @brief Thread entry for reading image frame from the input video file
 *
 * @param fileName - pointer to video file name
 *
 * @return none
 */
void readFrame(const char* fileName) {
  static int loop = 3;
  VideoCapture video;
  string videoFile = fileName;
  start_time = chrono::system_clock::now();

  while (loop > 0) {
    loop--;
    if (!video.open(videoFile)) {
      cout << "Fail to open specified video file:" << videoFile << endl;
      exit(-1);
    }

    while (true) {
      usleep(20000);
      Mat img;
      if (queueInput.size() < 30) {
        if (!video.read(img)) {
          break;
        }

        mtxQueueInput.lock();
        queueInput.push(make_pair(idxInputImage++, img));
        mtxQueueInput.unlock();
      } else {
        usleep(10);
      }
    }

    video.release();
  }
  bExiting = true;
}

/**
 * @brief Thread entry for displaying image frames
 *
 * @param  none
 * @return none
 *
 */
void displayFrame() {
  Mat frame;

  while (true) {
    if (bExiting) break;
    mtxQueueShow.lock();

    if (queueShow.empty()) {
      mtxQueueShow.unlock();
      usleep(10);
    } else if (idxShowImage == queueShow.top().first) {
      auto show_time = chrono::system_clock::now();
      stringstream buffer;
      frame = queueShow.top().second;
      if (frame.rows <= 0 || frame.cols <= 0) {
        mtxQueueShow.unlock();
        continue;
      }
      auto dura = (duration_cast<microseconds>(show_time - start_time)).count();
      buffer << fixed << setprecision(1)
             << (float)queueShow.top().first / (dura / 1000000.f);
      string a = buffer.str() + " FPS";
      cv::putText(frame, a, cv::Point(10, 15), 1, 1, cv::Scalar{0, 0, 240}, 1);
      cv::imshow("ADAS Detection@Xilinx DPU", frame);

      idxShowImage++;
      queueShow.pop();
      mtxQueueShow.unlock();
      if (waitKey(1) == 'q') {
        bReading = false;
        exit(0);
      }
    } else {
      mtxQueueShow.unlock();
    }
  }
}

/**
 * @brief Post process after the running of DPU for YOLO-v3 network
 *
 * @param task - pointer to DPU task for running YOLO-v3
 * @param frame
 * @param sWidth
 * @param sHeight
 *
 * @return none
 */
void postProcess(vart::Runner* runner, Mat& frame, vector<int8_t*> results,
                 int sWidth, int sHeight, const float* output_scale) {
  const string classes[3] = {"car", "person", "cycle"};

  /* four output nodes of YOLO-v3 */
  // const string outputs_node[4] = {"layer81_conv", "layer93_conv",
  //                                    "layer105_conv", "layer117_conv"};

  vector<vector<float>> boxes;
  // auto  outputTensors = runner->get_output_tensors();
  for (int ii = 0; ii < 4; ii++) {
    int width = shapes.outTensorList[ii].width;
    int height = shapes.outTensorList[ii].height;
    int channel = shapes.outTensorList[ii].channel;
    int sizeOut = channel * width * height;
    vector<float> result(sizeOut);
    boxes.reserve(sizeOut);

    /* Store every output node results */
    get_output(results[ii], sizeOut, channel, height, width, output_scale[ii],
               result);

    /* Store the object detection frames as coordinate information  */
    detect(boxes, result, channel, height, width, ii, sHeight, sWidth);
  }

  /* Restore the correct coordinate frame of the original image */
  correct_region_boxes(boxes, boxes.size(), frame.cols, frame.rows, sWidth,
                       sHeight);

  /* Apply the computation for NMS */
  vector<vector<float>> res = applyNMS(boxes, classificationCnt, NMS_THRESHOLD);

  float h = frame.rows;
  float w = frame.cols;
  for (size_t i = 0; i < res.size(); ++i) {
    float xmin = (res[i][0] - res[i][2] / 2.0) * w + 1.0;
    float ymin = (res[i][1] - res[i][3] / 2.0) * h + 1.0;
    float xmax = (res[i][0] + res[i][2] / 2.0) * w + 1.0;
    float ymax = (res[i][1] + res[i][3] / 2.0) * h + 1.0;

    if (res[i][res[i][4] + 6] > CONF) {
      int type = res[i][4];
      string classname = classes[type];

      if (type == 0) {
        rectangle(frame, Point(xmin, ymin), Point(xmax, ymax),
                  Scalar(0, 0, 255), 1, 1, 0);
      } else if (type == 1) {
        rectangle(frame, Point(xmin, ymin), Point(xmax, ymax),
                  Scalar(255, 0, 0), 1, 1, 0);
      } else {
        rectangle(frame, Point(xmin, ymin), Point(xmax, ymax),
                  Scalar(0, 255, 255), 1, 1, 0);
      }
    }
  }
}

/**
 * @brief Thread entry for running YOLO-v3 network on DPU for acceleration
 *
 * @param task - pointer to DPU task for running YOLO-v3
 *
 * @return none
 */
void runYOLO(vart::Runner* runner) {
  /* mean values for YOLO-v3 */
  float mean[3] = {0.0f, 0.0f, 0.0f};
  auto inputTensors = cloneTensorBuffer(runner->get_input_tensors());
  int width = shapes.inTensorList[0].width;
  int height = shapes.inTensorList[0].height;
  auto outputTensors = cloneTensorBuffer(runner->get_output_tensors());

  auto input_scale = get_input_scale(runner->get_input_tensors()[0]);
  auto output_scale = vector<float>();
  for (int i=0; i < 4; i++) {
    output_scale.push_back(get_output_scale(
        runner->get_output_tensors()[shapes.output_mapping[i]]));
  }
  // input/output data define
  int8_t* data = new int8_t[shapes.inTensorList[0].size *
                            inputTensors[0]->get_shape().at(0)];
  int8_t* result0 =
      new int8_t[shapes.outTensorList[0].size *
                 outputTensors[shapes.output_mapping[0]]->get_shape().at(0)];
  int8_t* result1 =
      new int8_t[shapes.outTensorList[1].size *
                 outputTensors[shapes.output_mapping[1]]->get_shape().at(0)];
  int8_t* result2 =
      new int8_t[shapes.outTensorList[2].size *
                 outputTensors[shapes.output_mapping[2]]->get_shape().at(0)];
  int8_t* result3 =
      new int8_t[shapes.outTensorList[3].size *
                 outputTensors[shapes.output_mapping[3]]->get_shape().at(0)];
  vector<int8_t*> result;
  result.push_back(result0);
  result.push_back(result1);
  result.push_back(result2);
  result.push_back(result3);
  std::vector<std::unique_ptr<vart::TensorBuffer>> inputs, outputs;
  std::vector<vart::TensorBuffer*> inputsPtr, outputsPtr;
  while (true) {
    pair<int, Mat> pairIndexImage;

    mtxQueueInput.lock();
    if (queueInput.empty()) {
      mtxQueueInput.unlock();
      if (bExiting) break;
      if (bReading) {
        continue;
      } else {
        break;
      }
    } else {
      /* get an input frame from input frames queue */
      pairIndexImage = queueInput.front();
      queueInput.pop();
      mtxQueueInput.unlock();
    }
    /* feed input frame into DPU Task with mean value */

    setInputImageForYOLO(runner, data, pairIndexImage.second, mean,
                         input_scale);
    // input/output tensorbuffer prepare
    inputs.push_back(
        std::make_unique<CpuFlatTensorBuffer>(data, inputTensors[0].get()));

    outputs.push_back(std::make_unique<CpuFlatTensorBuffer>(
        result0, outputTensors[shapes.output_mapping[0]].get()));
    outputs.push_back(std::make_unique<CpuFlatTensorBuffer>(
        result1, outputTensors[shapes.output_mapping[1]].get()));
    outputs.push_back(std::make_unique<CpuFlatTensorBuffer>(
        result2, outputTensors[shapes.output_mapping[2]].get()));
    outputs.push_back(std::make_unique<CpuFlatTensorBuffer>(
        result3, outputTensors[shapes.output_mapping[3]].get()));
    inputsPtr.push_back(inputs[0].get());
    outputsPtr.resize(4);
    outputsPtr[shapes.output_mapping[0]] = outputs[0].get();
    outputsPtr[shapes.output_mapping[1]] = outputs[1].get();
    outputsPtr[shapes.output_mapping[2]] = outputs[2].get();
    outputsPtr[shapes.output_mapping[3]] = outputs[3].get();
    /* invoke the running of DPU for YOLO-v3 */
    auto job_id = runner->execute_async(inputsPtr, outputsPtr);
    runner->wait(job_id.first, -1);

    postProcess(runner, pairIndexImage.second, result, width, height,
                output_scale.data());
    mtxQueueShow.lock();

    /* push the image into display frame queue */
    queueShow.push(pairIndexImage);
    mtxQueueShow.unlock();
    inputs.clear();
    outputs.clear();
    inputsPtr.clear();
    outputsPtr.clear();
  }
  delete[] data;
  delete[] result0;
  delete[] result1;
  delete[] result2;
  delete[] result3;
}

/**
 * @brief Entry for running YOLO-v3 neural network for ADAS object detection
 *
 */
int main(const int argc, const char** argv) {
  if (argc != 3) {
    cout << "Usage of ADAS detection: " << argv[0]
         << " <video_file> <model_file>" << endl;
    return -1;
  }

  /* Create 4 DPU Tasks for YOLO-v3 network model */

  /* Spawn 6 threads:
  - 1 thread for reading video frame
  - 4 identical threads for running YOLO-v3 network model
  - 1 thread for displaying frame in monitor
  */
  // auto runners = vart::Runner::create_dpu_runner(argv[2]);
  auto graph = xir::Graph::deserialize(argv[2]);
  auto subgraph = get_dpu_subgraph(graph.get());
  CHECK_EQ(subgraph.size(), 1u)
      << "yolov3 should have one and only one dpu subgraph.";
  LOG(INFO) << "create running for subgraph: " << subgraph[0]->get_name();

  auto runner = vart::Runner::create_runner(subgraph[0], "run");
  auto runner1 = vart::Runner::create_runner(subgraph[0], "run");
  auto runner2 = vart::Runner::create_runner(subgraph[0], "run");
  auto runner3 = vart::Runner::create_runner(subgraph[0], "run");
  // get in/out tensors
  auto inputTensors = runner->get_input_tensors();
  auto outputTensors = runner->get_output_tensors();
  int inputCnt = inputTensors.size();
  int outputCnt = outputTensors.size();
  // init the shape info
  TensorShape inshapes[inputCnt];
  TensorShape outshapes[outputCnt];
  shapes.inTensorList = inshapes;
  shapes.outTensorList = outshapes;
  getTensorShape(runner.get(), &shapes, inputCnt,
                 {"layer81", "layer93", "layer105", "layer117"});

  array<thread, 6> threadsList = {
      thread(readFrame, argv[1]), thread(displayFrame),
      // thread(runYOLO, runner),
      thread(runYOLO, runner.get()), thread(runYOLO, runner1.get()),
      thread(runYOLO, runner2.get()), thread(runYOLO, runner3.get())};

  for (int i = 0; i < 6; i++) {
    threadsList[i].join();
  }

  return 0;
}

This project can still move forward once a new Xilinx library release supports the Ultra96v2.

  • prashanthgn.engineer
    prashanthgn.engineer over 2 years ago

    Nice 

    Step by step guide is very good.

    I will try to follow the steps and generate the new image
