I was asked by my colleague ctammann to provide a design for the Kria KV260 Vision Starter Kit that would be a power hog for some power testing.
Although all of the Kria apps seem to use a B3136 DPU, I remembered that Xilinx generated its KV260 benchmarks using a B4096 DPU.
Looking at the benchmarks taken from the Xilinx Vitis-AI GitHub repository:
https://github.com/Xilinx/Vitis-AI/blob/master/models/AI-Model-Zoo/README.md#performance-on-kria-kv260-som
I figured that the most compute-intensive model should be a good candidate to stress the KV260 for power testing.
In the following table, I identified a few candidates, including the pointpainting model (No. 102), which has the highest single-thread latency of all the models listed.
No. | Model | Name | E2E latency (ms), 1 thread | E2E throughput (fps), single thread | E2E throughput (fps), multi-thread |
1 | resnet50 | cf_resnet50_imagenet_224_224_7.7G | 13.68 | 73 | 77.1 |
2 | resnet18 | cf_resnet18_imagenet_224_224_3.65G | 5.37 | 186.1 | 213.8 |
3 | Inception_v1 | cf_inceptionv1_imagenet_224_224_3.16G | 5.46 | 183.2 | 210.6 |
4 | Inception_v2 | cf_inceptionv2_imagenet_224_224_4G | 7.53 | 132.8 | 147.8 |
5 | Inception_v3 | cf_inceptionv3_imagenet_299_299_11.4G | 16.86 | 59.3 | 63.7 |
6 | Inception_v4 | cf_inceptionv4_imagenet_299_299_24.5G | 34.32 | 29.1 | 30.1 |
7 | Mobilenet_v2 | cf_mobilenetv2_imagenet_224_224_0.59G | 3.95 | 252.9 | 310.4 |
8 | SqueezeNet | cf_squeezenet_imagenet_227_227_0.76G | 3.7 | 269.9 | 557.7 |
9 | ssd_pedestrian_pruned_0_97 | cf_ssdpedestrian_coco_360_640_0.97_5.9G | 12.76 | 78.3 | 107 |
10 | refinedet_baseline | cf_refinedet_coco_360_480_123G | 111.02 | 9 | 9.2 |
11 | refinedet_pruned_0_8 | cf_refinedet_coco_360_480_0.8_25G | 29.44 | 34 | 37 |
12 | refinedet_pruned_0_92 | cf_refinedet_coco_360_480_0.92_10.10G | 15.58 | 64.2 | 75.9 |
13 | refinedet_pruned_0_96 | cf_refinedet_coco_360_480_0.96_5.08G | 11.3 | 88.5 | 111.2 |
14 | ssd_adas_pruned_0_95 | cf_ssdadas_bdd_360_480_0.95_6.3G | 11.18 | 89.4 | 119.2 |
15 | ssd_traffic_pruned_0_9 | cf_ssdtraffic_360_480_0.9_11.6G | 17.65 | 56.6 | 74.3 |
16 | VPGnet_pruned_0_99 | cf_VPGnet_caltechlane_480_640_0.99_2.5G | 10.6 | 94.4 | 149.4 |
17 | ssd_mobilenet_v2 | cf_ssdmobilenetv2_bdd_360_480_6.57G | 40.41 | 24.7 | 61.7 |
18 | FPN | cf_fpn_cityscapes_256_512_8.9G | 29.41 | 34 | 76.4 |
19 | SP_net | cf_SPnet_aichallenger_224_128_0.54G | 1.88 | 530.4 | 680.3 |
20 | Openpose_pruned_0_3 | cf_openpose_aichallenger_368_368_0.3_189.7G | 274.99 | 3.6 | 5.5 |
21 | densebox_320_320 | cf_densebox_wider_320_320_0.49G | 2.37 | 421.7 | 845.4 |
22 | densebox_640_360 | cf_densebox_wider_360_640_1.11G | 4.85 | 206 | 418.5 |
23 | face_landmark | cf_landmark_celeba_96_72_0.14G | 1.25 | 801.4 | 938.1 |
24 | reid | cf_reid_market1501_160_80_0.95G | 3.02 | 331.2 | 379.1 |
25 | multi_task | cf_multitask_bdd_288_512_14.8G | 26.07 | 38.3 | 53.7 |
26 | yolov3_bdd | dk_yolov3_bdd_288_512_53.7G | 77.06 | 13 | 13.5 |
27 | yolov3_adas_pruned_0_9 | dk_yolov3_cityscapes_256_512_0.9_5.46G | 11.03 | 90.6 | 122.8 |
28 | yolov3_voc | dk_yolov3_voc_416_416_65.42G | 74.85 | 13.3 | 13.8 |
29 | yolov2_voc | dk_yolov2_voc_448_448_34G | 36.91 | 27.1 | 29 |
30 | yolov2_voc_pruned_0_66 | dk_yolov2_voc_448_448_0.66_11.56G | 15.15 | 66 | 76.7 |
31 | yolov2_voc_pruned_0_71 | dk_yolov2_voc_448_448_0.71_9.86G | 13.2 | 75.7 | 90.4 |
32 | yolov2_voc_pruned_0_77 | dk_yolov2_voc_448_448_0.77_7.82G | 11.36 | 88 | 108.5 |
33 | ResNet20-face | cf_facerec-resnet20_112_96_3.5G | 6.21 | 161.1 | 167 |
34 | ResNet64-face | cf_facerec-resnet64_112_96_11G | 13.71 | 72.9 | 74.1 |
35 | FPN_Res18_segmentation | cf_FPN-resnet18_EDD_320_320_45.3G | 78.11 | 12.8 | 16.8 |
36 | plate detection | cf_plate-detection_320_320_0.49G | 1.96 | 510.8 | 1059.4 |
37 | plate recognition | cf_plate-recognition_96_288_1.75G | 6.06 | 164.9 | 263.9 |
38 | retinaface | cf_retinaface_wider_360_640_1.11G | 7.85 | 127.4 | 267 |
39 | face_quality | cf_face-quality_80_60_61.68M | 0.45 | 2231 | 3736.3 |
40 | FPN-R18(light-weight) | cf_FPN-resnet18_Endov_240_320_13.75G | 28.8 | 34.7 | 64.1 |
41 | Hourglass | cf_hourglass_mpii_256_256_10.2G | 55.73 | 17.9 | 52.2 |
42 | tiny-yolov3 | dk_tiny-yolov3_416_416_5.46G | 8.11 | 123.3 | 165.7 |
43 | yolov4 | dk_yolov4_coco_416_416_60.1G | 72.21 | 13.8 | 14.8 |
44 | pruned_yolov4 | dk_yolov4_coco_416_416_0.36_38.2G | 53.7 | 18.6 | 20.5 |
45 | Inception_resnet_v2 | tf_inceptionresnetv2_imagenet_299_299_26.35G | 44.17 | 22.6 | 23.2 |
46 | Inception_v1 | tf_inceptionv1_imagenet_224_224_3G | 5.38 | 185.7 | 214.7 |
47 | Inception_v3 | tf_inceptionv3_imagenet_299_299_11.45G | 16.91 | 59.1 | 63.5 |
48 | Inception_v4 | tf_inceptionv4_imagenet_299_299_24.55G | 34.35 | 29.1 | 30.1 |
49 | Mobilenet_v1 | tf_mobilenetv1_0.25_imagenet_128_128_27.15M | 0.88 | 1133.2 | 2053.8 |
50 | Mobilenet_v1 | tf_mobilenetv1_0.5_imagenet_160_160_150.07M | 1.38 | 725.8 | 1127.5 |
51 | Mobilenet_v1 | tf_mobilenetv1_1.0_imagenet_224_224_1.14G | 3.3 | 302.9 | 387.5 |
52 | Mobilenet_v2 | tf_mobilenetv2_1.0_imagenet_224_224_0.59G | 4.05 | 247 | 299.9 |
53 | Mobilenet_v2 | tf_mobilenetv2_1.4_imagenet_224_224_1.16G | 5.52 | 181.1 | 208.2 |
54 | resnet_v1_50 | tf_resnetv1_50_imagenet_224_224_6.97G | 12.61 | 79.3 | 84.1 |
55 | resnet_v1_101 | tf_resnetv1_101_imagenet_224_224_14.4G | 23.13 | 43.2 | 44.6 |
56 | resnet_v1_152 | tf_resnetv1_152_imagenet_224_224_21.83G | 33.56 | 29.8 | 30.4 |
57 | vgg_16 | tf_vgg16_imagenet_224_224_30.96G | 52.04 | 19.2 | 19.5 |
58 | vgg_19 | tf_vgg19_imagenet_224_224_39.28G | 59.56 | 16.8 | 17 |
59 | ssd_mobilenet_v1 | tf_ssdmobilenetv1_coco_300_300_2.47G | 9.09 | 110 | 163.2 |
60 | ssd_mobilenet_v2 | tf_ssdmobilenetv2_coco_300_300_3.75G | 12.55 | 79.7 | 102.3 |
61 | ssd_resnet_50_v1_fpn | tf_ssdresnet50v1_fpn_coco_640_640_178.4G | 361.24 | 2.8 | 5.2 |
62 | yolov3_voc | tf_yolov3_voc_416_416_65.63G | 74.85 | 13.3 | 13.8 |
63 | mlperf_ssd_resnet34 | tf_mlperf_resnet34_coco_1200_1200_433G | 555.45 | 1.8 | 2.6 |
64 | Inception_v2 | tf_inceptionv2_imagenet_224_224_3.88G | 10.66 | 93.8 | 100.6 |
65 | resnet_v2_50 | tf_resnetv2_50_imagenet_299_299_13.1G | 27.89 | 35.8 | 42.8 |
66 | resnet_v2_101 | tf_resnetv2_101_imagenet_299_299_26.78G | 48.29 | 20.7 | 23.6 |
67 | resnet_v2_152 | tf_resnetv2_152_imagenet_299_299_40.47G | 67.53 | 14.8 | 16.1 |
68 | ssdlite_mobilenetv2 | tf_ssdlite_mobilenetv2_coco_300_300_1.5G | 9.77 | 102.3 | 143.1 |
69 | ssd_inceptionv2 | tf_ssdinceptionv2_coco_300_300_9.62G | 25.4 | 39.3 | 44.9 |
70 | Mobilenet_v2 | tf_mobilenetv2_cityscapes_1024_2048_132.74G | 621.96 | 1.6 | 2.7 |
71 | efficientnet-edgetpu-S | tf_efficientnet-edgetpu-S_imagenet_224_224_4.72G | 10.32 | 96.9 | 104.4 |
72 | efficientnet-edgetpu-M | tf_efficientnet-edgetpu-M_imagenet_240_240_7.34G | 14.22 | 70.3 | 74.4 |
73 | efficientnet-edgetpu-L | tf_efficientnet-edgetpu-L_imagenet_300_300_19.36G | 35.19 | 28.4 | 31.2 |
74 | mlperf_resnet50 | tf_mlperf_resnet50_imagenet_224_224_8.19G | 13.92 | 71.8 | 75.8 |
75 | refinedet | tf_refinedet_VOC_320_320_81.9G | 94.85 | 10.5 | 13 |
76 | mobilenet_edge_1.0 | tf_mobilenetEdge1.0_imagenet_224_224_990M | 5.00 | 199.7 | 234 |
77 | mobilenet_edge_0.75 | tf_mobilenetEdge0.75_imagenet_224_224_624M | 4.14 | 241.3 | 291.9 |
78 | refinedet_medical | tf_RefineDet-Medical_EDD_320_320_9.83G | 14.5 | 69 | 84.3 |
79 | pruned_rcan | tf_rcan_DIV2K_360_640_0.98_86.95G | 132.69 | 7.5 | 7.8 |
80 | resnet50 | tf2_resnet50_imagenet_224_224_7.76G | 14.11 | 70.8 | 74.9 |
81 | Mobilenet_v1 | tf2_mobilenetv1_imagenet_224_224_1.15G | 3.35 | 298.6 | 380.6 |
82 | Inception_v3 | tf2_inceptionv3_imagenet_299_299_11.5G | 17.03 | 58.7 | 63 |
83 | 2d-unet | tf2_2d-unet_nuclei_128_128_5.31G | 6.9 | 144.8 | 158.3 |
84 | ERFNet | tf2_erfnet_cityscapes_512_1024_54G | 145.22 | 6.9 | 12.8 |
85 | efficientnet-b0 | tf2_efficientnet-b0_imagenet_224_224_0.36G | - | - | - |
86 | ENet | pt_ENet_cityscapes_512_1024_8.6G | 111.6 | 9 | 21 |
87 | SemanticFPN | pt_SemanticFPN_cityscapes_256_512_10G | 29.41 | 34 | 76.6 |
88 | ResNet20-face | pt_facerec-resnet20_mixed_112_96_3.5G | 6.21 | 161 | 167.1 |
89 | face quality | pt_face-quality_80_60_61.68M | 0.49 | 2047.3 | 3742.3 |
90 | multi_task_v2 | pt_MT-resnet18_mixed_320_512_13.65G | 33.1 | 30.2 | 42.8 |
91 | face_reid_large | pt_facereid-large_96_96_515M | 1.19 | 839.6 | 1058.4 |
92 | face_reid_small | pt_facereid-small_80_80_90M | 0.53 | 1893.6 | 3055.3 |
93 | person_reid | pt_personreid-res50_market1501_256_128_5.4G | 10.42 | 96 | 102.3 |
94 | person_reid | pt_personreid-res18_market1501_176_80_1.1G | 3.03 | 329.9 | 368.9 |
95 | pointpillars | pt_pointpillars_kitti_12000_100_10.8G | 48.95 | 20.4 | 28.8 |
96 | salsanext | pt_salsanext_semantic-kitti_64_2048_0.6_20.4G | 181.06 | 5.5 | 18.6 |
97 | FPN-R18 (light-weight) | pt_FPN-resnet18_covid19-seg_352_352_22.7G | 26.63 | 37.5 | 39.9 |
98 | 2d-unet | pt_unet_chaos-CT_512_512_23.3G | 54.58 | 18.3 | 22.8 |
99 | surround-view pointpillars | pt_pointpillars_nuscenes_40000_64_108G | 465.21 | 2.1 | 5 |
100 | salsanext_v2 | pt_salsanextv2_semantic-kitti_64_2048_0.75_32G | 246.92 | 4.0 | 10.1 |
101 | centerpoint | pt_centerpoint_astyx_2560_40_54G | 784.6 | 1.3 | 4.2 |
102 | pointpainting | pt_pointpainting_nuscenes_126G | 820.21 | 1.2 | 2.6 |
103 | multi_task_v3 | pt_multitaskv3_mixed_320_512_25.44G | 61.39 | 16.3 | 25.9 |
104 | FADnet | pt_fadnet_sceneflow_576_960_359G | - | - | - |
105 | SA-gate | pt_sa-gate_NYUv2_360_360_178G | - | - | - |
106 | Bayesian Crowd Counting | pt_BCC_shanghaitech_800_1000_268.9G | 292.42 | 3.4 | 3.9 |
107 | PMG | pt_pmg_rp2k_224_224_2.28G | 6.82 | 146.5 | 160.1 |
108 | SemanticFPN-mobilenetv2 | pt_SemanticFPN-mobilenetv2_cityscapes_512_1024_5.4G | 100.17 | 10 | 28.1 |
- | Inception_v3 | torchvision_inception_v3 | 16.85 | 59.3 | 63.8 |
- | SqueezeNet | torchvision_squeezenet | 4.52 | 221.2 | 385 |
- | resnet50 | torchvision_resnet50 | 14.47 | 69.1 | 72.7 |
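One quick way to shortlist candidates from a table like this is to sort the rows by single-thread latency. The following is only a sketch, assuming the rows have been saved as pipe-separated text (benchmarks.txt is a hypothetical file name) and that latency is a reasonable first proxy for how hard a model works the board. slowest_models is a helper name made up for this example.

```shell
# Print the N slowest models (highest single-thread E2E latency) from
# pipe-separated benchmark rows read on stdin, as "<latency_ms> <name>".
# Rows whose latency field is not a number (the header, or "-") are skipped.
# Hypothetical helper, not part of Vitis-AI.
slowest_models() {
    n="${1:-5}"
    awk -F'|' '
        {
            lat = $4
            gsub(/[ \t]/, "", lat)            # strip blanks around the latency
            name = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name) # trim the model name
            if (lat ~ /^[0-9]+(\.[0-9]+)?$/) print lat, name
        }' | LC_ALL=C sort -k1,1nr | head -n "$n"
}
```

Running `slowest_models 5 < benchmarks.txt` on the rows above lists pointpainting (820.21 ms) first and centerpoint (784.6 ms) second.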
In order to try this out, I first had to get the Vitis-AI 1.4 design for KV260 up and running.
Here are some instructions to quickly get started with the Vitis-AI 1.4 design for KV260:
- Download the following SD card image:
https://www.xilinx.com/member/forms/download/design-license-xef.html?filename=xilinx-kv260-dpu-v2020.2-v1.4.0.img.gz
- Program it to a 32GB micro SD card with Balena Etcher, then boot the KV260
- With an internet connection, download and extract the following archives of video and image files:
root@xilinx-k26-starterkit-2020_2:~# wget "https://www.xilinx.com/bin/public/openDownload?filename=vitis_ai_runtime_r1.4.0_image_video.tar.gz" -O vitis_ai_runtime_r1.4.0_image_video.tar.gz
root@xilinx-k26-starterkit-2020_2:~# tar -xvzf vitis_ai_runtime_r1.4.0_image_video.tar.gz -C Vitis-AI/demo/VART
root@xilinx-k26-starterkit-2020_2:~# wget "https://www.xilinx.com/bin/public/openDownload?filename=vitis_ai_library_r1.4.0_images.tar.gz" -O vitis_ai_library_r1.4.0_images.tar.gz
root@xilinx-k26-starterkit-2020_2:~# tar -xvzf vitis_ai_library_r1.4.0_images.tar.gz -C Vitis-AI/demo/Vitis-AI-Library
root@xilinx-k26-starterkit-2020_2:~# wget "https://www.xilinx.com/bin/public/openDownload?filename=vitis_ai_library_r1.4.0_video.tar.gz" -O vitis_ai_library_r1.4.0_video.tar.gz
root@xilinx-k26-starterkit-2020_2:~# tar -xvzf vitis_ai_library_r1.4.0_video.tar.gz -C Vitis-AI/demo/Vitis-AI-Library
- Query the DPU core with the xdputil utility:
root@xilinx-k26-starterkit-2020_2:~# xdputil query
{
"DPU IP Spec":{
"DPU Core Count":1,
"DPU Target Version":"v1.4.1",
"IP version":"v3.3.0",
"generation timestamp":"2021-06-07 19-15-00",
"git commit id":"df4d0c7",
"git commit time":2106071910,
"regmap":"1to1 version"
},
"VAI Version":{
"libvart-runner.so":"Xilinx vart-runner Version: 1.4.0-fa49b842f283242091476cf8e1ae4d242a2a838e 2021-07-14-07:13:01 ",
"libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 1.4.0-01d12d1134678e1400e683d32e88bc77886f2247 2021-07-14 07:14:27 [UTC] ",
"libxir.so":"Xilinx xir Version: xir-ff89b11dcabb00eef6d148fcf660c8e6d02eb184 2021-07-14-07:12:09",
"target_factory":"target-factory.1.4.0 ce1b39e329cc06cb7545e8aa39174fb8b9969f0b"
},
"kernels":[
{
"DPU Arch":"DPUCZDX8G_ISA0_B4096_MAX_BG2",
"DPU Frequency (MHz)":300,
"IP Type":"DPU",
"Load Parallel":2,
"Load augmentation":"enable",
"Load minus mean":"disable",
"Save Parallel":2,
"XRT Frequency (MHz)":300,
"cu_addr":"0xa0010000",
"cu_handle":"0xaaaae4a4e220",
"cu_idx":0,
"cu_mask":1,
"cu_name":"DPUCZDX8G:DPUCZDX8G_1",
"device_id":0,
"fingerprint":"0x1000020f6014407",
"name":"DPU Core 0"
}
]
}
root@xilinx-k26-starterkit-2020_2:~#
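The fields that matter most for this test are the DPU architecture and its clock. As a rough sketch, the helper below (dpu_summary is a name made up here) pulls both out of the xdputil query output and estimates the peak throughput. It assumes each key sits on its own line, as in the output above, and that the B4096 part of the architecture name means 4096 operations per clock cycle.

```shell
# Read `xdputil query` output on stdin and print "<arch> <MHz> <peak GOPS>".
# Assumptions: one JSON key per line, and "B<n>" in the architecture name
# is the peak number of operations per clock cycle.
# Hypothetical helper, not part of Vitis-AI.
dpu_summary() {
    awk -F'"' '
        /"DPU Arch"/     { arch = $4 }
        /"DPU Frequency/ { mhz = $3; gsub(/[^0-9.]/, "", mhz) }
        END {
            ops = 0
            # Pick the digits after "_B" (e.g. 4096), ignoring "_BG2".
            if (match(arch, /_B[0-9]+/)) ops = substr(arch, RSTART + 2, RLENGTH - 2)
            printf "%s %s %.1f\n", arch, mhz, ops * mhz / 1000
        }'
}
```

With the output above, `xdputil query | dpu_summary` prints `DPUCZDX8G_ISA0_B4096_MAX_BG2 300 1228.8`, which is a theoretical peak of about 1.2 TOPS under those assumptions.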
- Optimize the DDR's QoS settings:
root@xilinx-k26-starterkit-2020_2:~# cd dpu_sw_optimize/zynqmp/
root@xilinx-k26-starterkit-2020_2:~/dpu_sw_optimize/zynqmp# ./zynqmp_dpu_optimize.sh
Start QoS config ...[✔]
root@xilinx-k26-starterkit-2020_2:~/dpu_sw_optimize/zynqmp# cd -
root@xilinx-k26-starterkit-2020_2:~#
- Turn off verbose kernel messages on the console:
root@xilinx-k26-starterkit-2020_2:~# dmesg -D
- Run the pointpainting performance example application for 30 seconds, first with 1 thread (-t 1), then with 4 threads (-t 4):
root@xilinx-k26-starterkit-2020_2:~# cd Vitis-AI/demo/Vitis-AI-Library/samples/pointpainting
root@xilinx-k26-starterkit-2020_2:~/Vitis-AI/demo/Vitis-AI-Library/samples/pointpainting# ./test_performance_pointpainting semanticfpn_nuimage_576_320_pt pointpainting_nuscenes_40000_64_0_pt pointpainting_nuscenes_40000_64_1_pt ./test_performance_pointpainting.list -t 1 -s 30
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1112 12:36:49.335223 2276 benchmark.hpp:184] writing report to <STDOUT>
I1112 12:36:49.335584 2276 benchmark.hpp:211] waiting for 0/30 seconds, 1 threads running
I1112 12:36:59.335726 2276 benchmark.hpp:211] waiting for 10/30 seconds, 1 threads running
I1112 12:37:09.335916 2276 benchmark.hpp:211] waiting for 20/30 seconds, 1 threads running
I1112 12:37:19.336179 2276 benchmark.hpp:219] waiting for threads terminated
FPS=1.2936
root@xilinx-k26-starterkit-2020_2:~/Vitis-AI/demo/Vitis-AI-Library/samples/pointpainting# ./test_performance_pointpainting semanticfpn_nuimage_576_320_pt pointpainting_nuscenes_40000_64_0_pt pointpainting_nuscenes_40000_64_1_pt ./test_performance_pointpainting.list -t 4 -s 30
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1112 12:36:49.335223 2276 benchmark.hpp:184] writing report to <STDOUT>
I1112 12:36:49.335584 2276 benchmark.hpp:211] waiting for 0/30 seconds, 4 threads running
I1112 12:36:59.335726 2276 benchmark.hpp:211] waiting for 10/30 seconds, 4 threads running
I1112 12:37:09.335916 2276 benchmark.hpp:211] waiting for 20/30 seconds, 4 threads running
I1112 12:37:19.336179 2276 benchmark.hpp:219] waiting for threads terminated
FPS=2.61017
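To compare runs, the FPS line can be pulled out of each log. A small sketch, assuming the benchmark output of each run was saved to a file (run_t1.log and run_t4.log are hypothetical names). fps_of and fps_speedup are helpers invented for this example.

```shell
# Print the value of the last "FPS=" line in a saved benchmark log.
# Hypothetical helper, not part of Vitis-AI.
fps_of() {
    awk -F'=' '/^FPS=/ { fps = $2 } END { print fps }' "$1"
}

# Print the ratio of the second log's FPS to the first log's FPS.
# Exits with status 1 if the first log has no usable FPS value.
fps_speedup() {
    awk -v a="$(fps_of "$1")" -v b="$(fps_of "$2")" \
        'BEGIN { if (a + 0 > 0) printf "%.2f\n", b / a; else exit 1 }'
}
```

With the two results above (1.2936 and 2.61017 fps), `fps_speedup run_t1.log run_t4.log` prints 2.02, so four threads roughly double the throughput.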
- Query the power metrics:
root@xilinx-k26-starterkit-2020_2:~# xmutil platformstats -p
Power Utilization
SOM total power : 5350 mW
SOM total current : 1068 mA
SOM total voltage : 5005 mV
AMS CTRL
System PLLs voltage measurement, VCC_PSLL : 1201 mV
PL internal voltage measurement, VCC_PSBATT : 716 mV
Voltage measurement for six DDR I/O PLLs, VCC_PSDDR_PLL : 1797 mV
VCC_PSINTFP_DDR voltage measurement : 840 mV
PS Sysmon
LPD temperature measurement : 30 C
FPD temperature measurement (REMOTE) : 30 C
VCC PS FPD voltage measurement (supply 2) : 842 mV
PS IO Bank 500 voltage measurement (supply 6) : 1789 mV
VCC PS GTR voltage : 856 mV
VTT PS GTR voltage : 1804 mV
PL Sysmon
PL temperature : 29 C
root@xilinx-k26-starterkit-2020_2:~#
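As a sanity check, the reported SOM power should be close to the SOM voltage multiplied by the SOM current. The sketch below reads the xmutil platformstats -p output on stdin and prints both numbers side by side. som_power_check is a hypothetical helper name, and the parsing assumes the three "SOM total" lines keep the format shown above.

```shell
# Compare the reported SOM power with voltage * current, both in mW.
# Reads `xmutil platformstats -p` output on stdin.
# Hypothetical helper, not part of xmutil.
som_power_check() {
    awk -F':' '
        /SOM total power/   { p = $2 + 0 }   # mW
        /SOM total current/ { i = $2 + 0 }   # mA
        /SOM total voltage/ { v = $2 + 0 }   # mV
        END { printf "reported=%d mW computed=%.0f mW\n", p, v * i / 1000 }'
}
```

For the idle readings above, `xmutil platformstats -p | som_power_check` gives reported=5350 mW and computed=5345 mW (5005 mV x 1068 mA / 1000), so the figures agree to within rounding.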
To read the power metrics while the AI application is running, run the application as a background task (with '&'), here for 300 seconds, and then query the power from the same shell:
root@xilinx-k26-starterkit-2020_2:~/Vitis-AI/demo/Vitis-AI-Library/samples/pointpainting# ./test_performance_pointpainting semanticfpn_nuimage_576_320_pt pointpainting_nuscenes_40000_64_0_pt pointpainting_nuscenes_40000_64_1_pt ./test_performance_pointpainting.list -t 4 -s 300 &
root@xilinx-k26-starterkit-2020_2:~# xmutil platformstats -p
Using this technique, we measured a SOM total power above 10 W!
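A single query is only a snapshot. To catch the peak, the query can be repeated while the benchmark runs in the background, keeping the highest reading. This is a minimal sketch, assuming the "SOM total power" line keeps the format shown earlier. peak_power_mw is a name invented here, not an xmutil feature.

```shell
# Run a power-reporting command COUNT times, INTERVAL seconds apart, and
# print the highest "SOM total power" reading in mW.
# Usage: peak_power_mw COUNT INTERVAL COMMAND [ARGS...]
# Hypothetical helper, not part of xmutil.
peak_power_mw() {
    count="$1"
    interval="$2"
    shift 2
    peak=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # Keep only the number from a line such as "SOM total power : 5350 mW".
        mw="$("$@" | awk -F':' '/SOM total power/ { print $2 + 0; exit }')"
        if [ -n "$mw" ] && [ "$mw" -gt "$peak" ]; then
            peak="$mw"
        fi
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    echo "$peak"
}
```

With the benchmark already running in the background, `peak_power_mw 60 1 xmutil platformstats -p` samples once per second for about a minute and prints the largest value seen.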
Cheers!