ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 4

Applies to: Machine Learning


History

Version | Date | Notes
1.0.0 | September 2020 | First public release
1.1.0 | November 2020 | Added application written in Python (version 2B)

Introduction

This Technical Note (TN for short) belongs to the series introduced here. In particular, it illustrates the execution of different versions of an inference application (a fruit classifier), making use of the model described in this section, on the NXP i.MX8M Plus EVK board. In addition, this document compares the results achieved to the ones produced by the i.MX8M-powered Mito8M SoM detailed here.

Specifically, the following versions of the application were tested:

  • Version 1: This version is the same as the one described in this article. As such, inference is implemented in software and is applied to images retrieved from files.
  • Version 2A: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
  • Version 2B: This is a Python alternative to version 2A.
  • Version 3: This is like version 2A, but the inference is applied to the frames captured live from an image sensor.

Testbed

The kernel and the root file system of the tested platform were built with the L5.4.24_2.1.0 release of the Yocto Board Support Package (BSP) for the i.MX 8 family of devices, with support for eIQ: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".

The following table details the relevant specs of the testbed.

NXP Linux BSP release | L5.4.24_2.1.0
Inference engine | TensorFlow Lite 2.1
Maximum ARM cores frequency [MHz] | 1800
SDRAM memory frequency (LPDDR4) [MHz] | 2000
Governor | ondemand

Model deployment and inference applications

Version 1 (C++)

The C++ application previously used and described here was adapted to work with the new NXP Linux BSP release. It now uses OpenCV 4.2.0 to pre-process the input image and TensorFlow Lite (TFL) 2.1 as the inference engine. It still supports all three TFL models previously tested on the Mito8M SoM (a Python sketch of the resulting pipeline is shown after the list):

  • 32-bit floating-point model
  • half-quantized model (post-training 8-bit quantization of the weights only)
  • fully-quantized model (TensorFlow v1 quantization-aware training and 8-bit quantization of the weights and activations).
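
For illustration, the following Python sketch reproduces the same pipeline (image loading, center crop, resize to the 224x224 network input, tensor filling, inference, and report of the top results) using the standard TensorFlow Lite Python API. It is only a minimal sketch: the file names are those appearing in the dumps below, the 0-1 input range mirrors the application's output messages, and the actual version 1 application is written in C++.

# Sketch of a version-1-style classifier: OpenCV pre-processing + TFLite inference.
# Paths, the label file format and the 0-1 input range are assumptions here.
import cv2
import numpy as np
import tensorflow as tf  # TensorFlow 2.x; provides tf.lite.Interpreter

MODEL = "my_converted_model.tflite"
LABELS = "labels.txt"
IMAGE = "testdata/red-apple1.jpg"

labels = [line.strip() for line in open(LABELS)]

# Pre-processing: center-crop to a square, then resize to the model input size.
img = cv2.imread(IMAGE)
h, w = img.shape[:2]
side = min(h, w)
y0, x0 = (h - side) // 2, (w - side) // 2
img = img[y0:y0 + side, x0:x0 + side]
img = cv2.resize(img, (224, 224))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)                 # model expects RGB
data = np.expand_dims(img.astype(np.float32) / 255.0, 0)   # 0-1 range, NHWC

# Inference with the TFLite interpreter (CPU).
interp = tf.lite.Interpreter(model_path=MODEL)
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]
interp.set_tensor(inp["index"], data)
interp.invoke()
scores = interp.get_tensor(out["index"])[0]

# Report the classes sorted by descending score, as the C++ application does.
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.6g}\t{labels[i]}")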

Version 2A (C++)

The version 1 application was then modified to accelerate the inference using the NPU (ML module) of the i.MX8M Plus SoC. This is possible because the TensorFlow Lite library runs inference on the GPU/ML module through the Android NN API, whose driver implementation is provided by the GPU/ML module driver.

Neither the floating-point nor the half-quantized model works with the NPU, however. Moreover, the GPU/ML module driver does not support per-channel quantization yet, so post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on the CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case. Consequently, only the fully-quantized model was tested with this version of the application.
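
For reference, the following minimal sketch outlines this TensorFlow v1 flow (quantization-aware training followed by a fully-quantized conversion). The training and freezing steps are only hinted at with comments, and the (mean, std) input statistics are placeholders; the tensor names are the ones appearing in the dumps later in this document.

# TensorFlow 1.x sketch: quantization-aware training + fully-quantized conversion.
import tensorflow as tf  # TensorFlow 1.x API

# 1) Training phase: insert fake-quantization nodes into the training graph,
#    then train as usual and save a checkpoint.
tf.contrib.quantize.create_training_graph(input_graph=tf.get_default_graph(),
                                          quant_delay=0)
# ... training and checkpointing code omitted ...

# 2) Export phase: rewrite the inference graph for evaluation, restore the
#    checkpoint and freeze it to a .pb file.
tf.contrib.quantize.create_eval_graph(input_graph=tf.get_default_graph())
# ... freezing code omitted; assume it produces 'frozen_model.pb' ...

# 3) Conversion phase: convert the frozen graph to a fully-quantized (uint8)
#    TFLite model. The (mean, std) pair below is a placeholder; it depends on
#    how the input was normalized during training.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "frozen_model.pb",
    input_arrays=["conv2d_input"],
    output_arrays=["activation_5/Softmax"])
converter.inference_type = tf.uint8
converter.quantized_input_stats = {"conv2d_input": (0.0, 255.0)}
with open("my_fruits_model_qatlegacy.tflite", "wb") as f:
    f.write(converter.convert())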

Version 2B (Python)

The version 2A application was then ported to Python. This version is functionally equivalent to version 2A, which is written in C++; the goal of version 2B is to compare the performance of the two implementations. Generally, Python has the advantage of being easier to work with, at the cost of slower execution. In this case, however, the inference performance is pretty much the same between the two versions, because the Python APIs act only as a wrapper around the core TensorFlow library, which is written in C++ (and other "fast" languages). As detailed in this section, the overall time differs significantly because it also takes into account the pre/post-processing computations: these do not leverage the NPU accelerator and are therefore more affected by the slower Python code. Nevertheless, when the model is much more complex, as usually occurs in real-world cases, this overhead may become negligible and thus tolerable. In conclusion, the use of Python should not be discarded a priori because of performance concerns; depending on the specific use case, it can be a valid option to consider.

Version 3 (C++)

A new C++ application was written to apply the inference to the frames captured from the image sensor (OV5640) of a camera module, instead of to images retrieved from files. This version uses OpenCV 4.2.0 to control the camera and to pre-process the frames. Like version 2A, inference runs on the NPU, so only the fully-quantized model was tested.
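
Although version 3 is written in C++, the capture-and-classify loop can be sketched in Python with the same OpenCV APIs; the camera index, the drawing of the result on the frame, and the window handling are assumptions made for illustration.

# Sketch of a version-3-style loop: capture frames with OpenCV and classify them.
# The camera index and display details are assumptions; the actual version 3
# application is written in C++ and runs inference on the NPU.
import cv2
import numpy as np
import tensorflow as tf

interp = tf.lite.Interpreter(model_path="my_fruits_model_qatlegacy.tflite")
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]
labels = [line.strip() for line in open("labels.txt")]

cap = cv2.VideoCapture(0)  # OV5640 camera module (V4L2 device 0 assumed)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Same pre-processing as the file-based versions: crop, resize, RGB.
    h, w = frame.shape[:2]
    side = min(h, w)
    roi = frame[(h - side) // 2:(h + side) // 2, (w - side) // 2:(w + side) // 2]
    rgb = cv2.cvtColor(cv2.resize(roi, (224, 224)), cv2.COLOR_BGR2RGB)
    interp.set_tensor(inp["index"], np.expand_dims(rgb, 0).astype(inp["dtype"]))
    interp.invoke()
    scores = interp.get_tensor(out["index"])[0]
    best = int(np.argmax(scores))
    # Overlay the best-scoring label on the live frame.
    cv2.putText(frame, labels[best], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("classifier", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()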

Running the applications

As stated in the first article of this series, one of the goals is to evaluate the performance of the inference applications. Before and after the execution of the inference, other operations, generally referred to as pre/post-processing, are performed. Strictly speaking, these operations are not part of the actual inference and are therefore measured separately.

In order to achieve reproducible and reliable results, the following measures were taken (see also the sketch after this list):

  • When possible, the inference was repeated several times and the average execution time was computed.
  • All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
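
A minimal Python sketch of this timing scheme (warm-up measured separately, followed by a few timed runs that are averaged) is shown below; the number of runs is an arbitrary choice.

# Sketch of the timing scheme used by the applications: one warm-up inference
# measured separately, then several timed runs averaged.
import time

def time_inference(interpreter, runs=3):
    t0 = time.perf_counter()
    interpreter.invoke()                      # first run: graph initialization
    warmup_ms = (time.perf_counter() - t0) * 1000.0
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        interpreter.invoke()
        times.append((time.perf_counter() - t0) * 1000.0)
    print(f"Warmup time: {warmup_ms:.2f} ms")
    for i, t in enumerate(times, 1):
        print(f"Inference time {i}: {t:.2f} ms")
    print(f"Average inference time: {sum(times) / len(times):.2f} ms")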

Version 1 (no NPU acceleration)

The following sections detail the execution of the first version of the classifier on the embedded platform. The number of threads was also tweaked in order to test different configurations. During the execution, the well-known htop utility was used to monitor the system. This tool is very convenient for retrieving useful information such as core allocation, processor load, and the number of running threads.

Floating-point model

The following dump refers to the execution of the application when using the floating-point model.

root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 2 my_converted_model.tflite labels.txt testdata/red-apple1.jpg 
Number of threads: undefined
Warmup time: 92.4871 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 1
Input tensor name: conv2d_8_input
Selected order of channels: RGB
Selected pixel values range: 0-1
Filling time: 0.923276 ms
Inference time 1: 88.2438 ms
Inference time 2: 89.3992 ms
Inference time 3: 86.3731 ms
Average inference time: 88.0054 ms
Total prediction time: 88.9287 ms
Output tensor index: 0
Output tensor name: Identity
Top results:
 1      Red Apple
 1.13485e-10    Orange
 5.58774e-18    Avocado
 7.49401e-20    Hand
 1.40373e-22    Banana
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

Thread parameter unspecified


Thread parameter set to 1


Thread parameter set to 2

Half-quantized model

The following dump refers to the execution of the application in combination with the half-quantized model.

root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 2 my_fruits_model_1.12_quant.tflite labels.txt testdata/red-apple1.jpg
Number of threads: undefined
Warmup time: 180.551 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 12
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: 0-1
Filling time: 0.811773 ms
Inference time 1: 176.78 ms
Inference time 2: 184.297 ms
Inference time 3: 176.743 ms
Average inference time: 179.273 ms
Total prediction time: 180.085 ms
Output tensor index: 18
Output tensor name: dense_1/Softmax
Top results:
 1      Red Apple
 1.53349e-07    Orange
 1.67772e-15    Avocado
 7.44711e-18    Banana
 2.47029e-18    Hand


The following screenshot shows the system status during the execution. In this case, the thread parameter was unspecified.


Thread parameter unspecified

Fully-quantized model

The following dump refers to the execution of the application when using the fully-quantized model.

root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg 
Number of threads: undefined
Warmup time: 88.5131 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 14
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: NA
Filling time: 0.290634 ms
Inference time 1: 84.8542 ms
Inference time 2: 85.1227 ms
Inference time 3: 84.8016 ms
Average inference time: 84.9262 ms
Total prediction time: 85.2168 ms
Output tensor index: 5
Output tensor name: activation_5/Softmax
Top results:
 1      Red Apple
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

Thread parameter unspecified


Thread parameter set to 4

Version 2A (C++)

The execution of version 2A of the classifier on the embedded platform is detailed below. During the execution, htop was used to monitor the system. Note that the first execution of model inference using the NN API always takes significantly longer, because of the model graph initialization required by the GPU/ML module, as stated in NXP documentation. Therefore, the time needed for the first inference (warm-up) is measured separately.

root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg 
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate
Warmup time: 3529.8 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 14
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: NA
Filling time: 0.215756 ms
Inference time 1: 1.33429 ms
Inference time 2: 1.31204 ms
Inference time 3: 1.26541 ms
Average inference time: 1.30391 ms
Total prediction time: 1.51967 ms
Output tensor index: 5
Output tensor name: activation_5/Softmax
Top results:
 1      Red Apple

The following screenshot shows the system status while executing the application.


ML-TN-001 4 acceleration.png


It is worth remembering that, when using the NPU accelerator, it is not possible to select the number of threads.

Profiling model execution on NPU

For the sake of completeness, the eIQ profiler log is provided in the following box as well. According to NXP documentation, the log "captures detailed information of the execution clock cycles and DDR data transmission in each layer". Note that the inference time is longer than usual because of the profiler overhead. The input command and the messages printed by the application are in bold to separate them from the log.

root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg 
INFO: Created TensorFlow Lite delegate for NNAPI.
#productname=VIPNano-D+I, pid=0x9f
Created VX Thread: 0xa3ee5fb0
Applied NNAPI delegate
prev_ptrs = 0xffffa369c040
Can't support one shaderCoreCount!
---------------------------Begin VerifyTiling -------------------------
AXI-SRAM = 0 Bytes VIP-SRAM = 260096 Bytes SWTILING_PHASE_FEATURES[1, 1, 1]
  0 TP [(   3  224  224 1,   150528, 0x0xaaaab1874580(0x0xaaaab1874580, 0x(nil)) ->  224  224    3 1,   150528, 0x0xaaaab187db10(0x0xaaaab187db10, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 1 1)] C[  1]
  1 NN [( 224  224    3 1,   150528, 0x0xaaaab187db10(0x0xaaaab187db10, 0x(nil)) ->  111  111   32 1,   394272, 0x0xaaaab1881a90(0x0xaaaab1881a90, 0x(nil))) k(3 3    3,     1152) pad(0 0) pool(2 2, 2 2)] P[  0] C[  2]
  2 NN [( 111  111   32 1,   394272, 0x0xaaaab1881a90(0x0xaaaab1881a90, 0x(nil)) ->  109  109   32 1,   380192, 0x0xaaaab1884270(0x0xaaaab1884270, 0x(nil))) k(3 3   32,     9984) pad(0 0) pool(0 0, 1 1)] P[  1] C[  3]
  3 TP [( 109  109   32 1,   380192, 0x0xaaaab1884270(0x0xaaaab1884270, 0x(nil)) ->   54   54   32 1,    93312, 0x0xaaaab1887410(0x0xaaaab1887410, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(2 2, 2 2)] P[  2] C[  4]
  4 NN [(  54   54   32 1,    93312, 0x0xaaaab1887410(0x0xaaaab1887410, 0x(nil)) ->   26   26   64 1,    43264, 0x0xaaaab188cd90(0x0xaaaab188cd90, 0x(nil))) k(3 3   32,    19968) pad(0 0) pool(2 2, 2 2)] P[  3] C[  5]
  5 NN [(  26   26   64 1,    43264, 0x0xaaaab188cd90(0x0xaaaab188cd90, 0x(nil)) ->   12   12  128 1,    18432, 0x0xaaaab1892710(0x0xaaaab1892710, 0x(nil))) k(3 3   64,    79616) pad(0 0) pool(2 2, 2 2)] P[  4] C[  6]
  6 TP [(  12   12  128 1,    18432, 0x0xaaaab1892710(0x0xaaaab1892710, 0x(nil)) ->  128   12   12 1,    18432, 0x0xaaaab1894ef0(0x0xaaaab1894ef0, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 1 1)] P[  5] C[  7]
  7 TP [(18432    1    1 1,    18432, 0x0xaaaab1894ef0(0x0xaaaab1894ef0, 0x(nil)) ->  256    1    1 1,      256, 0x0xaaaab18965b0(0x0xaaaab18965b0, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 1 1)] P[  6] C[  8]
  8 TP [( 256    1    1 1,      256, 0x0xaaaab18965b0(0x0xaaaab18965b0, 0x(nil)) ->    6    1    1 1,        6, 0x0xaaaab1897c10(0x0xaaaab1897c10, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 1 1)] P[  7] C[  9]
  9 SH [(   6    1    1 1,        6, 0x0xaaaab1897c10(0x0xaaaab1897c10, 0x(nil)) ->    6    1    1 1,        6, 0x0xaaaab187a200(0x0xaaaab187a200, 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 1 1)] P[  8]

Detected Segments
AB_VS (0 - 1)
TL_VS (1 - 2)
AB_VS (3 - 8)
======================== Block [0 - 2] ==============================
  0 TP DD -> VS [(  150528,   150528), IC(       0), KC(       0)]
  1 NN VS -> VS [(  150528,    96000), IC(       0), KC(    1408)]
  2 NN VS -> DD [(   96000,        0), IC(       0), KC(   11648)]
------------------------------------------------------------------
Segment AB (0 - 0)
------------------------------------------------------------------
Segment Tiling (1 - 2)
[VS 24(  0, 24)(224) ->VS 11(  0, 11)( 27) P( 0) F(1)]    [VS 11(  0, 11)( 27) ->DD  9(  0,  9)(  0) P( 0) F(0)]    
[VS 52( 22, 74)(224) ->VS 25( 11, 36)( 27) P( 0) F(1)]    [VS 27(  9, 36)( 27) ->DD 25(  9, 34)(  0) P( 0) F(0)]    
[VS 52( 72,124)(224) ->VS 25( 36, 61)( 27) P( 0) F(1)]    [VS 27( 34, 61)( 27) ->DD 25( 34, 59)(  0) P( 0) F(0)]    
[VS 52(122,174)(224) ->VS 25( 61, 86)( 27) P( 0) F(1)]    [VS 27( 59, 86)( 27) ->DD 25( 59, 84)(  0) P( 0) F(0)]    
[VS 52(172,224)(224) ->VS 25( 86,111)( 27) P( 0) F(1)]    [VS 27( 84,111)( 27) ->DD 25( 84,109)(  0) P( 0) F(1)]    

AXISRAM: Estimate used 0  0.000000%  VIPSRAM: Estimate used 107040  41.154037% M = 25

AXISRAM: Peak used 0  0.000000% VIPSRAM: Peak used 259584  99.803146%
======================== Block [0 - 2] SUCCEED =========================
======================== Block [3 - 8] ==============================
  3 TP DD -> VS [(       0,    93312), IC(       0), KC(       0)]
  4 NN VS -> VS [(   93312,    43264), IC(       0), KC(   20608)]
  5 NN VS -> VS [(   43264,    18432), IC(       0), KC(   79744)]
  6 TP VS -> VS [(   18432,    18432), IC(       0), KC(       0)]
  7 TP VS -> VS [(   18432,      256), IC(       0), KC(       0)]
  8 TP VS -> DD [(     256,        0), IC(       0), KC(       0)]
------------------------------------------------------------------
Segment AB (3 - 8)

AXISRAM: Peak used 0  0.000000% VIPSRAM: Peak used 157184  60.433071%
======================== Block [3 - 8] SUCCEED =========================
F(1) F(0) 
F(1) F(0) 
F(1) F(0) 
F(1) F(0) 
F(1) F(1) 

 id IN [ x  y  w   h ]   OUT  [ x  y  w  h ] (tx, ty, kpc) (ic, kc, kc/ks, ks/eks, kernel_type)
   0 TP DD 0x(nil) [   0    0        3      224] -> VS 0x0x400800 [   0    0      224      224] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)
   1 NN VS 0x0x400800 [   0    0      224       24] -> VS 0x0x425400 [   0    0      111       11] ( 32,   2,   6) (       0,     1408, 100.000000%, 122.222221%, DD)
   2 NN VS 0x0x425400 [   0    0      111       11] -> DD 0x(nil) [   0    0      109        9] ( 55,   2,   6) (       0,    11648, 100.000000%, 116.666664%, DD)
   1 NN VS 0x0x401b40 [   0   22      224       52] -> VS 0x0x4258c5 [   0   11      111       25] ( 56,   2,   6) (       0,     1408, 100.000000%, 122.222221%, DD)
   2 NN VS 0x0x4257e7 [   0    9      111       27] -> DD 0x0x3d5 [   0    9      109       25] ( 55,   2,   6) (       0,    11648, 100.000000%, 116.666664%, DD)
   1 NN VS 0x0x404700 [   0   72      224       52] -> VS 0x0x42639c [   0   36      111       25] ( 56,   2,   6) (       0,     1408, 100.000000%, 122.222221%, DD)
   2 NN VS 0x0x4262be [   0   34      111       27] -> DD 0x0xe7a [   0   34      109       25] ( 55,   2,   6) (       0,    11648, 100.000000%, 116.666664%, DD)
   1 NN VS 0x0x4072c0 [   0  122      224       52] -> VS 0x0x426e73 [   0   61      111       25] ( 56,   2,   6) (       0,     1408, 100.000000%, 122.222221%, DD)
   2 NN VS 0x0x426d95 [   0   59      111       27] -> DD 0x0x191f [   0   59      109       25] ( 55,   2,   6) (       0,    11648, 100.000000%, 116.666664%, DD)
   1 NN VS 0x0x409e80 [   0  172      224       52] -> VS 0x0x42794a [   0   86      111       25] ( 56,   2,   6) (       0,     1408, 100.000000%, 122.222221%, DD)
   2 NN VS 0x0x42786c [   0   84      111       27] -> DD 0x0x23c4 [   0   84      109       25] ( 55,   2,   6) (       0,    11648, 100.000000%, 116.666664%, DD)
   3 TP DD 0x(nil) [   0    0      109      109] -> VS 0x0x400800 [   0    0       54       54] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)
   4 NN VS 0x0x400800 [   0    0       54       54] -> VS 0x0x41c500 [   0    0       26       26] ( 52,   6,   4) (       0,    20608, 100.000000%, 103.205132%, DD)
   5 NN VS 0x0x41c500 [   0    0       26       26] -> VS 0x0x400800 [   0    0       12       12] ( 24,  16,   5) (       0,    79744, 100.000000%, 100.160774%, DD)
   6 TP VS 0x0x400800 [   0    0       12       12] -> VS 0x0x422600 [   0    0      128       12] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)
   7 TP VS 0x0x422600 [   0    0    18432        1] -> VS 0x0x400800 [   0    0      256        1] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)
   8 TP VS 0x0x400800 [   0    0      256        1] -> DD 0x(nil) [   0    0        6        1] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)
   9 SH DD 0x(nil) [   0    0        0        0] -> DD 0x(nil) [   0    0        0        0] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)

PreLoadWeightBiases = 0  nan%
---------------------------End VerifyTiling -------------------------

ArchModelVersion: ARCHCTS@230121
SWTilingVersion: ARCHCTS@230121
ProfileMode: 0
NumNNCores:6
NumNNCoresInt8: 6
NumNNCoresInt16: 6
NumNNCoresFloat16: 0
NumTPCores: 3
NumTPLiteCores: 0
MadPerCore: 64
VIP7Version: 1
InBuffDepth: 9
AccumBufferDepth: 32
DPAmount: 3
XYDPX: 0
XYDPY: 0
ZDP: 3
AXISRAMSize: 0
VIPSRAMSize: 262144
L2CacheWidth: 32
USCCacheSize: 8
BrickMode: 0
SWTiling: 1
SmallBatchEnable: 0
SWTilingPhase1: 1
TPWithFCLayer: 1
TPCircularBufferSupport: 1
KERNEL_HEADER_NOT_CACHED_FIX: 0
NNFCNonPruneAccel: 0
Conv1x1HalfPerformance: 0
DDRLatency: 0
CacheLineModeDisabled: 0
PER_3D_TILE_BUBBLE_FIX: 1
SWConv1x1To1x2: 0
TP_LOCALIZATION_REORDER_DISABLED_Fix: 1
USCCacheControllers: 1
AsyncCopyPerfFix: 1
ZDP3NoCompressFix: 1
ZXDP3KernelReadConflictFix: 1
CoefDecodePerf: 2
VectorPrune: 1
EnableCacheDataFromSRAM: 1
IMAGE_PARTIAL_CACHE_FIX: 0
DDRReadBandWidthLimit: 3.80
DDRWriteBandWidthLimit: 3.80
DDRTotalBandWidthLimit: 3.80
AXISRAMReadBandWidthLimit: 16.00
AXISRAMWriteBandWidthLimit: 16.00
AXISRAMTotalBandWidthLimit: 16.00
AXIBusReadBandWidthLimit: 16.00
AXIBusWriteBandWidthLimit: 16.00
AXIBusTotalBandWidthLimit: 32.00

HANDLE_ABBUFFER: 1
HANDLE_SUBIMAGE: 1
HANDLE_BRANCH: 1

FreqInMHZ: 1000
AxiClockFreqInMHZ: 1000
OutstandingTransfer: 64
InternalWriteBWLimit: 16.00

LanesPerConv: 64
MaxTileSize: 64
AxiSramSlowedDownByAddr: 1
SLOW_NN_REQ_ARBITRATION_FIX: 0

FLOAT_XYDP_X: 1
FLOAT_XYDP_Y: 1
FLOAT_ZDP: 1
SINGLE_PORT_ACC_BUFFER: 1
MAX_ZRL_BIT_WIDTH: 8
MAX_SOC_OUT_STANDING_NUMBER: 32

SWTilingPhase3: 1
AXI_SRAM_ONLY_SW_TILING: 0
VIP_CORE_COUNT: 1
DEPTH_WISE_SUPPORT: 1
NN_WRITE_WITHOUT_USC: 0
EQUIVALENT_VIP_SRAM_WIDTH_IN_BYTE: 32
IMAGE_NOT_PACKED_IN_SRAM: 0
NN_COEF_COMPRESSION_ENHANCEMENT: 1
TP_COMPRESSION_ENHANCEMENT: 1
COEF_DELTA_CORD_OVER_FLOW_ZRL_8BIT_FIX: 1
NumShaderCores: 1
KERNEL_PER_CORE_LESS_THAN_THIRD_COEF_BUFF_DEPTH_FIX: 0
LOW_EFFICIENCY_OF_ID_WRITE_IMGBUF_FIX: 0
DR_JD_Diff_For_Cacheline_Mode_FIX: 1
CONVOUT_FIFO_DEPTH_FIX: 1


===========================
**********Show Perf********
===========================
layer_id:0 layer_name:TensorTranspose
operation_id:0 operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
abs_op_id:0
upstream_layer_num:0 upstream_opertaion_num:0
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:4 downstream_layer_name:ConvolutionReluPoolingLayer2)
InImageX: 3
InImageY: 224
InImageZ: 224
OutImageX: 224 (sub: 224)
OutImageY: 224 (sub: 224)
OutImageZ: 3 (sub: 3)
KernelX: 1
KernelY: 1
KernelZ: 224
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 0
kernelSize: 0
SrcBuf: DDR
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0

kernelDDRReadBW: 0
InImageDDrReadBW: 150528
ReadBW: 150656
WriteBW: 0
CycleCount: 77927


===========================
**********Show Perf********
===========================
layer_id:4 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:1
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:0 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:5 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 224
OrigInImageY: 224
OrigInImageZ: 3
NNOutImageX: 222 (sub: 222)
NNOutImageY: 22 (sub: 22)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 111
FinalOutImageY: 111
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 3
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 1352
kernelSize: 1408
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_FULL_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 1.354838709677419
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4608780470280697261
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 32
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 6099
InImageDDrReadBW: 0
ReadBW: 6227
WriteBW: 0
CycleCount: 12213


===========================
**********Show Perf********
===========================
layer_id:5 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:2
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:4 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_POOLING (downstream_layer_id:1 downstream_layer_name:PoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 111
OrigInImageY: 111
OrigInImageZ: 32
NNOutImageX: 109 (sub: 109)
NNOutImageY: 9 (sub: 9)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 109
FinalOutImageY: 109
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 11113
kernelSize: 11648
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_FULL_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.965753424657534
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606873953072115319
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 55
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 9540
InImageDDrReadBW: 0
ReadBW: 9668
WriteBW: 37746
CycleCount: 14667


===========================
**********Show Perf********
===========================
layer_id:4 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:1
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:0 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:5 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 224
OrigInImageY: 224
OrigInImageZ: 3
NNOutImageX: 222 (sub: 222)
NNOutImageY: 50 (sub: 50)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 111
FinalOutImageY: 111
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 3
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 1352
kernelSize: 1408
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 1.354838709677419
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4608780470280697261
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 56
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 7571
InImageDDrReadBW: 0
ReadBW: 7699
WriteBW: 0
CycleCount: 24949


===========================
**********Show Perf********
===========================
layer_id:5 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:2
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:4 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_POOLING (downstream_layer_id:1 downstream_layer_name:PoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 111
OrigInImageY: 111
OrigInImageZ: 32
NNOutImageX: 109 (sub: 109)
NNOutImageY: 25 (sub: 25)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 109
FinalOutImageY: 109
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 11113
kernelSize: 11648
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.965753424657534
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606873953072115319
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 55
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 10564
InImageDDrReadBW: 0
ReadBW: 10692
WriteBW: 104716
CycleCount: 32561


===========================
**********Show Perf********
===========================
layer_id:4 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:1
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:0 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:5 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 224
OrigInImageY: 224
OrigInImageZ: 3
NNOutImageX: 222 (sub: 222)
NNOutImageY: 50 (sub: 50)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 111
FinalOutImageY: 111
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 3
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 1352
kernelSize: 1408
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 1.354838709677419
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4608780470280697261
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 56
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 7571
InImageDDrReadBW: 0
ReadBW: 7699
WriteBW: 0
CycleCount: 24949


===========================
**********Show Perf********
===========================
layer_id:5 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:2
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:4 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_POOLING (downstream_layer_id:1 downstream_layer_name:PoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 111
OrigInImageY: 111
OrigInImageZ: 32
NNOutImageX: 109 (sub: 109)
NNOutImageY: 25 (sub: 25)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 109
FinalOutImageY: 109
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 11113
kernelSize: 11648
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.965753424657534
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606873953072115319
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 55
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 10564
InImageDDrReadBW: 0
ReadBW: 10692
WriteBW: 104716
CycleCount: 32561


===========================
**********Show Perf********
===========================
layer_id:4 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:1
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:0 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:5 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 224
OrigInImageY: 224
OrigInImageZ: 3
NNOutImageX: 222 (sub: 222)
NNOutImageY: 50 (sub: 50)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 111
FinalOutImageY: 111
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 3
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 1352
kernelSize: 1408
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 1.354838709677419
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4608780470280697261
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 56
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 7571
InImageDDrReadBW: 0
ReadBW: 7699
WriteBW: 0
CycleCount: 24949


===========================
**********Show Perf********
===========================
layer_id:5 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:2
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:4 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_POOLING (downstream_layer_id:1 downstream_layer_name:PoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 111
OrigInImageY: 111
OrigInImageZ: 32
NNOutImageX: 109 (sub: 109)
NNOutImageY: 25 (sub: 25)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 109
FinalOutImageY: 109
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 11113
kernelSize: 11648
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.965753424657534
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606873953072115319
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 55
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 10564
InImageDDrReadBW: 0
ReadBW: 10692
WriteBW: 104716
CycleCount: 32561


===========================
**********Show Perf********
===========================
layer_id:4 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:1
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:0 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:5 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 224
OrigInImageY: 224
OrigInImageZ: 3
NNOutImageX: 222 (sub: 222)
NNOutImageY: 50 (sub: 50)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 111
FinalOutImageY: 111
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 3
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 1352
kernelSize: 1408
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 1.354838709677419
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4608780470280697261
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 56
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 7571
InImageDDrReadBW: 0
ReadBW: 7699
WriteBW: 0
CycleCount: 24949


===========================
**********Show Perf********
===========================
layer_id:5 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:2
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:4 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_POOLING (downstream_layer_id:1 downstream_layer_name:PoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 111
OrigInImageY: 111
OrigInImageZ: 32
NNOutImageX: 109 (sub: 109)
NNOutImageY: 25 (sub: 25)
NNOutImageZ: 32 (sub: 32)
FinalOutImageX: 109
FinalOutImageY: 109
FinalOutImageZ: 32
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 11113
kernelSize: 11648
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: VIP_SRAM
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_STREAM_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.965753424657534
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606873953072115319
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 55
OutImageTileYSize: 2
KernelsPerCore: 6

kernelDDRReadBW: 10564
InImageDDrReadBW: 0
ReadBW: 10692
WriteBW: 104716
CycleCount: 32561


===========================
**********Show Perf********
===========================
layer_id:1 layer_name:PoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_TP
abs_op_id:3
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:5 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:6 downstream_layer_name:ConvolutionReluPoolingLayer2)
InImageX: 109 (sub: 109)
InImageY: 109 (sub: 109)
InImageZ: 32 (sub: 32)
OutImageX: 54
OutImageY: 54
OutImageZ: 32
KernelX: 1
KernelY: 1
KernelZ: 32
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 0
kernelSize: 0
SrcBuf: DDR
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0

kernelDDRReadBW: 0
InImageDDrReadBW: 380192
ReadBW: 380320
WriteBW: 0
CycleCount: 129138


===========================
**********Show Perf********
===========================
layer_id:6 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:4
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_POOLING (upstream_layer_id:1 upstream_layer_name:PoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_CONVOLUTION (downstream_layer_id:7 downstream_layer_name:ConvolutionReluPoolingLayer2)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 54
OrigInImageY: 54
OrigInImageZ: 32
NNOutImageX: 52 (sub: 52)
NNOutImageY: 52 (sub: 52)
NNOutImageZ: 64 (sub: 64)
FinalOutImageX: 26
FinalOutImageY: 26
FinalOutImageZ: 64
KernelX: 3
KernelY: 3
KernelZ: 32
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 19841
kernelSize: 20608
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_FULL_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 1.000000000000000
coefCompression: 0.934931506849315
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182418800017408
coefCompression_llu: 4606596333917003439
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 52
OutImageTileYSize: 6
KernelsPerCore: 4

kernelDDRReadBW: 17809
InImageDDrReadBW: 0
ReadBW: 17937
WriteBW: 0
CycleCount: 47726


===========================
**********Show Perf********
===========================
layer_id:7 layer_name:ConvolutionReluPoolingLayer2
operation_id:0 operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN
abs_op_id:5
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:6 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (downstream_layer_id:2 downstream_layer_name:TensorTranspose)
NumUsedNNCores: 6
ConvOutFIFODepth: 168

OrigInImageX: 26
OrigInImageY: 26
OrigInImageZ: 64
NNOutImageX: 24 (sub: 24)
NNOutImageY: 24 (sub: 24)
NNOutImageZ: 128 (sub: 128)
FinalOutImageX: 12
FinalOutImageY: 12
FinalOutImageZ: 128
KernelX: 3
KernelY: 3
KernelZ: 64
PoolingSize: 2
PoolingStride: 2
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 76726
kernelSize: 79744
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_FULL_CACHE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 0.999959309895833
coefCompression: 0.897413793103448
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607182052296141483
coefCompression_llu: 4606258404393712082
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

OutImageTileXSize: 24
OutImageTileYSize: 16
KernelsPerCore: 5

kernelDDRReadBW: 66293
InImageDDrReadBW: 0
ReadBW: 66421
WriteBW: 0
CycleCount: 40241


===========================
**********Show Perf********
===========================
layer_id:2 layer_name:TensorTranspose
operation_id:0 operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP
abs_op_id:6
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_CONVOLUTION (upstream_layer_id:7 upstream_layer_name:ConvolutionReluPoolingLayer2)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_FULLYCONNECTED (downstream_layer_id:8 downstream_layer_name:FullyConnectedReluLayer)
InImageX: 12
InImageY: 12
InImageZ: 128
OutImageX: 128 (sub: 128)
OutImageY: 12 (sub: 12)
OutImageZ: 12 (sub: 12)
KernelX: 1
KernelY: 1
KernelZ: 128
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 0
kernelSize: 0
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0

kernelDDRReadBW: 0
InImageDDrReadBW: 0
ReadBW: 128
WriteBW: 0
CycleCount: 11879


===========================
**********Show Perf********
===========================
layer_id:8 layer_name:FullyConnectedReluLayer
operation_id:0 operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
abs_op_id:7
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_TENSOR_TRANS (upstream_layer_id:2 upstream_layer_name:TensorTranspose)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_FULLYCONNECTED (downstream_layer_id:9 downstream_layer_name:FullyConnectedReluLayer)
InImageX: 1
InImageY: 1
InImageZ: 18432
OutImageX: 1 (sub: 1)
OutImageY: 1 (sub: 1)
OutImageZ: 256 (sub: 256)
KernelX: 1
KernelY: 1
KernelZ: 18432
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 7078638
kernelSize: 0
SrcBuf: VIP_SRAM
DstBuf: VIP_SRAM
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 0.972156100802951
coefCompression: 1.493328270774571
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4606931623251920668
coefCompression_llu: 4609404171816449099
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

kernelDDRReadBW: 2113922
InImageDDrReadBW: 0
ReadBW: 2114050
WriteBW: 0
CycleCount: 558736


===========================
**********Show Perf********
===========================
layer_id:9 layer_name:FullyConnectedReluLayer
operation_id:0 operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP
abs_op_id:8
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_FULLYCONNECTED (upstream_layer_id:8 upstream_layer_name:FullyConnectedReluLayer)
downstream_layer_num:1 downstream_opertaion_num:1
0) downstream_operation_id:0 downstream_operation_name:VXNNE_OPERATOR_SOFTMAX (downstream_layer_id:3 downstream_layer_name:Softmax2Layer)
InImageX: 1
InImageY: 1
InImageZ: 256
OutImageX: 1 (sub: 1)
OutImageY: 1 (sub: 1)
OutImageZ: 6 (sub: 6)
KernelX: 1
KernelY: 1
KernelZ: 256
PoolingSize: 1
PoolingStride: 1
InputDataSize: 8
OutputDataSize: 8
FP16: 0
archModel_kernelSize: 0
kernelSize: 0
SrcBuf: VIP_SRAM
DstBuf: DDR
KernelBuf: DDR
KernelCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
ImageCacheMode=VXNNE_SRAM_CACHE_MODE_NONE
xOffset: 0, yOffset: 0
coefNonZeroRatio: 0.994791666666667
coefCompression: 32.615384615384613
imageCompression: 1.000000000000000
imageNonZeroRatio: 0.300000000000000

coefNonZeroRatio__llu: 4607135506303898965
coefCompression_llu: 4629787024622011628
imageCompression_llu: 4607182418800017408
imageNonZeroRatio_llu: 4599075939470750515

kernelDDRReadBW: 15029
InImageDDrReadBW: 0
ReadBW: 15157
WriteBW: 6
CycleCount: 6397


===========================
**********Show Perf********
===========================
layer_id:3 layer_name:Softmax2Layer
operation_id:0 operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH
abs_op_id:9
upstream_layer_num:1 upstream_opertaion_num:1
0) upstream_operation_id:0 uptream_operation_name:VXNNE_OPERATOR_FULLYCONNECTED (upstream_layer_id:9 upstream_layer_name:FullyConnectedReluLayer)
downstream_layer_num:0 downstream_opertaion_num:0
prev_ptrs = 0xffffa369c040

Warning: swapHandel, CMD changed

 NN/TP: pre_physical:0x1FE2C040, new_physical:0x1FE2C040 
layer id: 0 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:       290 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        77 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        63 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        80 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        80 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        80 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        74 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        76 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        84 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        76 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 1 layer name:PoolingLayer2 operation[0]:VXNNE_OPERATOR_POOLING target:VXNNE_OPERATION_TARGET_TP.
execution time:       209 us
layer id: 6 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:       140 us
layer id: 7 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:       102 us
layer id: 2 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:       101 us
layer id: 8 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:       469 us
layer id: 9 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:        54 us
layer id: 3 layer name:Softmax2Layer operation[0]:VXNNE_OPERATOR_SOFTMAX target:VXNNE_OPERATION_TARGET_SH.
execution time:       187 us
Warmup time: 3602.98 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 14
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: NA
Filling time: 0.195005 ms
prev_ptrs = 0xffffa369c040

Warning: swapHandel, CMD changed

 NN/TP: pre_physical:0x1FE2C040, new_physical:0x1FE2C040 
layer id: 0 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:       286 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        77 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        59 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        78 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        74 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        81 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        72 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        74 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        74 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        88 us
layer id: 1 layer name:PoolingLayer2 operation[0]:VXNNE_OPERATOR_POOLING target:VXNNE_OPERATION_TARGET_TP.
execution time:       200 us
layer id: 6 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:       105 us
layer id: 7 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        88 us
layer id: 2 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:        82 us
layer id: 8 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:       154 us
layer id: 9 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:        48 us
layer id: 3 layer name:Softmax2Layer operation[0]:VXNNE_OPERATOR_SOFTMAX target:VXNNE_OPERATION_TARGET_SH.
execution time:       131 us
Inference time 1: 2.49207 ms
prev_ptrs = 0xffffa369c040

Warning: swapHandel, CMD changed

 NN/TP: pre_physical:0x1FE2C040, new_physical:0x1FE2C040 
layer id: 0 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:       240 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        74 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        57 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        87 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        81 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        80 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        78 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        81 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        86 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        77 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 1 layer name:PoolingLayer2 operation[0]:VXNNE_OPERATOR_POOLING target:VXNNE_OPERATION_TARGET_TP.
execution time:       209 us
layer id: 6 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:       108 us
layer id: 7 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        90 us
layer id: 2 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:        84 us
layer id: 8 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:       157 us
layer id: 9 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:        48 us
layer id: 3 layer name:Softmax2Layer operation[0]:VXNNE_OPERATOR_SOFTMAX target:VXNNE_OPERATION_TARGET_SH.
execution time:       136 us
Inference time 2: 2.47457 ms
prev_ptrs = 0xffffa369c040

Warning: swapHandel, CMD changed

 NN/TP: pre_physical:0x1FE2C040, new_physical:0x1FE2C040 
layer id: 0 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:       254 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        69 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        60 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        82 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        77 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        77 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        76 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 4 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        76 us
layer id: 5 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        73 us
layer id: 1 layer name:PoolingLayer2 operation[0]:VXNNE_OPERATOR_POOLING target:VXNNE_OPERATION_TARGET_TP.
execution time:       210 us
layer id: 6 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:       107 us
layer id: 7 layer name:ConvolutionReluPoolingLayer2 operation[0]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_NN.
execution time:        89 us
layer id: 2 layer name:TensorTranspose operation[0]:VXNNE_OPERATOR_TENSOR_TRANS target:VXNNE_OPERATION_TARGET_TP.
execution time:        83 us
layer id: 8 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:       155 us
layer id: 9 layer name:FullyConnectedReluLayer operation[0]:VXNNE_OPERATOR_FULLYCONNECTED target:VXNNE_OPERATION_TARGET_TP.
execution time:       185 us
layer id: 3 layer name:Softmax2Layer operation[0]:VXNNE_OPERATOR_SOFTMAX target:VXNNE_OPERATION_TARGET_SH.
execution time:       151 us
Inference time 3: 2.61483 ms
Average inference time: 2.52716 ms
Total prediction time: 2.72216 ms
Output tensor index: 5
Output tensor name: activation_5/Softmax
Top results:
 1	Red Apple
prev_ptrs = 0xffffa369c040
Exit VX Thread: 0xa3ee5fb0

Version 2B

The execution of version 2B of the classifier on the embedded platform is detailed below. As before, htop was used to monitor the system.

root@imx8mpevk:/home/mathias/devel/image_classifier_eIQ_plus# python3 image_classifier.py -m my_fruits_model_qatlegacy.tflite -l labels.txt -i testdata/red-apple1.jpg 
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate.
Warm-up time: 3474.22 ms
Original image size: (600, 600)
Cropped image size: (600, 600)
Resized image size: (224, 224)
Filling time: 0.72 ms
Inference time 1: 1.44 ms
Inference time 2: 1.38 ms
Inference time 3: 1.39 ms
Average inference time: 1.40 ms
Total prediction time: 2.12 ms
Results:
  1.000 Red Apple
  0.000 Orange
  0.000 Hand

Note that the inference time is close to the one of the C++ version, but the filling time (the time needed to fill the input tensor with the image) is longer. This is because Python does not allow the kind of low-level pointer operations that C++ does.
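
To illustrate the point, the following sketch contrasts a vectorized NumPy copy, which is executed by native code and is the closest Python equivalent to the C++ approach, with a per-pixel Python loop; the interp, inp, and img variables are assumed to hold the interpreter, its input details, and a pre-processed 224x224x3 image.

# Filling the input tensor in Python: a single vectorized NumPy copy (fast)
# versus a per-pixel Python loop (very slow). 'interp', 'inp' and 'img' are
# assumed to be the interpreter, its input details and a 224x224x3 array.
import numpy as np

# Vectorized: one bulk copy handled in native code.
interp.set_tensor(inp["index"], np.expand_dims(img, 0).astype(inp["dtype"]))

# Per-pixel loop: each access goes through the Python interpreter; this is the
# kind of overhead C++ avoids by writing straight into the tensor's buffer.
tensor = interp.tensor(inp["index"])()
for y in range(224):
    for x in range(224):
        for c in range(3):
            tensor[0, y, x, c] = img[y, x, c]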

The following screenshot shows the system status while executing the application.


ML-TN-001 4 acceleration python.png

Version 3

The following image shows the execution of the third version of the classifier on the embedded platform. The image sensor is pointed at a red apple, which is correctly classified with 98% confidence. Note that with this camera the frame rate is capped at 30 fps, but it could be much higher: as shown before, the inference on the NPU takes only a few milliseconds per frame.


Version 3 of the application running on the i.MX8M Plus EVK


During the execution, htop was used to monitor the system. The following screenshot shows the system status while executing the application.


htop screenshot during the execution of the classifier version 3

Results

Version 1

The following table lists the prediction times for a single image depending on the model and the thread parameter.

Prediction times

Model | Threads parameter | Prediction time [ms]
Floating-point | unspecified | 89
Floating-point | 1 | 160
Floating-point | 2 | 130
Half-quantized | unspecified | 180
Fully-quantized | unspecified | 85
Fully-quantized | 4 | 29

The prediction time takes into account both the inference time and the time needed to fill the input tensor with the image. Furthermore, the inference time is averaged over several inferences.
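For example, in the fully-quantized run reported above, the 0.29 ms filling time plus the 84.93 ms average inference time gives the 85.22 ms total prediction time.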

The same tests were also repeated using a network file system (NFS) over an Ethernet connection. No significant variations in the prediction times were observed.

In conclusion, to maximize the performance in terms of execution time, the model has to be fully quantized and the number of threads has to be specified explicitly.

Version 2A and 3

In this case, only the fully-quantized model could be tested, and the thread parameter has no effect.

Prediction times

Model | Prediction time [ms]
Fully-quantized | 1.5

Version 2B

Prediction times

Model | Prediction time [ms]
Fully-quantized | 2.1

Results comparison

The following table compares the results achieved to the ones measured on the i.MX8M-based Mito8M SoM.

Prediction times

Platform | BSP | TensorFlow Lite | ARM cores (# / type / max freq. [GHz]) | Acceleration | Model | Threads | Prediction time [ms] | Notes
NXP i.MX8M-based Mito8M SoM | L4.14.98_2.0.0 | 1.12 | 4 / Cortex-A53 / 1.3 | no | Floating-point | unspecified (4) | 220 |
 | | | | | Floating-point | 1 | 220 |
 | | | | | Floating-point | 2 | 390 |
 | | | | | Half-quantized | unspecified (4) | 330 |
 | | | | | Fully-quantized | unspecified (1) | 200 |
 | | | | | Fully-quantized | 4 | 84 |
NXP i.MX8M Plus EVK | L5.4.24_2.1.0 | 2.1 | 4 / Cortex-A53 / 1.8 | no (version 1) | Floating-point | unspecified (4) | 89 |
 | | | | | Floating-point | 1 | 160 |
 | | | | | Floating-point | 2 | 130 |
 | | | | | Half-quantized | unspecified (4) | 180 |
 | | | | | Fully-quantized | unspecified (1) | 85 |
 | | | | | Fully-quantized | 4 | 29 | Interestingly, this time is significantly smaller than the one measured on the i.MX8M (84 ms). This is probably due to improvements at the TFL inference engine level, besides the increased maximum ARM frequency.
 | | | | NPU (version 2A: C++) | Fully-quantized | NA | 1.5 |
 | | | | NPU (version 2B: Python) | Fully-quantized | NA | 2.1 | See also section Version 2B (Python).