ML-TN-002 - Real-time Social Distancing estimation

3,953 bytes added, 13:01, 1 March 2021
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{AppliesTo ORCA TN}}
{{AppliesTo ORCA SBC TN}}
{{InfoBoxBottom}}
Because of the Covid-19 pandemic, everyone has learned to deal with the so-called "Social Distancing" rules very well. When it comes to spaces shared by many people (such as squares, public or private offices, and malls), it is not easy to monitor compliance with these rules in real time.
Automatic systems capable of doing the job have been developed. Most of them are implemented as software running on camera-equipped PCs and make use of visual techniques. This is not a one-size-fits-all solution, however. In many cases, the use of a properly designed embedded platform is mandatory, for example because of tight space constraints, operation in harsh environments, or cost constraints, all requirements that are typical of industrial-grade applications.
To date, though, the computing power required by such complex algorithms has been a hurdle that is difficult to overcome, hindering the adoption of embedded platforms for these tasks. Recently, however, new systems-on-chip (SoCs) integrating Neural Network hardware accelerators have appeared on the market. Thanks to this improvement in computational power, these devices allow the implementation of novel solutions satisfying all the above-mentioned requirements.
This Technical Note illustrates one of these implementations, addressing real-time social distancing estimation. The work started from the publicly available, open-source Social-Distancing project released by the [https://iit.it/ Istituto Italiano di Tecnologia (IIT)], which is described in this [https://arxiv.org/abs/2011.02018v2 paper]. The goal was to port the IIT code onto one of the DAVE Embedded Systems [https://wiki.dave.eu/index.php/Main_Page#Single_Board_Computers Single Board Computers] (SBC) suitable for building an automatic machine vision system for social distancing.
==The hardware/software platform==
The choice fell on the [https://wiki.dave.eu/index.php/ORCA_SBC ORCA SBC], which is powered by the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-plus-arm-cortex-a53-machine-learning-vision-multimedia-and-industrial-iot:IMX8MPLUS NXP i.MX8M Plus SoC]. This industrial/automotive-grade SoC is built around a 4-core ARM Cortex-A53 CPU and has a rich set of peripherals and systems. It also integrates a 2.3 TOPS Neural Processing Unit (NPU) and native interfaces to connect image sensors, making it well suited for computer vision applications.
The system software is a Yocto Linux distribution derived from the [https://www.nxp.com/design/software/embedded-software/i-mx-software/embedded-linux-for-i-mx-applications-processors:IMXLINUX NXP 5.4.70_2.3.0] BSP. In addition to the default packages, a number of libraries were added to satisfy the application's requirements.
== Application software ==
As stated previously, the main application derives from the IIT Social-Distancing project. It was developed in several steps, starting when only a few alpha samples of the i.MX8M Plus were available, thanks to the fact that DAVE Embedded Systems joined the component's beta program.

=== Step #1 ===
The first step was conducted using the official evaluation kit (EVK) by NXP. The goal was to make the Social-Distancing project work on this platform while maintaining the core functionalities. In essence, the code was modified to replace the [https://github.com/CMU-Perceptual-Computing-Lab/openpose OpenPose library] with [https://github.com/tensorflow/tfjs-models/tree/master/posenet PoseNet]. This was required to cope with the operations actually supported by the [https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ NXP eIQ] software stack and the NPU. For those who are familiar with embedded software development, this should be unsurprising: when porting applications from PC-like platforms to embedded platforms, handling such hardware/software constraints is common practice.
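One common way to keep such a backend swap manageable is to hide the pose estimator behind a small interface, so that the layers above it do not depend on a specific library. The following is only a minimal sketch of that idea and not the project's actual code: <code>Pose</code>, <code>PoseEstimator</code>, <code>FakeNpuPoseEstimator</code>, and <code>count_people</code> are hypothetical names, and the fake backend returns fixed data instead of running a model.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# (x, y, confidence), with coordinates normalized to the 0..1 range (assumption)
Keypoint = Tuple[float, float, float]


@dataclass
class Pose:
    """One detected person, described by a list of keypoints."""
    keypoints: List[Keypoint]


class PoseEstimator(Protocol):
    """Hypothetical interface seen by the high-level layers."""

    def estimate(self, frame) -> List[Pose]:
        ...


class FakeNpuPoseEstimator:
    """Stand-in for an NPU-backed model: returns one fixed pose per frame."""

    def estimate(self, frame) -> List[Pose]:
        return [Pose(keypoints=[(0.5, 0.5, 0.9)])]


def count_people(estimator: PoseEstimator, frame, min_confidence: float = 0.5) -> int:
    """High-level code that only relies on the PoseEstimator interface."""
    poses = estimator.estimate(frame)
    return sum(
        1
        for pose in poses
        if any(conf >= min_confidence for _, _, conf in pose.keypoints)
    )
```

With this structure, replacing the backend means providing another class with the same <code>estimate</code> method, while <code>count_people</code> and similar callers stay unchanged.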
The resulting processing pipeline is shown in the following figure.
 
[[File:Ss-main-pipeline-20210127.png|center|thumb|600x600px|Processing pipeline]]
 
The yellow boxes indicate processing performed by the ARM cores, while the green one refers to the computation carried out by the NPU.
The following screenshots show the application running on the EVK.
[[File:Social-distancing-screenshot1.png|center|thumb|600x600px|The step 1 application running on the EVK (1/2)]]
[[File:Social-distancing-screenshot2.png|center|thumb|600x600px|The step 1 application running on the EVK (2/2)]]

It is worth noting that, even though OpenPose was replaced, the software interface between the high-level layers and PoseNet was not altered, so these layers could be left untouched.

=== Step #2 ===
Step #2 concerned implementing some optimizations in order to increase the overall frame rate. As usual, before implementing any optimization, the code was profiled to identify the portions worth optimizing. In addition to traditional, well-known techniques, specific NPU-related tools were used as well. For instance, the following dump shows a detailed report on the execution of a Convolutional Neural Network (CNN) on the accelerator.

{| class="wikitable"
|+Example of NPU profiling report
!LAYER ID!!LAYER NAME!!OPERATION ID!!OPERATION TYPE!!TARGET!!CYCLES!!READ BW [MByte]!!WRITE BW [MByte]!!AXI READ BW [MByte]!!AXI WRITE BW [MByte]!!DDR READ BW [MByte]!!DDR WRITE BW [MByte]!!TIME [μs]
|-
|0||TensorTranspose||0||TENSOR_TRANS||TP||482613||0.491743||0.445310||0.000000||0.000000||0.491743||0.445310||631
|-
|20||ConvolutionReluPoolingLayer2||0||RESHUFFLE||TP||1822||0.002380||0.000000||0.000000||0.000000||0.002380||0.000000||136
|-
|20||ConvolutionReluPoolingLayer2||0||RESHUFFLE||TP||402743||0.251754||0.000000||0.000000||0.000000||0.251754||0.000000||539
|-
|...||...||...||...||...||...||...||...||...||...||...||...||...
|}

Combining the profiling results with a manual analysis of the code, it was decided to work on the operations performed before the inference. Basically, these tasks were restructured to run in parallel in order to leverage the quad-core ARM Cortex-A53 cluster.
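The general idea of spreading pre-inference work over several cores can be sketched as follows. This is an illustrative example under stated assumptions, not the restructuring actually applied to the project: <code>preprocess</code> and <code>preprocess_parallel</code> are hypothetical names, frames are toy nested lists, and the per-frame step is a simple pixel normalization. Note that in Python a thread pool yields real parallelism only when the work releases the interpreter lock, as native image-processing routines often do; for pure-Python work like this toy step, a process pool would be needed instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

# Toy grayscale frame: rows of 0..255 pixel values (assumption for illustration)
Frame = List[List[int]]


def preprocess(frame: Frame) -> List[List[float]]:
    """Hypothetical pre-inference step: scale pixels to the 0..1 range."""
    return [[px / 255.0 for px in row] for row in frame]


def preprocess_parallel(frames: Sequence[Frame], workers: int = 4) -> List[List[List[float]]]:
    """Preprocess frames on a pool of workers, preserving the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(preprocess, frames))


# Eight identical 2x2 toy frames, processed by four workers (one per core)
frames = [[[0, 255], [51, 102]] for _ in range(8)]
tensors = preprocess_parallel(frames, workers=4)
```

The default of four workers mirrors the four Cortex-A53 cores; the right value in practice depends on what else is competing for the CPU.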
The resulting architecture is depicted in the following figure.
[[File:Ss-main-pipeline-v2-20210204.png|center|thumb|600x600px|Processing pipeline after implementing parallel computations]]

===Step #3===
In this step, the application was migrated to the definitive hardware platform, the aforementioned ORCA SBC, which was designed while the software team was working on the EVK.

==Testing==
The following clip shows the application running on the ORCA SBC.

{| class="wikitable"
| width="100%"| {{#ev:youtube|HAAH2bTVrXM|600|center|Social Distancing application running on ORCA SBC|frame}}
|}

In the example, the system was fed with a 640x360, 25 fps stream. On average, the frame rate of the processed stream is 23 fps.

This screenshot illustrates the CPU load during the execution of the application. As expected, the 4 ARM cores are almost fully loaded because of the parallel computation implemented in the algorithm.
[[File:Social-distancing-htop1.png|center|thumb|600px|CPU load during the execution of the application]]

For convenience, this test was run using an MPEG4 video file as input. The well-known [https://opencv.org/ OpenCV] libraries were used to decompress the video and retrieve the frames. At the time of this writing, these libraries did not support the i.MX8M Plus hardware video decoder. It should therefore be taken into account that video decompression is carried out by the ARM cores as well. Thus, with an uncompressed live stream captured from a camera, further processing headroom is expected for the core computations.
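To put the reported frame rates in perspective, the short calculation below converts them into per-frame times. It uses only the two figures given for this test (25 fps input, about 23 fps processed on average); the variable names are arbitrary.

```python
input_fps = 25.0      # frame rate of the input stream
processed_fps = 23.0  # average frame rate of the processed stream

# Time available per frame to keep up with the input stream
frame_period_ms = 1000.0 / input_fps

# Average time actually spent per processed frame
avg_processing_ms = 1000.0 / processed_fps

# Fraction of the input rate that is sustained on average
throughput_ratio = processed_fps / input_fps

print(f"budget per frame:   {frame_period_ms:.1f} ms")
print(f"average per frame:  {avg_processing_ms:.1f} ms")
print(f"sustained fraction: {throughput_ratio:.0%}")
```

In other words, the application sustains about 92% of the input rate, spending on average roughly 3.5 ms more per frame than the 40 ms available, with software video decoding included in that time.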