{{InfoBoxTop}}
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{InfoBoxBottom}}
==Building the application==
The starting point for the application is the model, in the form of a TensorFlow protobuf file (.pb), described [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|here]]. Incidentally, the '''same''' model structure was used as the starting point for [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2|this other test]] as well (*). This makes the comparison of the two tests straightforward, even though they were run on SoCs that differ significantly from the architectural standpoint.
 
 
(*) The two models share the same structure but, as they are trained independently, their weights differ.
===Training the model===
Model training is performed with the help of the Docker container provided by Vitis AI.
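The training code itself is not reported here. As a rough illustration of the last step of a typical TensorFlow 1.x flow run inside the container, the following sketch shows how a trained graph can be frozen into the protobuf (.pb) file that constitutes the starting point mentioned above. The tiny stand-in network, node names, and paths are hypothetical placeholders, not the actual fruit-classifier implementation.
<pre>
# Minimal sketch: freeze a TF 1.x graph into the .pb file consumed by the Vitis AI tools.
# The stand-in network, node names and paths are hypothetical.
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Stand-in for the trained classifier graph.
x = tf.compat.v1.placeholder(tf.float32, [None, 100, 100, 3], name="input_1")
flat = tf.compat.v1.layers.flatten(x)
logits = tf.compat.v1.layers.dense(flat, 6)
probs = tf.nn.softmax(logits, name="softmax_out")

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # ... the actual training loop would run here ...

    # Replace variables with constants so that the graph is self-contained.
    frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["softmax_out"]
    )

# Serialize the frozen graph to the protobuf (.pb) file.
tf.io.write_graph(frozen, "./build", "fruit_classifier.pb", as_text=False)
</pre>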
Within the scope of this TN, the most relevant figure is ''DPU tot time'', which indicates the time spent executing the inference (~3.7 ms). This leads to a throughput of about 271 fps (the reciprocal of the per-frame inference time).
====Fine-grained profiling using the DNNDK low-level API====
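As a rough illustration of how such per-inference figures can be collected, the sketch below times a single DPU task using the DNNDK low-level (N2Cube) Python binding shipped with Vitis AI. The kernel name and the omitted input handling are hypothetical placeholders; the DNNDK profiler can additionally report per-node timings, for which the DNNDK user guide should be consulted.
<pre>
# Minimal sketch: time one inference with the DNNDK N2Cube low-level API.
# The kernel name is a hypothetical placeholder produced by the DNNC compiler.
import time
from dnndk import n2cube

n2cube.dpuOpen()                                      # attach to the DPU driver
kernel = n2cube.dpuLoadKernel("fruit_classifier_0")   # load the compiled kernel
task = n2cube.dpuCreateTask(kernel, 0)

# ... fill the task input tensor with the preprocessed image here ...

start = time.perf_counter()
n2cube.dpuRunTask(task)                               # run the inference on the DPU
elapsed_ms = (time.perf_counter() - start) * 1000.0
print("DPU task time: %.2f ms (%.0f fps)" % (elapsed_ms, 1000.0 / elapsed_ms))

n2cube.dpuDestroyTask(task)
n2cube.dpuDestroyKernel(kernel)
n2cube.dpuClose()
</pre>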
Interestingly, having four threads (i.e. the same number as the CPU cores) allows the throughput to be further increased by a factor of almost 2, while keeping the occupation of the DPU cores low. It should not be forgotten, in fact, that part of the algorithm makes use of the CPU computational power as well. A sketch of this multi-threaded pattern is shown below.
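In the sketch, each worker thread owns its own runner and repeatedly submits inference jobs to the DPU. The xmodel path, the dummy input data, and the number of runs are hypothetical placeholders; the structure follows the VART Python examples shipped with Vitis AI.
<pre>
# Minimal sketch: N worker threads, each with its own VART runner.
# The xmodel path and the dummy input data are hypothetical placeholders.
import threading
import numpy as np
import xir
import vart

def get_dpu_subgraph(xmodel_path):
    graph = xir.Graph.deserialize(xmodel_path)
    subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
    # Keep the subgraph mapped to the DPU.
    return [s for s in subgraphs
            if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

def worker(subgraph, n_runs):
    runner = vart.Runner.create_runner(subgraph, "run")
    in_dims = tuple(runner.get_input_tensors()[0].dims)
    out_dims = tuple(runner.get_output_tensors()[0].dims)
    in_buf = [np.zeros(in_dims, dtype=np.int8)]    # dummy quantized input
    out_buf = [np.zeros(out_dims, dtype=np.int8)]
    for _ in range(n_runs):
        job_id = runner.execute_async(in_buf, out_buf)
        runner.wait(job_id)

subgraph = get_dpu_subgraph("fruit_classifier.xmodel")
threads = [threading.Thread(target=worker, args=(subgraph, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</pre>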
=====Six threads=====
==Results==
In the following table, the throughputs achieved by the different versions of the tested application are summarized.
{| class="wikitable"
!API
!Number of threads
!Throughput [fps]
|-
|DNNDK
|1
|271
|-
| rowspan="4" |VART
|1
|245
|-
|2
|442
|-
|4
|818
|-
|6
|830
|}
It is worth mentioning that:
*When the number of threads is greater than 1, the latency of DPU_0 is higher than the latency of DPU_1, although they are equivalent in terms of hardware configuration. To date, this fact is still unexplained.
*Increasing the number of threads of the VART-based application beyond 6 does not increase the achieved throughput any further.