This is an introduction to「PicoDet」, a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK as well as many other ready-to-use ailia MODELS.
PicoDet is a machine learning model released in November 2021. It integrates recent research results on object detection models into a lightweight model for mobile CPUs to achieve high accuracy and high speed object detection.
PicoDet speeds up feature extraction by using a lightweight structure as the backbone. It also makes training more stable and efficient through an improved loss function.
Anchor-free detectors have become increasingly popular in object detection in recent years, and Fully Convolutional One-Stage Object Detection (FCOS) solves the problem of overlapping ground-truth labels. Whereas a typical anchor-based detector places multiple anchor boxes at each coordinate, FCOS uses a single center point for each coordinate, so an anchor-free approach based on FCOS has the advantage of not requiring anchor-related hyperparameter tuning.
However, common anchor-free detectors are used in models for server-side processing, which are relatively large in size. Anchor-free models for mobile applications are limited to NanoDet and YOLOX-Nano.
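As a rough illustration of the anchor-free idea, the sketch below computes the kind of regression target FCOS uses: for a location inside a ground-truth box, the distances to the left, top, right, and bottom edges. The function names and the example coordinates are made up for this illustration and are not taken from the PicoDet code.

```python
def fcos_regression_target(point, box):
    """Distances (left, top, right, bottom) from a location to the box edges.

    point: (x, y) location on the image
    box:   (x1, y1, x2, y2) ground-truth box
    """
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


def is_inside(point, box):
    """A location is a positive sample only if it lies strictly inside the box."""
    return min(fcos_regression_target(point, box)) > 0


# Hypothetical example values, chosen only for illustration.
box = (20, 10, 100, 90)
print(fcos_regression_target((50, 40), box))
print(is_inside((50, 40), box), is_inside((10, 10), box))
```

Because each location predicts these four distances directly, there are no anchor sizes or aspect ratios to tune.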
It is hard for lightweight anchor-free detectors to balance accuracy and efficiency. PicoDet is a new attempt inspired by FCOS and Generalized Focal Loss (GFL).
CSP (Cross Stage Partial) is an evolution of the “skip connection” mechanism used, for example, in ResNet. It facilitates backpropagation and reduces the amount of computation by splitting off part of the previous stage's feature map and concatenating it back without passing it through convolutions. In PicoDet, the 3×3 depthwise convolution is extended to a 5×5 depthwise convolution to enlarge the receptive field.
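To get a feel for why enlarging a depthwise kernel is relatively cheap, the sketch below counts weights for a depthwise convolution and for a 1×1 pointwise convolution. The channel count of 96 is an arbitrary value picked for this example, not a figure from PicoDet, and biases are ignored.

```python
def depthwise_params(channels, kernel):
    """Weights in a depthwise conv: one kernel x kernel filter per channel."""
    return channels * kernel * kernel


def pointwise_params(c_in, c_out):
    """Weights in a 1x1 convolution mixing c_in channels into c_out channels."""
    return c_in * c_out


channels = 96  # hypothetical channel count
print("3x3 depthwise:", depthwise_params(channels, 3))
print("5x5 depthwise:", depthwise_params(channels, 5))
print("1x1 pointwise:", pointwise_params(channels, channels))
```

Under these assumptions the 5×5 depthwise layer still has far fewer weights than the 1×1 pointwise layer next to it, while each output pixel sees a 5×5 neighborhood instead of 3×3.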
To improve the label assignment strategy, SimOTA is employed, and Varifocal Loss (VFL) and GIoU loss are employed for the loss function.
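For reference, here is a minimal sketch of the GIoU loss for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format. It is a plain Python illustration that assumes both boxes have positive area; it is not the implementation used to train PicoDet.

```python
def giou_loss(pred, gt):
    """GIoU loss (1 - GIoU) between two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)

    # Intersection (zero if the boxes do not overlap).
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = area_p + area_g - inter
    iou = inter / union

    # Smallest box enclosing both boxes.
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))

    giou = iou - (enclose - union) / enclose
    return 1.0 - giou


print(giou_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike plain IoU, the loss keeps changing as disjoint boxes move apart, so non-overlapping predictions still receive a useful gradient.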
SimOTA is also used in YOLOX. When determining the mapping between the predicted bounding box and the ground-truth bounding box to calculate the loss, instead of assigning the closest ground-truth, the method solves an optimization problem to assign a more appropriate ground-truth. SimOTA is a faster version of OTA (Optimal Transport Assignment).
In this object detection model, learning is performed by assigning a ground-truth bounding box to each bounding box predicted by the HEAD and back-propagating the loss.
Ground-truth bounding boxes sometimes overlap each other. If a predicted bounding box falls within the overlapping area, it is called an “ambiguous anchor” because it is unclear which ground-truth to assign to it. OTA is designed so that ground-truth bounding boxes are less likely to be assigned to ambiguous anchors.
Below is an example of the result of assignment by OTA. The dots in the images are the center points of the predicted bounding boxes, and the red ovals mark where the difference between assignment strategies can be seen.
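The cost-based assignment idea described above can be sketched in a heavily simplified form. The code below is an assumption-laden illustration in the spirit of SimOTA, not the YOLOX or PicoDet implementation: the cost uses only an IoU term (the classification cost and the center prior are omitted), each ground truth takes its k lowest-cost predictions with k derived from the sum of its top IoUs, and a prediction claimed by several ground truths keeps the cheapest one. The function names and `topk` default are chosen for this example.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign(preds, gts, topk=10, eps=1e-8):
    """For each prediction, return the index of its ground truth, or -1."""
    ious = [[iou(p, g) for p in preds] for g in gts]
    cost = [[-math.log(v + eps) for v in row] for row in ious]

    candidates = {}  # prediction index -> list of (cost, ground-truth index)
    for g, row in enumerate(ious):
        # Dynamic k: sum of the top IoUs for this ground truth, at least 1.
        k = max(1, int(sum(sorted(row, reverse=True)[:topk])))
        best = sorted(range(len(preds)), key=lambda p: cost[g][p])[:k]
        for p in best:
            candidates.setdefault(p, []).append((cost[g][p], g))

    assignment = [-1] * len(preds)
    for p, options in candidates.items():
        # An ambiguous prediction keeps only its lowest-cost ground truth.
        assignment[p] = min(options)[1]
    return assignment


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (19, 19, 30, 30), (50, 50, 60, 60)]
print(assign(preds, gts))
```

In this toy input the third prediction overlaps no ground truth and is left unassigned, which corresponds to treating it as background.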
The backbone for feature extraction in PicoDet is Enhanced ShuffleNet, an improved version of ShuffleNetV2, an efficient model architecture for mobile. ShuffleNet introduces “pointwise group convolutions” and “channel shuffle” operations to speed up the 1×1 convolution, the bottleneck of MobileNet.
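The channel shuffle operation mentioned above can be illustrated on a plain list of channel labels. This is a conceptual sketch; real implementations do the same thing with a reshape and transpose on the channel axis of a tensor.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, n // groups), transposing,
    and flattening back to a single list.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six channels split into two groups: [0, 1, 2] and [3, 4, 5].
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```

After the shuffle, every group in the next grouped convolution receives channels that originated from all of the previous groups, which is how information flows between groups.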
One-Shot Neural Architecture Search (NAS) was introduced to search for the optimal number of channels for each layer. The search results showed that making the number of channels a multiple of 8 contributed the most to improving inference speed.
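The NAS procedure itself is beyond a short example, but the "multiple of 8" constraint is easy to picture. The helper below is a hypothetical utility, not part of PicoDet or its search code, that snaps a candidate channel count to the nearest multiple of 8.

```python
def round_to_multiple(channels, divisor=8):
    """Round a channel count to the nearest multiple of `divisor` (at least one divisor)."""
    return max(divisor, int(channels + divisor / 2) // divisor * divisor)


# Hypothetical candidate widths, for illustration only.
for c in (3, 90, 92, 116):
    print(c, "->", round_to_multiple(c))
```

A search space can apply such a helper to every sampled width so that only hardware-friendly channel counts are ever evaluated.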
Here are performance figures measured on the CPU of Qualcomm’s Snapdragon 865. YOLOX-Tiny requires 32.77 ms at mAP 32.8 when run on the CPU using NCNN, whereas PicoDet requires 12.37 ms at mAP 30.6. An mAP of 34.3 can be achieved in 17.39 ms.
This speed advantage shows up when running inference on CPU, whereas both models perform identically on GPU. As a result, YOLOX-Tiny might be better suited to Jetson devices, whereas PicoDet might be the better choice for devices such as the Raspberry Pi.
PicoDet can be used with ailia SDK by running the following command to detect objects in a webcam video stream.
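The CPU figures quoted above can be turned into a quick trade-off calculation. The snippet only reuses the two latency and mAP pairs given in the text; the dictionary layout and function name are chosen for this example.

```python
# Figures quoted in the text (Snapdragon 865 CPU, NCNN).
yolox_tiny = {"latency_ms": 32.77, "map": 32.8}
picodet = {"latency_ms": 12.37, "map": 30.6}


def compare(baseline, candidate):
    """Return (speedup factor, mAP given up) of candidate relative to baseline."""
    speedup = baseline["latency_ms"] / candidate["latency_ms"]
    map_gap = baseline["map"] - candidate["map"]
    return speedup, map_gap


speedup, map_gap = compare(yolox_tiny, picodet)
print(f"speedup: {speedup:.2f}x, mAP gap: {map_gap:.1f}")
```

With these numbers, PicoDet runs more than twice as fast on this CPU while giving up a little over two points of mAP.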
$ python3 picodet.py -v 0