Machine learning case study: street sign detection
In this article, I share our experiences with an ongoing image detection project called “Street Sign Detection”. Machine learning projects require a slightly different approach from traditional software development projects, because the core of the product is an effective machine learning model and not necessarily the software itself.
What makes machine learning projects special?
The first thing to note with machine learning projects is that it does not matter how good the software or the model is if the input data for training the model is not adequate. Thus, data quality is the single most important aspect of any successful machine learning project, as a model’s upper boundary in terms of performance is defined by the quality of the data.
Another important question is how to build the software around the model, i.e. which architecture suits it best. Machine learning models are generally computationally expensive and, therefore, require more resources than traditional software.
Deploying a custom deep learning model, for example, on a mobile device might not be the best approach, even if the final product is supposed to be a mobile application. In such a case it would be better to host the model on a server the app can communicate with.
Training an AI to detect street signs
Our goal with the street sign detection project is to detect street signs using a mobile app. The app accesses the device’s camera and sends frames to the server for inference. However, if we send every frame we capture, we quickly run into problems: at 30 frames per second, for example, the model would not have enough time to detect objects in each frame. Because of the model’s inference time, it makes more sense to send only every nth frame to the server. Taking only every nth frame also saves device resources, because every frame is compressed before it is sent to the backend.
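The frame-skipping idea can be sketched in a few lines. The app itself is written in Dart, so the following Python is only an illustration of the logic, not our app code; the helper names `every_nth` and `choose_n` are made up for this example, and the rule for picking n (camera frame rate times inference time, rounded up) is an assumption about how one might choose it.

```python
import math
from typing import Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def every_nth(frames: Iterable[Frame], n: int) -> Iterator[Frame]:
    """Yield the first frame and then every n-th frame after it."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, frame in enumerate(frames):
        if index % n == 0:
            yield frame


def choose_n(camera_fps: float, inference_seconds: float) -> int:
    """Pick n so that frames are sent no faster than the model can process them.

    Assumption for this sketch: the model handles one frame at a time and
    needs roughly `inference_seconds` per frame.
    """
    return max(1, math.ceil(camera_fps * inference_seconds))
```

With a 30 fps camera and a hypothetical inference time of 0.25 seconds, `choose_n(30, 0.25)` returns 8, so only every eighth frame would be compressed and sent.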
The model and the backend are written in Python, and the app is written in Dart using Flutter, a cross-platform development kit. We kept the app simple. It basically has four functionalities:
- connect to the backend
- start camera stream
- stop camera stream
- disconnect from the backend
The backend receives the messages sent from the app in string format. Each message includes the frame as a base64-encoded string, the coordinates matching the location of the frame, and the inference settings for the model.
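To make the message handling more concrete, here is a minimal sketch of how such a string could be decoded on the Python side. The actual wire format is not described in this article, so everything specific here is an assumption: that the string is JSON, that the field names are `frame`, `coordinates` and `settings`, and that the coordinates are a pair of numbers.

```python
import base64
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class FrameMessage:
    image_bytes: bytes                # decoded (still compressed) image data
    coordinates: Tuple[float, float]  # location belonging to the frame
    settings: Dict[str, Any]          # inference settings for the model


def parse_message(raw: str) -> FrameMessage:
    """Decode one message string into its three parts.

    The JSON layout and the field names are hypothetical.
    """
    payload = json.loads(raw)
    # validate=True rejects characters outside the base64 alphabet
    image_bytes = base64.b64decode(payload["frame"], validate=True)
    first, second = payload["coordinates"]
    settings = dict(payload.get("settings", {}))
    return FrameMessage(image_bytes, (float(first), float(second)), settings)
```

The decoded bytes would then be turned into an image array and passed to the model together with the settings.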
The model we use is YOLOv3. It is trained on thousands of images that contain the street signs we are interested in. In the training phase the prediction target is the location of the sign in the image and the class the sign belongs to.
Training with 15 different street signs
We can currently detect 15 different street signs. These 15 signs are our positive classes and all other signs are negatives: we want to detect the signs that belong to one of the 15 classes, and conversely, we do not want to detect signs that do not belong to any of those classes.
Ideally, we should have an equal number of samples per class, but that is not always the case. To account for class imbalance we use class weights, meaning that the majority class has a coefficient of 1 and minority classes have a coefficient greater than 1 in relation to the majority class. During training, the difference between the predicted value and the actual value is multiplied by the respective class weight, which gives the error term of a minority class a higher magnitude.
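As a rough illustration of the weighting described above, the following sketch derives the weights from label counts and applies a weight to a single error term. The function names and the toy labels are made up for this example; in a real YOLOv3 training loop the weights would be applied inside the loss function rather than to one scalar.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight 1 for the majority class, majority_count / count for the others."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    majority = max(counts.values())
    return {cls: majority / count for cls, count in counts.items()}


def weighted_error(predicted: float, actual: float, weight: float) -> float:
    """Scale the difference between prediction and target by the class weight."""
    return weight * (predicted - actual)
```

For a toy dataset with four "stop" samples, two "yield" samples and one "no_entry" sample, the weights come out as 1.0, 2.0 and 4.0, so an error on the rarest class counts four times as much as the same error on the majority class.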
Mitigating class imbalance
Another important thing to keep in mind when training any model is that each sample should be seen by the model at most once per epoch during training. If the same sample is seen more than once, especially more than samples of other classes, then the model could become biased towards that sample and the class it belongs to.
While this bias is inherently present in imbalanced data, it can be mitigated through class weights and/or upsampling minority classes and downsampling the majority class. However, strong imbalance across several classes makes upsampling and downsampling challenging to control.
Another thing to account for during training is to have a roughly equal distribution of samples per class in both the training and validation sets. Within an epoch, there is no need to enforce the class distribution, because each sample is seen at most once. This means that the distribution seen by the model will always be roughly equal to the actual class distribution.
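One common way to get a similar class distribution in both sets is a stratified split, where each class is split separately. The sketch below shows the idea with the standard library only; the function name, the 20 % validation share and the fixed seed are assumptions for the example, not our project settings.

```python
import random
from collections import defaultdict
from typing import Any, Hashable, List, Sequence, Tuple

Pair = Tuple[Any, Hashable]


def stratified_split(
    samples: Sequence[Any],
    labels: Sequence[Hashable],
    validation_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Pair], List[Pair]]:
    """Split (sample, label) pairs so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    train: List[Pair] = []
    validation: List[Pair] = []
    for label, members in by_class.items():
        rng.shuffle(members)
        n_val = round(len(members) * validation_fraction)
        validation.extend((s, label) for s in members[:n_val])
        train.extend((s, label) for s in members[n_val:])

    # Shuffle again so the classes are not grouped together.
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation
```

Because every sample ends up in exactly one of the two lists, no sample is shared between training and validation.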
Image augmentation is an important part of training computer vision models. The idea is to introduce variance into the training data through slight manipulations of the training images, which improves generalization and helps to avoid overfitting (i.e. the model learns the representations of the training samples too well).
Image augmentation is best done on-the-fly, i.e. when preparing each batch for training. We apply random, sample-wise augmentations per batch to guarantee variety in each step through the epoch. Each sample is randomly scaled, rotated, sheared, and translated, and also randomly distorted, before it is added to the batch for training.
By applying these different types of augmentations randomly and on-the-fly, we make it very unlikely that exactly the same sample is seen more than once throughout the training process.
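The geometric part of such an augmentation can be expressed as one random affine matrix per sample. The following is a minimal sketch, not our training code: the parameter ranges are invented, coordinates are assumed to be normalized with the origin at the image center, and the distortion step is left out. For a detection model, the same matrix would also have to be applied to the corners of each bounding box so that the labels still match the image.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def random_affine(
    rng: random.Random,
    max_scale: float = 0.1,
    max_rotation_deg: float = 10.0,
    max_shear: float = 0.1,
    max_shift: float = 0.1,
) -> Matrix:
    """Draw a random scale, rotation, shear and translation as one matrix."""
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    theta = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    sh = rng.uniform(-max_shear, max_shear)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)

    scale = [[s, 0.0, 0.0], [0.0, s, 0.0], [0.0, 0.0, 1.0]]
    rotate = [[math.cos(theta), -math.sin(theta), 0.0],
              [math.sin(theta), math.cos(theta), 0.0],
              [0.0, 0.0, 1.0]]
    shear = [[1.0, sh, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    translate = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

    # Applied to a point from right to left: scale, rotate, shear, translate.
    return matmul(translate, matmul(shear, matmul(rotate, scale)))


def transform_point(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply the affine matrix to one point, e.g. a bounding box corner."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Drawing a fresh matrix for every sample in every batch is what makes two identical training inputs so unlikely.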
Another desired outcome of data augmentation is that it increases the size of the dataset by creating artificial samples. That is especially useful if the actual dataset is not large enough: the smaller the dataset is, the less variance there is between the samples.
Batch preparation is done on the CPU. Then, the ready batch is sent to the GPU. To reduce training time, the CPU prepares the next batches while the GPU is busy training on the current one, which removes a bottleneck in the pipeline.
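The overlap between batch preparation on the CPU and training on the GPU is essentially a producer and consumer pattern. The sketch below shows that pattern with a background thread and a bounded queue; the name `prefetch` and the buffer size are assumptions for the example, and deep learning frameworks usually ship their own data loaders that do this with worker processes.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

Batch = TypeVar("Batch")

_DONE = object()  # marker telling the consumer that no more batches follow


def prefetch(batches: Iterable[Batch], buffer_size: int = 2) -> Iterator[Batch]:
    """Prepare batches in a background thread while the caller consumes them."""
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)      # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()        # blocks until a batch is ready
        if item is _DONE:
            return
        yield item
```

A training loop would then iterate over `prefetch(batch_generator)` instead of the generator itself, so that the next batch is already waiting when the current training step finishes.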
The Street Sign Detection project has been an interesting and fun challenge. There are many more aspects to it that weren’t discussed in this article, such as model tuning and ensuring that the solution architecture is scalable. While we are currently working on detecting 15 different classes, there are hundreds of unique street signs in Germany.
Machine learning with P&M
At P&M we have different ongoing projects involving machine learning (or “artificial intelligence”). We are always happy to share our insights, so if you have any questions, feel free to contact us. We are happy to hear from you!