In this case study, we walk through the process of fine-tuning a pre-trained YOLOv7 model so that it achieves higher accuracy on specific object classes.
Before walking through the steps and parameters in detail, let's talk about YOLOv7 and its characteristics.
YOLO stands for “You Only Look Once”; it is a popular family of real-time object detection algorithms. The original YOLO object detector was first released in 2016. It was created by Joseph Redmon, Ali Farhadi, and Santosh Divvala. At release, this architecture was much faster than other object detectors and became state-of-the-art for real-time Computer Vision applications.
YOLO (You Only Look Once) has gained popularity in the field of object detection due to several key factors. Chief among them is its speed: the network looks at the whole image in a single forward pass, making it ideal for real-time applications. Additionally, YOLO achieves higher mean Average Precision (mAP) than other real-time systems, further enhancing its appeal.
Another reason for YOLO's popularity is its high detection accuracy. It outperforms other state-of-the-art models with minimal background errors, making it reliable for object detection tasks.
YOLO also demonstrates good generalization capabilities, especially in its newer versions. It exhibits better generalization for new domains, making it suitable for applications that require fast and robust object detection. For example, studies comparing different versions of YOLO have shown improvements in mean average precision for specific tasks like the automatic detection of melanoma disease.
Furthermore, YOLO's open-source nature has contributed to its success. The community's continuous improvements and contributions have helped refine the model over time.
YOLO's outstanding combination of speed, accuracy, generalization, and open-source nature has positioned it as the leading choice for object detection in the tech community. Its impact in the field of real-time Computer Vision cannot be overstated.
The YOLO architecture shares similarities with GoogLeNet, featuring convolutional layers, max-pooling layers, and fully connected layers.
The architecture follows a streamlined approach to object detection and works as follows:
- Starts by resizing the input image to a fixed size, typically 448x448 pixels.
- This resized image is then passed through a series of convolutional layers, which extract features and capture spatial information.
- The YOLO architecture employs a 1x1 convolution followed by a 3x3 convolution to reduce the number of channels and generate a cuboidal output.
- The leaky Rectified Linear Unit (Leaky ReLU) activation function is used throughout the network, except for the final layer, which utilizes a linear activation function.
To improve the model's performance and prevent overfitting, techniques such as batch normalization and dropout are employed. Batch normalization normalizes the output of each layer, making the training process more stable. Dropout randomly ignores a portion of the neurons during training, which helps prevent the network from relying too heavily on specific features.
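To make these building blocks concrete, here is a minimal PyTorch-style sketch (not the actual YOLO or YOLOv7 source code) of a convolutional block chaining a 1x1 channel-reduction convolution, a 3x3 convolution, batch normalization, and a leaky ReLU activation. The channel counts and input size are illustrative only.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative 1x1 -> 3x3 convolutional block with batch normalization."""
    def __init__(self, in_channels: int, mid_channels: int, out_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv = nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The 1x1 conv reduces the channel count, the 3x3 conv captures spatial information
        return self.act(self.bn(self.conv(self.reduce(x))))

# A 448x448 RGB input, the size mentioned above
x = torch.randn(1, 3, 448, 448)
print(ConvBlock(3, 16, 32)(x).shape)  # -> torch.Size([1, 32, 448, 448])
```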
In terms of how YOLO performs object detection, it follows a four-step approach:
1. First, the image is divided into an SxS grid of cells, each responsible for localizing and predicting the object's class and confidence values.
2. Next, bounding box regression is used to determine the rectangles highlighting the objects in the image. The attributes of these bounding boxes are represented by a vector containing probability scores, coordinates, and dimensions.
3. Intersection over Union (IoU) is then employed to select relevant grid cells based on a user-defined threshold.
4. Finally, Non-Max Suppression (NMS) is applied to retain only the boxes with the highest probability scores, filtering out potential noise.
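To make steps 3 and 4 concrete, here is a minimal NumPy sketch of IoU computation and greedy Non-Max Suppression. The (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions for the example, not values fixed by YOLO.

```python
import numpy as np

def box_area(b):
    """Area of boxes in (x1, y1, x2, y2) format."""
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (box_area(box) + box_area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Max Suppression: keep the highest-scoring box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Keep only the remaining boxes that do not overlap the chosen one too much
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```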
Compared to its predecessors, YOLOv7 introduces several architectural reforms that contribute to improved performance. These include:
- Model scaling for concatenation-based models, which allows the architecture to be scaled to meet the needs of different inference speeds.
- E-ELAN (Extended Efficient Layer Aggregation Network), which allows the model to learn more diverse features for better learning.
- Planned re-parameterized convolution.
- Coarse-to-fine label assignment: coarse labels for the auxiliary head loss and fine labels for the lead head loss.
YOLOv7 introduces a notable improvement in resolution compared to its predecessors. It operates at a higher image resolution of 608 by 608 pixels, surpassing the 416 by 416 resolution employed in YOLOv3. By adopting this higher resolution, YOLOv7 becomes capable of detecting smaller objects more effectively, thereby enhancing its overall accuracy.
These enhancements result in a 13.7% higher Average Precision (AP) on the COCO dataset compared to YOLOv6.
The YOLOv7 model comes in six versions with varying parameter counts and FPS (frames per second) performance.
The Ikomia API serves as a game-changer, streamlining the development of Computer Vision workflows and enabling effortless experimentation with various parameters to unlock remarkable results.
With Ikomia API, we can train a custom YOLOv7 model with just a few lines of code. To get started, you need to install the API in a virtual environment.
How to install a virtual environment
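Assuming Python 3 and a Unix-like shell, the setup can look like this (the environment name is arbitrary; on Windows, activate with ikomia_env\Scripts\activate):

```bash
python -m venv ikomia_env        # create an isolated virtual environment
source ikomia_env/bin/activate   # activate it
pip install ikomia               # install the Ikomia API
```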
In this tutorial, we will use the aerial airport dataset from Roboflow. You can download this dataset by following this link: Dataset Download Link.
You can also load the open-source notebook we have prepared directly.
The training process for 10 epochs was completed in approximately 14 minutes using an NVIDIA GeForce RTX 3060 Laptop GPU with 6143.5 MB of VRAM.
With the dataset of aerial images that you downloaded, you can train a custom YOLOv7 model using the Ikomia API.
We initialize a workflow instance. The “wf” object can then be used to add tasks to the workflow instance, configure their parameters, and run them on input data.
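In code, this first step looks as follows (the import path is the one from the Ikomia API documentation):

```python
from ikomia.dataprocess.workflow import Workflow

# Create an empty workflow; tasks will be added to it below
wf = Workflow()
```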
The downloaded dataset is in YOLO format, which means that for each image in each folder (test, val, train), there is a corresponding .txt file containing all bounding box and class information associated with airplanes. Additionally, there is a _darknet.labels file containing all class names. We will use the dataset_yolo module provided by Ikomia API to load the custom data and annotations.
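Here is a sketch of that loading step. The folder paths are placeholders for wherever you extracted the dataset, and the parameter names (dataset_folder, class_file) are the ones used by the dataset_yolo algorithm at the time of writing, so check its Ikomia HUB page if they have changed.

```python
# Load the YOLO-format dataset (images, .txt annotations and the _darknet.labels file)
dataset = wf.add_task(name="dataset_yolo")
dataset.set_parameters({
    "dataset_folder": "path/to/aerial-airport/train",
    "class_file": "path/to/aerial-airport/train/_darknet.labels",
})
```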
We add a train_yolo_v7 task to train our custom YOLOv7 model. We also specify a few training parameters, such as the number of epochs and the output folder.
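A sketch of this step is shown below. The batch size and the exact parameter names follow the train_yolo_v7 algorithm as listed on Ikomia HUB at the time of writing, and the values are examples to adapt to your hardware and paths.

```python
# Add the YOLOv7 training task and connect it to the dataset loader
train = wf.add_task(name="train_yolo_v7", auto_connect=True)
train.set_parameters({
    "batch_size": "4",                        # example value, adjust to your GPU memory
    "epochs": "10",
    "output_folder": "path/to/output_folder",  # where the trained weights will be saved
})
```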
The “auto_connect=True” argument ensures that the output of the dataset_yolo task is automatically connected to the input of the train_yolo_v7 task.
Finally, we run the workflow to start the training process.
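With every task connected, starting the training comes down to a single call:

```python
# Start training; progress is printed to the console
wf.run()
```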
You can monitor the progress of your training using tools like TensorBoard or MLflow.
Once the training is complete, the train_yolo_v7 task will save the best model in a folder named with a timestamp inside the output_folder. You can find your best.pt model in the weights folder of the timestamped folder.
First, we can run an aerial image on the pre-trained YOLOv7 model:
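Here is a minimal sketch of that test; the image path is a placeholder, and the display helper comes from the Ikomia API's displayIO utilities.

```python
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Fresh workflow using the default COCO-pretrained YOLOv7 model
wf = Workflow()
yolo = wf.add_task(name="infer_yolo_v7", auto_connect=True)

# Run on one of the aerial test images (placeholder path)
wf.run_on(path="path/to/aerial-airport/test/sample_image.jpg")

# Show the image with the predicted boxes drawn on top
display(yolo.get_image_with_graphics())
```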
We can observe that the infer_yolo_v7 default pre-trained model doesn’t detect any plane. This is because the model has been trained on the COCO dataset, which does not contain aerial images of airports. As a result, the model lacks knowledge of what an airplane looks like from above.
To test the model we just trained, we specify the path to our custom model using the ’model_weight_file’ argument. We then run the workflow on the same image we used previously.
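Below is a sketch of that step, assuming the timestamped output folder produced by the training run (the exact folder name will differ on your machine):

```python
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

wf = Workflow()
yolo = wf.add_task(name="infer_yolo_v7", auto_connect=True)

# Point the inference task at the fine-tuned weights instead of the default COCO model
yolo.set_parameters({
    "model_weight_file": "path/to/output_folder/[timestamp]/weights/best.pt",
})

# Run on the same aerial image as before and display the detections
wf.run_on(path="path/to/aerial-airport/test/sample_image.jpg")
display(yolo.get_image_with_graphics())
```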
For a deeper understanding of the API's capabilities, we recommend referring to the documentation. Additionally, don't miss the opportunity to explore the impressive roster of advanced algorithms available on Ikomia HUB, and take a spin with Ikomia STUDIO, a user-friendly interface that mirrors the API's features.