Sat. May 25th, 2024

YOLO v8: The Most Powerful Object Detection Algorithm

YOLO — You only look once, real time object detection explained

Introducing Ultralytics YOLOv8, the latest version of the acclaimed real-time object detection and image segmentation model. YOLOv8 is built on cutting-edge advancements in deep learning and computer vision, offering unparalleled performance in terms of speed and accuracy. Its streamlined design makes it suitable for various applications and easily adaptable to different hardware platforms, from edge devices to cloud APIs.

Explore the YOLOv8 Docs, a comprehensive resource designed to help you understand and utilize its features and capabilities. Whether you are a seasoned machine learning practitioner or new to the field, this hub aims to maximize YOLOv8’s potential in your projects.

Where to Start

  • Install ultralytics with pip and get up and running in minutes
  • Predict new images and videos with YOLOv8
  • Train a new YOLOv8 model on your own custom dataset

YOLO: A Brief History

YOLO (You Only Look Once), a popular object detection and image segmentation model, was developed by Joseph Redmon and Ali Farhadi at the University of Washington. Launched in 2015, YOLO quickly gained popularity for its high speed and accuracy.

  • YOLOv2, released in 2016, improved the original model by incorporating batch normalization, anchor boxes, and dimension clusters.
  • YOLOv3, launched in 2018, further enhanced the model’s performance using a more efficient backbone network, multiple anchors and spatial pyramid pooling.
  • YOLOv4 was released in 2020, introducing innovations like Mosaic data augmentation and a new loss function.
  • YOLOv5 further improved the model’s performance and added new features such as hyperparameter optimization, integrated experiment tracking and automatic export to popular export formats.
  • YOLOv6 was open-sourced by Meituan in 2022 and is in use in many of the company’s autonomous delivery robots.
  • YOLOv7 added additional tasks such as pose estimation on the COCO keypoints dataset.
  • YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions, introducing new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8’s capabilities across diverse applications and domains.

YOLO Licenses: How is Ultralytics YOLO licensed?

Ultralytics offers two licensing options to accommodate diverse use cases:

  • AGPL-3.0 License: This OSI-approved open-source license is ideal for students and enthusiasts, promoting open collaboration and knowledge sharing. See the LICENSE file for more details.
  • Enterprise License: Designed for commercial use, this license permits seamless integration of Ultralytics software and AI models into commercial goods and services, bypassing the open-source requirements of AGPL-3.0. If your scenario involves embedding our solutions into a commercial offering, reach out through Ultralytics Licensing.

Our licensing strategy is designed to ensure that any improvements to our open-source projects are returned to the community. We hold the principles of open source close to our hearts ❤️, and our mission is to guarantee that our contributions can be utilized and expanded upon in ways that are beneficial to all.


What is YOLOv8?

YOLOv8 is a new state-of-the-art computer vision model built by Ultralytics, the creators of YOLOv5. The YOLOv8 model contains out-of-the-box support for object detection, classification, and segmentation tasks, accessible through a Python package as well as a command line interface.

How do you install YOLOv8?

To install YOLOv8, run “pip install ultralytics”.

This command will install the latest version of the YOLOv8 library. You can then use the model with the “yolo” command line program or by importing the model into your script using the following python code.
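Once the install has finished, it can be handy to confirm from Python that the package is actually importable. A minimal sketch using only the standard library; the helper name is_installed is ours, not part of Ultralytics:

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in the current environment."""
    return find_spec(package) is not None


if __name__ == "__main__":
    if is_installed("ultralytics"):
        print("ultralytics is available")
    else:
        print("ultralytics not found; try: pip install ultralytics")
```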

How do you use YOLOv8?

You can use the YOLOv8 model in your Python code or via the model CLI.

Use the CLI

To run inference with the YOLOv8 CLI, use this command:
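The command itself did not survive on this page. The Ultralytics CLI takes key=value arguments, so a prediction run with the YOLOv8s weights would look roughly like “yolo task=detect mode=predict model=yolov8s.pt source=<image>”. The sketch below only assembles that command from Python; the helper name and the image path are placeholders, and actually running it assumes the “yolo” program is on your PATH.

```python
import shlex
from typing import List


def build_predict_command(source: str, model: str = "yolov8s.pt") -> List[str]:
    """Assemble the argument list for a YOLOv8 detection run on `source`."""
    return ["yolo", "task=detect", "mode=predict", f"model={model}", f"source={source}"]


cmd = build_predict_command("path/to/image.jpg")  # placeholder path
print(shlex.join(cmd))
# To execute it for real: subprocess.run(cmd, check=True)
```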

Add the source to the image on which you want to run inference. This will use the default YOLOv8s model weights to make a prediction. To learn more about training a custom model on YOLOv8, keep reading!

Use the Python Package

To use the Python package, first import “ultralytics” into your code. Then, you can use the package to load, train, and use a model.

Load a Custom Model

To load a custom model into your project, use the following code:
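The code block is missing from this page. As a rough sketch of what the description implies, assuming the ultralytics package is installed: YOLO("yolov8n.pt") loads the pretrained nano weights and model.train(...) continues training from them. The dataset file coco128.yaml (a small COCO subset) is our assumption, and the two helper functions are ours; the original snippet may have looked different.

```python
def default_train_args(data="coco128.yaml", epochs=100):
    """Training arguments matching the description: a COCO-style dataset, 100 epochs."""
    return {"data": data, "epochs": epochs}


def train_pretrained(weights="yolov8n.pt", **overrides):
    """Load pretrained weights and train; needs the ultralytics package."""
    from ultralytics import YOLO  # imported lazily so this file loads without it

    model = YOLO(weights)
    model.train(**{**default_train_args(), **overrides})
    return model


if __name__ == "__main__":
    print(default_train_args())
    # train_pretrained()  # uncomment to start training (downloads weights and data)
```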

This code loads the default YOLOv8n model weights and trains a model using the COCO dataset for 100 epochs. You may want to run this code in a Google Colab so that you can keep your trained model in memory for experimentation.

You can replace the “yolov8n” text with the name of the model you want to use. You can learn more about the different model sizes available in the Ultralytics YOLOv8 GitHub repository.

Create a New Model (Advanced)

While loading a model using the default YOLOv8n weights is recommended, you can train a new model from scratch using the Python package too. To do so, provide a YOLOv5 PyTorch TXT file that contains information about the dataset on which you want to train your model:
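The dataset description meant here is, as far as we can tell, the small data.yaml that accompanies a “YOLOv5 PyTorch TXT” export: paths to the image folders, the number of classes, and the class names. A minimal sketch that produces such a file; the folder names and the single license-plate class are hypothetical placeholders.

```python
def dataset_yaml(train_dir, val_dir, class_names):
    """Build the text of a YOLO-style dataset description file."""
    names = ", ".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )


text = dataset_yaml("train/images", "valid/images", ["license-plate"])
print(text)
# To save it: open("data.yaml", "w").write(text)
```

The saved file is what would be passed as the data argument when training. To start from untrained weights, the model would be built from an architecture file such as yolov8n.yaml rather than from a .pt weights file.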

What are the main features in YOLOv8?

YOLOv8 comes with both architectural and developer experience improvements.

Compared to YOLOv8’s predecessor, YOLOv5, YOLOv8 comes with:

1. A new anchor-free detection system.
2. Changes to the convolutional blocks used in the model.
3. Mosaic augmentation applied during training, turned off before the last 10 epochs.
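To make the third point concrete, here is a small sketch of such a schedule. It is an illustration of the idea, not Ultralytics’ own code; the function name is ours, and epochs are assumed to be counted from zero.

```python
def mosaic_enabled(epoch: int, total_epochs: int, close_last: int = 10) -> bool:
    """True while mosaic augmentation should still be applied (0-indexed epochs)."""
    return epoch < total_epochs - close_last


schedule = [mosaic_enabled(epoch, total_epochs=100) for epoch in range(100)]
print(sum(schedule), "epochs with mosaic,", schedule.count(False), "without")
```

In the Ultralytics trainer the corresponding setting is, to our knowledge, the close_mosaic argument.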

Furthermore, YOLOv8 comes with changes to improve developer experience with the model. First, the model now comes packaged as a library you can install in your Python code. A quick “pip install ultralytics” will give you the

Who created YOLOv8?

YOLOv8 was built by Ultralytics. The code for YOLOv8 is open source and, as described in the licensing section above, released under the AGPL-3.0 license.

Object Detection in 2023: The Definitive Guide


  • What object detection is and how it has evolved over the past 20 years
  • Types of computer vision object detection methods
  • We list examples, use cases, and object detection applications
  • The most popular object detection algorithms today
  • New object recognition algorithms
About: We provide the end-to-end computer vision platform Viso Suite. The platform enables teams to build and deliver all their real-world computer vision applications in one place. Get the whitepaper and a demo for your company.

Viso Suite is an all-in-one workspace for teams to deliver AI vision applications faster and without overhead.
What is Object Detection?
Object detection is an important computer vision task used to detect instances of visual objects of certain classes (for example, humans, animals, cars, or buildings) in digital images such as photos or video frames. The goal of object detection is to develop computational models that provide the most fundamental information needed by computer vision applications: “What objects are where?”.

Object Detection is a basic Computer Vision task to detect and localize objects in images and video. – Built on Viso Suite
Person Detection
Person detection is a variant of object detection used to detect a primary class “person” in images or video frames. Detecting people in video streams is an important task in modern video surveillance systems. The recent deep learning algorithms provide robust person detection results. Most modern person detector techniques are trained on frontal and asymmetric views.
However, deep learning models such as YOLO that are trained for person detection on a frontal view data set still provide good results when applied for overhead view person counting (TPR of 95%, FPR up to 0.2%). See how companies use Viso Suite to build a custom people counting solution with deep learning for video analysis.
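For readers unfamiliar with the two rates quoted above: the true positive rate (TPR) is the share of real people that were detected, and the false positive rate (FPR) is the share of non-person cases wrongly flagged as a person. A short sketch with made-up counts, chosen only so that the arithmetic reproduces 95% and 0.2%:

```python
def detection_rates(tp: int, fn: int, fp: int, tn: int):
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Hypothetical counts, not taken from any real experiment.
tpr, fpr = detection_rates(tp=95, fn=5, fp=2, tn=998)
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")
```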

Real-time person detection in manufacturing production lines
Why is Object Detection important?
Object detection is one of the fundamental problems of computer vision. It forms the basis of many other downstream computer vision tasks, for example, instance and image segmentation, image captioning, object tracking, and more. Specific object detection applications include pedestrian detection, animal detection, vehicle detection, people counting, face detection, text detection, pose detection, or number-plate recognition.

Google MediaPipe Box Tracking paired with ML inference for Object Detection
Object Detection and Deep Learning
In the last few years, the rapid advances in deep learning techniques have greatly accelerated the momentum of object detection technology. With deep learning networks and the computing power of GPUs, the performance of object detectors and trackers has greatly improved, achieving significant breakthroughs in object detection.

Applied AI system based on the YOLOv7 algorithm trained for aircraft detection – Built on Viso Suite
Machine learning (ML) is a branch of artificial intelligence (AI). It involves learning patterns from examples or sample data: the machine is given the data and learns from it (in computer vision, typically supervised learning on annotated images).
Deep Learning is a specialized form of machine learning in which learning happens in multiple stages (layers). To learn more about the technological background, check out our article: What’s the difference between Machine Learning and Deep Learning?
Latest technological advances in computer vision
Deep Learning object detection and tracking are the fundamental basis of a wide range of modern computer vision applications. For example, the detection of objects enables intelligent healthcare monitoring, autonomous driving, smart video surveillance, anomaly detection, robot vision, and much more. Each AI vision application usually requires a combination of different algorithms that form a flow (pipeline) of multiple processing steps.

Computer Vision Applications built and delivered with Viso Suite
AI imaging technology has greatly progressed in recent years. A wide range of cameras can be used, including commercial security and CCTV cameras. By using a cross-compatible AI software platform like Viso Suite, there is no need to buy AI cameras with built-in image recognition capabilities, because the digital video stream of essentially any video camera can be analyzed using object detection models. As a result, applications become more flexible as they no longer depend on custom sensors, expensive installation, and embedded hardware systems that must be replaced every 3-5 years.
Meanwhile, computing power has dramatically increased and is becoming much more efficient. In past years, computing platforms moved toward parallelization through multi-core processing, graphical processing units (GPU), and AI accelerators such as tensor processing units (TPU).
Such hardware allows applying computer vision for object detection and tracking in near real-time environments. Hence, rapid development in deep convolutional neural networks (CNN) and GPU’s enhanced computing power are the main drivers behind the great advancement of computer vision based object detection.
Those advances enabled a key architectural concept called Edge AI. This concept is also called Intelligent Edge or Distributed Edge. It moves heavy AI workloads from the Cloud closer to the data source. This results in distributed, scalable, and much more efficient systems that allow the use of computer vision in business and mission-critical systems.
Edge AI involves IoT or AIoT, on-device machine learning with Edge Devices, and requires complex infrastructure. We enable organizations to build, deploy and scale their object detection applications while taking advantage of all those cutting-edge technologies. You can get the Whitepaper here.

End-to-end computer vision application platform Viso Suite
Disadvantages and Advantages of Object Detection
Object detectors are incredibly flexible and can be trained for a wide range of tasks and custom, special-purpose applications. The automatic identification of objects, persons, and scenes can provide useful information to automate tasks (counting, inspection, verification, etc.) across the value chains of businesses.
However, the main disadvantage of object detectors is that they are computationally very expensive and require significant processing power. Especially when object detection models are deployed at scale, the operating costs can quickly increase and challenge the economic viability of business use cases. Learn more in our related article What Does Computer Vision Cost?
How Object Detection works
Object detection can be performed using either traditional (1) image processing techniques or modern (2) deep learning networks.
  1. Image processing techniques generally don’t require historical data for training and are unsupervised in nature. OpenCV is a popular tool for image processing tasks.
    • Pros: These techniques do not require annotated images, that is, images that humans have labeled manually for supervised training.
    • Cons: These techniques are limited by several factors, such as complex scenes (without a unicolor background), occlusion (partially hidden objects), illumination and shadows, and clutter.
  2. Deep Learning methods generally depend on supervised or unsupervised learning, with supervised methods being the standard in computer vision tasks. The performance is limited by the computation power of GPUs, which is rapidly increasing year by year.
    • Pros: Deep learning object detection is significantly more robust to occlusion, complex scenes, and challenging illumination.
    • Cons: A huge amount of training data is required, and image annotation is labor-intensive and expensive. For example, 500,000 labeled images is still considered a small dataset for training a custom DL object detection algorithm. However, many benchmark datasets (MS COCO, Caltech, KITTI, PASCAL VOC, V5) provide labeled data.
Today, deep learning object detection is widely accepted by researchers and adopted by computer vision companies to build commercial products.

Deep Learning based object detection for vehicles (cars, trucks, bikes, etc.). An example frame of a commercial real-time application with AI recognition on the stream of IP cameras, built on Viso Suite.
Milestones in state-of-the-art Object Detection
The field of object detection is not as new as it may seem. In fact, object detection has evolved over the past 20 years. Its progress is usually divided into two historical periods (before and after the introduction of Deep Learning):
Before 2014 – Traditional Object Detection period
  1. Viola-Jones Detector (2001), the pioneering work that started the development of traditional object detection methods
  2. HOG Detector (2005), a popular feature descriptor for object detection in computer vision and image processing
  3. DPM (2008) with the first introduction of bounding box regression
After 2014 – Deep Learning Detection period
Most important two-stage object detection algorithms
  1. RCNN and SPPNet (2014)
  2. Fast RCNN and Faster RCNN (2015)
  3. Mask R-CNN (2017)
  4. Pyramid Networks/FPN (2017)
  5. G-RCNN (2021)
Most important one-stage object detection algorithms
  1. YOLO (2016)
  2. SSD (2016)
  3. RetinaNet (2017)
  4. YOLOv3 (2018)
  5. YOLOv4 (2020)
  6. YOLOR (2021)
  7. YOLOv7 (2022)
There is also an algorithm named YOLOv8, released by Ultralytics in January 2023; however, it was not released by the creators of the original YOLO algorithms. To understand which algorithm is the best for a given use case, it is important to understand the main characteristics. First, we will look into the key differences between the relevant image recognition algorithms for object detection before discussing the individual algorithms.

Real-time object detection in smart cities for pedestrian detection
One-stage vs. two-stage deep learning object detectors
As you can see in the list above, state-of-the-art object detection methods can be categorized into two main types: One-stage vs. two-stage object detectors.
In general, deep learning based object detectors extract features from the input image or video frame. An object detector solves two subsequent tasks:
  • Task #1: Find an arbitrary number of objects (possibly even zero), and
  • Task #2: Classify every single object and estimate its size with a bounding box.
To simplify the process, you can separate those tasks into two stages. Other methods combine both tasks into one step (single-stage detectors) to achieve higher speed at the cost of some accuracy.
Two-stage detectors: In two-stage object detectors, the approximate object regions are proposed using deep features before these features are used for the image classification as well as bounding box regression for the object candidate.
  • The two-stage architecture involves (1) object region proposal with conventional Computer Vision methods or deep networks, followed by (2) object classification based on features extracted from the proposed region with bounding-box regression.
  • Two-stage methods achieve the highest detection accuracy but are typically slower. Because of the many inference steps per image, the performance (frames per second) is not as good as one-stage detectors.
  • Various two-stage detectors include region convolutional neural network (RCNN), with evolutions Faster R-CNN or Mask R-CNN. The latest evolution is the granulated RCNN (G-RCNN).
  • Two-stage object detectors first find a region of interest and use this cropped region for classification. However, such multi-stage detectors are usually not end-to-end trainable because cropping is a non-differentiable operation.
One-stage detectors: One-stage detectors predict bounding boxes over the images without the region proposal step. This process consumes less time and can therefore be used in real-time applications.
  • One-stage object detectors prioritize inference speed and are super fast but not as good at recognizing irregularly shaped objects or a group of small objects.
  • The most popular one-stage detectors include the YOLO, SSD, and RetinaNet. The latest real-time detectors are YOLOv7 (2022), YOLOR (2021) and YOLOv4-Scaled (2020). View the benchmark comparisons below.
  • The main advantages of object detection with single-stage algorithms include a generally faster detection speed and greater structural simplicity and efficiency compared to multi-stage detectors.
How to compare object detection algorithms
The most popular benchmark is the Microsoft COCO dataset. Different models are typically evaluated according to the mean Average Precision (mAP) metric. In the following, we will compare the best real-time object detection algorithms.
It’s important to note that the algorithm selection depends on the use case and application; different algorithms excel at different tasks (e.g., Beta R-CNN shows the best results for Pedestrian Detection).
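Since the comparisons that follow all rest on Average Precision, a small sketch of how it can be computed for a single class may help. This is a simplified, non-interpolated variant written for illustration (each detection is matched greedily to the best still-unmatched ground-truth box at one IoU threshold); the official COCO evaluation is more involved and also averages over several IoU thresholds. The function names and the example boxes are ours.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """AP for one class. detections: list of (score, box); ground_truth: list of boxes."""
    matched = set()
    tp_flags = []
    # Visit detections from most to least confident.
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx is not None and best_iou >= iou_threshold
        if is_tp:
            matched.add(best_idx)
        tp_flags.append(is_tp)

    # Average the precision measured at the rank of every true positive.
    ap, true_positives = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            true_positives += 1
            ap += true_positives / rank
    return ap / len(ground_truth) if ground_truth else 0.0


# Hypothetical example: two objects, three detections (one is a false alarm).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 28))]
ap = average_precision(dets, gts)
print(f"AP = {ap:.3f}")
```

The mean Average Precision is then simply the mean of this value over all object classes.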
The best real-time object detection algorithm (Accuracy)
On the MS COCO dataset and based on the Average Precision (AP), the best real-time object detection algorithm is YOLOv7, followed by Vision Transformer (ViT) such as Swin and DualSwin, PP-YOLOE, YOLOR, YOLOv4, and EfficientDet.

Real-time Object Detection on COCO Benchmark: The state-of-the-art by Average Precision (AP)
The fastest real-time object detection algorithm (Inference time)
Also, on the MS COCO dataset, an important benchmark metric is inference time (ms/Frame, lower is better) or Frames per Second (FPS, higher is better).  The rapid advances in computer vision technology are very visible when looking at inference time comparisons.
Based on current inference times (lower is better), YOLOv7 achieves 3.5ms per frame, compared to YOLOv4 12ms, or the popular YOLOv3 29ms. Note how the introduction of YOLO (one-stage detector) led to dramatically faster inference times compared to any previously established methods, such as the two-stage method Mask R-CNN (333ms).
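The two metrics are two views of the same number: frames per second is 1000 divided by the milliseconds spent per frame. A quick sketch using the inference times quoted above (the function name is ours):

```python
def ms_to_fps(ms_per_frame: float) -> float:
    """Convert per-frame inference time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


latencies_ms = {"YOLOv7": 3.5, "YOLOv4": 12, "YOLOv3": 29, "Mask R-CNN": 333}
for name, ms in latencies_ms.items():
    print(f"{name}: {ms} ms per frame is about {ms_to_fps(ms):.1f} FPS")
```

So 3.5 ms per frame corresponds to roughly 286 frames per second, and 333 ms per frame to about 3.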
On a technical level, it is pretty complex to compare different architectures and model versions in a meaningful way. As Edge AI becomes an integral part of scalable AI solutions, newer algorithms come with a lighter-weight, edge-optimized version (see YOLOv7-lite or TensorFlow Lite).

The state-of-the-art by Frames per Second (FPS): The leading computer vision algorithm for real-time object detection on COCO can process 286 frames per second (YOLOv7), and is faster than YOLOv5, YOLOv4, YOLOR, and YOLOv3.

Performance comparison YOLOv7 vs. YOLOv5 vs. YOLOR and ViT Transformers.
Object Detection Use Cases and Applications
The use cases involving object detection are very diverse; there are almost unlimited ways to make computers see like humans to automate manual tasks or create new, AI-powered products and services. Object detection has been implemented in computer vision programs used for a range of applications, from sports production to productivity analytics. For an extensive list of recent computer vision applications, we recommend checking out our article about the 50+ Most Popular Computer Vision Applications in 2023.

Example of object detection in video analytics for people detection in dangerous areas using CCTV cameras
Today, object recognition is the core of most vision-based AI software and programs. Object detection plays an important role in scene understanding, which is popular in security, construction, transportation, medical, and military use cases.
  • Object detection in Retail. Strategically placed people counting systems throughout multiple retail stores are used to gather information about how customers spend their time and customer footfall. AI-based customer analysis to detect and track customers with cameras helps to gain an understanding of customer interaction and customer experience, optimize the store layout, and make operations more efficient. A popular use case is the detection of queues to reduce waiting time in retail stores.
  • Autonomous Driving. Self-driving cars depend on object detection to recognize pedestrians, traffic signs, other vehicles, and more. For example, Tesla’s Autopilot AI heavily utilizes object detection to perceive environmental and surrounding threats, such as oncoming vehicles or obstacles.
  • Animal detection in Agriculture. Object detection is used in agriculture for tasks such as counting, animal monitoring, and evaluation of the quality of agricultural products. Damaged produce can be detected while it is in processing using machine learning algorithms.
  • People detection in Security. A wide range of security applications in video surveillance are based on object detection, for example, to detect people in restricted or dangerous areas, suicide prevention, or automating inspection tasks in remote locations with computer vision.
  • Vehicle detection with AI in Transportation. Object recognition is used to detect and count vehicles for traffic analysis or to detect cars that stop in dangerous areas, for example, on crossroads or highways.
  • Medical feature detection in Healthcare. Object detection has allowed for many breakthroughs in the medical community. Because medical diagnostics rely heavily on the study of images, scans, and photographs, object detection involving CT and MRI scans has become extremely useful for diagnosing diseases, for example, with ML algorithms for tumor detection.

Commercial Deep Learning Application for Object Detection in Animal Monitoring, built on Viso Suite
Most Popular Object Detection Algorithms
Popular algorithms used to perform object detection include R-CNN (Region-Based Convolutional Neural Networks), Fast R-CNN, and YOLO (You Only Look Once). R-CNN and Fast R-CNN belong to the R-CNN family, while YOLO is part of the single-shot detector family. In the following, we provide an overview of the popular object detection algorithms and the differences between them.

Object detection overview of popular algorithms
YOLO – You Only Look Once
YOLO stands for “You Only Look Once”. It is a popular family of real-time object detection algorithms used in many commercial products by the largest tech companies that use computer vision. The original YOLO object detector was first released in 2016, and the new architecture was significantly faster than any other object detector.
Since then, multiple versions and variants of YOLO have been released, each providing a significant increase in performance and efficiency. Because various research teams released their own YOLO version, there were several controversies, for example, about YOLOv5. YOLOv4 is an improved version of  YOLOv3. The main innovations are mosaic data enhancement, self-adversarial training, and cross mini-batch normalization.
YOLOv7 is the fastest and most accurate real-time object detection model for computer vision tasks. The official YOLOv7 paper was released in July 2022 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Read our Guide about what’s new in YOLOv7.

Camera-based vehicle detection and person detection with YOLOv7 – Built on Viso Suite
SSD – Single-shot detector
SSD is a popular one-stage detector that can predict multiple classes. The method detects objects in images using a single deep neural network by discretizing the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
The object detector generates scores for the presence of each object category in each default box and adjusts the box to better fit the object shape. Also, the network combines predictions from multiple feature maps with different resolutions to handle objects of different sizes.
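To give a feel for what “a set of default boxes over different aspect ratios and scales per feature map location” means, here is a small sketch that lays such boxes over one square feature map. It follows the usual width = scale × √ratio, height = scale / √ratio convention and leaves out the extra box that SSD adds for aspect ratio 1; the function name and the numbers are ours, chosen for illustration.

```python
import math


def default_boxes(feature_map_size, scale, aspect_ratios):
    """Default boxes (cx, cy, w, h), normalised to [0, 1], for a square feature map."""
    boxes = []
    for row in range(feature_map_size):
        for col in range(feature_map_size):
            # Each box is centred on its feature map cell.
            cx = (col + 0.5) / feature_map_size
            cy = (row + 0.5) / feature_map_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(feature_map_size=4, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5))
print(len(boxes))  # 4 * 4 cells * 3 ratios = 48
print(boxes[0])
```

Repeating this for several feature maps, with a larger scale on the coarser maps, gives the full set of boxes for which the network predicts class scores and offsets.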
The SSD detector is easy to train and integrate into software systems that require an object detection component. In comparison to other single-stage methods, SSD has much better accuracy, even with smaller input image sizes.

Objects Detection with bounding boxes in a video frame
R-CNN – Region-based Convolutional Neural Networks
Region-based convolutional neural networks or regions with CNN features (R-CNNs) are pioneering approaches that apply deep models to object detection. R-CNN models first select several proposed regions from an image (for example, anchor boxes are one type of selection method) and then label their categories and bounding boxes (e.g., offsets). These labels are created based on predefined classes given to the program. They then use a convolutional neural network to perform forward computation to extract features from each proposed area.
In R-CNN, the input image is first divided into nearly two thousand region proposals, and a convolutional neural network is then applied to each region separately. The size of the regions is calculated, and the correct region is inserted into the neural network. Such a detailed method is inevitably slow. Training time is significantly greater compared to YOLO because R-CNN classifies and creates bounding boxes individually, and a neural network is applied to one region at a time.
In 2015, Fast R-CNN was developed with the intention of cutting down significantly on training time. While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. This is comparable to YOLO’s architecture, but YOLO remains faster than Fast R-CNN because its pipeline is simpler.
At the end of the network is a method known as Region of Interest (ROI) Pooling, which slices out each Region of Interest from the network’s output tensor, reshapes it, and classifies it (Image Classification). This makes Fast R-CNN more accurate than the original R-CNN. However, because of this recognition technique, fewer data inputs are required to train Fast R-CNN and R-CNN detectors.
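The “slice out and reshape” step can be sketched in a few lines. The version below works on a plain 2D list instead of a real tensor and uses max pooling into a fixed grid, which is the usual choice; the function name, the toy feature map, and the region are ours.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool the region roi = (row0, col0, row1, col1), end-exclusive and
    non-empty, of a 2D feature map into an output_size x output_size grid."""
    row0, col0, row1, col1 = roi
    height, width = row1 - row0, col1 - col0
    pooled = []
    for i in range(output_size):
        # Bin i spans floor(i*h/n) up to ceil((i+1)*h/n) within the region.
        r_start = row0 + (i * height) // output_size
        r_end = row0 + ((i + 1) * height + output_size - 1) // output_size
        out_row = []
        for j in range(output_size):
            c_start = col0 + (j * width) // output_size
            c_end = col0 + ((j + 1) * width + output_size - 1) // output_size
            out_row.append(max(feature_map[r][c]
                               for r in range(r_start, r_end)
                               for c in range(c_start, c_end)))
        pooled.append(out_row)
    return pooled


fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(fm, roi=(0, 1, 4, 5), output_size=2))  # [[9, 11], [21, 23]]
```

Whatever the size of the region, the output always has the same fixed shape, which is what lets a single classifier head handle every proposal.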
Mask R-CNN
Mask R-CNN is an extension of Faster R-CNN. The difference between the two is that Mask R-CNN adds a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN; it can run at 5 fps. Read more about Mask R-CNN here.

Mask R-CNN Example with image segmentation and object detection
SqueezeDet
SqueezeDet is the name of a deep neural network for computer vision that was released in 2016. SqueezeDet was specifically developed for autonomous driving, where it performs object detection using computer vision techniques. Like YOLO, it is a single-shot detector algorithm.
In SqueezeDet, convolutional layers are used not only to extract feature maps but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of SqueezeDet models contains only a single forward pass of a neural network, allowing them to be extremely fast.
MobileNet
MobileNet is a single-shot multi-box detection network used to run object detection tasks. This model is implemented using the Caffe framework. The model output is a typical vector containing the tracked object data.
YOLOR
YOLOR is a novel object detector introduced in 2021. The algorithm applies implicit and explicit knowledge to the model training at the same time. Therefore, YOLOR can learn a general representation and complete multiple tasks through this general representation.
Compared to other object detection methods on the COCO dataset benchmark, the mAP of YOLOR is 3.8% higher than that of PP-YOLOv2 at the same inference speed. Compared with Scaled-YOLOv4, the inference speed has been increased by 88%, which made it the fastest real-time object detector available at the time of its release. Read more about the advantages of object detection using this algorithm in our dedicated article YOLOR – You Only Learn One Representation.
What’s Next?
Object detection is one of the most fundamental and challenging problems in computer vision. As probably the most important computer vision technique, it has received great attention in recent years, especially with the success of deep learning methods that currently dominate the recent state-of-the-art detection methods.
Object detection methods are increasingly important for computer vision applications in any industry.

YOLO v8! The real state-of-the-art

My experience & experiment related to YOLO v8

Source: ultralytics

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced YOLO (You Only Look Once), a family of computer vision models that has attracted the attention of many AI enthusiasts. On January 10th, 2023, the latest version, YOLOv8, launched, claiming structural and architectural changes with better results.

Introduction :

I experimented with the brand-new, cutting-edge, state-of-the-art YOLO v8 from Ultralytics. YOLO versions 6 and 7 were released to the public over a period of 1–2 months. Both are PyTorch-based models.

Its predecessor, YOLO v5, is also PyTorch-based. YOLO v8 launched a few days ago (or, we could say, a few hours ago), and I wondered how it would fare under the same conditions. Last time I used the COCO dataset; this time, I used a license plate detection problem.

Dataset :

The dataset had almost 800 images for training, 226 for validation, and 113 for testing. All images were used as-is, without augmentation.

Dataset [Image by author]


We purposefully limited training to 100 epochs to see how each model performs during the warm-up iterations.


PyTorch-based YOLO v5, YOLO v6, YOLO v7 & YOLO v8

As the docs say, YOLOv8 is a cutting-edge, state-of-the-art (SOTA) model that builds upon the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility.

Comparison with other YOLOs [Source: Ultralytics]

It uses anchor-free detection and new convolutional layers to make predictions more accurate.

Comparison of different versions of V8 [Source: Roboflow]


The results that YOLOv8 achieved on RF100 improved on those of the other versions.

Comparison of different versions of V8 [Source: Roboflow]

Results on the custom dataset:

Now let’s see whether YOLO v8 really works on custom datasets. Below are the results of YOLO v8 on the license plate detection problem.

Training time for the same dataset and the same number of epochs [Image by author]

After training for the predefined number of epochs, I calculated the mean average precision (mAP) for every version.

mAP value for all versions of YOLO on the custom dataset [Image by author]

The figures above show how v8 outperforms the others: it gives the highest mAP value while also taking less time to train. On this dataset, the anchor-free detector was both faster and more accurate than the previous versions.

How well any model works depends entirely on the data and the problem statement, but the new additions do improve things. This time we did not measure latency, but those results would be useful for further analysis.
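
For readers new to the metric, the sketch below shows one common way to compute average precision (AP) for a single class and then average it across classes into mAP. It is a minimal illustration and not the evaluation code behind the results reported here: the function names are hypothetical, the matching of detections to ground-truth boxes (the is_true_positive flags) is assumed to have been done already at some IoU threshold, and benchmark tools may interpolate the precision-recall curve differently.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scores: confidence of each detection.
    is_true_positive: True where the detection matched a ground-truth box
        (the matching itself is assumed to be done elsewhere).
    num_ground_truth: number of ground-truth boxes of this class.
    """
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Make precision non-increasing as recall grows (all-point interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum the rectangle areas wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```

With three detections ranked hit, miss, hit against two ground-truth plates, this returns an AP of 5/6; a detector whose detections are all hits scores 1.0.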

Actual output for license plate detection problem [Source: by author]

If you want to peer into the code yourself, check out the YOLOv8 repository and view this code differential to see how some of the research was done.


The extensibility of YOLOv8 is an important characteristic. It is designed as a framework that supports all previous YOLO versions, making it simple to switch between them and compare their performance. This makes YOLOv8 a good option for those who wish to benefit from the most recent YOLO technology while keeping their current YOLO models functional.


  1. Training time was a big concern given how sharply it grew from v5 to v7, but v8 takes only about 60% of the time to train while producing a higher mean average precision. The issue of prolonged training is therefore partly addressed.

  2. v8 achieves a better trade-off between training time and precision.

  3. A new backbone network, a new anchor-free detection head, and a new loss function make things much faster.

Want more on YOLO v8? Use the links below.

  1. YOLOv8 repository — V8
  2. code differential — V8
  3. Understanding YOLOs
  4. Understanding V8
  5. Docs V8
  6. Understanding V8- Video

YOLO Object Detection Explained

Understand YOLO object detection, its benefits, how it has evolved over the last couple of years and some real-life applications.

Object detection is a technique used in computer vision for the identification and localization of objects within an image or a video.

Image Localization is the process of identifying the correct location of one or multiple objects using bounding boxes, which correspond to rectangular shapes around the objects.

This process is sometimes confused with image classification or image recognition, which aims to predict the class of an image or an object within an image into one of the categories or classes.

The illustration below corresponds to the visual representation of the previous explanation. The object detected within the image is “Person.”

Object detection illustrated from image recognition and localization

Image by Author

In this conceptual blog, you will first learn about the benefits of object detection, before being introduced to YOLO, the state-of-the-art object detection algorithm.

In the second part, we will focus more on the YOLO algorithm and how it works. After that, we will provide some real-life applications using YOLO.

The last section will explain how YOLO evolved from 2015 to 2022 before concluding on the next steps.

What is YOLO?

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their famous research paper “You Only Look Once: Unified, Real-Time Object Detection”.

The authors frame the object detection problem as a regression problem instead of a classification task: a single convolutional neural network (CNN) predicts spatially separated bounding boxes and associates class probabilities with each of them.

By taking the Image Processing with Keras in Python course, you will be able to build Keras based deep neural networks for image classification tasks.

If you are more interested in Pytorch, Deep Learning with Pytorch will teach you about convolutional neural networks and how to use them to build much more powerful models.

Some of the reasons why YOLO is leading the competition include its:

  • Speed
  • Detection accuracy
  • Good generalization
  • Open-source

1- Speed

YOLO is extremely fast because it does not deal with complex pipelines. It can process images at 45 Frames Per Second (FPS). In addition, YOLO reaches more than twice the mean Average Precision (mAP) compared to other real-time systems, which makes it a great candidate for real-time processing.

From the graphic below, we observe that YOLO is far beyond the other object detectors with 91 FPS.

YOLO Speed compared to other state-of-the-art object detectors (source)

2- High detection accuracy

YOLO achieves high detection accuracy and makes very few background errors, that is, few false detections on background regions.

3- Better generalization

This is especially true for the new versions of YOLO, which will be discussed later in the article. With those advancements, YOLO pushed a little further by providing a better generalization for new domains, which makes it great for applications relying on fast and robust object detection.

For instance the Automatic Detection of Melanoma with Yolo Deep Convolutional Neural Networks paper shows that the first version YOLOv1 has the lowest mean average precision for the automatic detection of melanoma disease, compared to YOLOv2 and YOLOv3.

4- Open source

Making YOLO open-source led the community to constantly improve the model. This is one of the reasons why YOLO has made so many improvements in such a limited time.

YOLO Architecture

The YOLO architecture is similar to GoogLeNet. As illustrated below, it has 24 convolutional layers, four max-pooling layers, and two fully connected layers.

YOLO Architecture from the original paper (Modified by Author)

The architecture works as follows:

  • The input image is resized to 448×448 before going through the convolutional network.
  • A 1×1 convolution is first applied to reduce the number of channels, followed by a 3×3 convolution to generate a cuboidal output.
  • The activation function under the hood is leaky ReLU, except for the final layer, which uses a linear activation function.
  • Some additional techniques, such as batch normalization and dropout, regularize the model and prevent it from overfitting.

By completing the Deep Learning in Python course, you will be ready to use Keras to train and test complex, multi-output networks and dive deeper into deep learning.

How Does YOLO Object Detection Work?

Now that you understand the architecture, let’s have a high-level overview of how the YOLO algorithm performs object detection using a simple use case.

“Imagine you built a YOLO application that detects players and soccer balls from a given image. 

But how can you explain this process to someone, especially non-initiated people?

 → That is the whole point of this section. You will understand the whole process of how YOLO performs object detection; how to get image (B) from image (A)”

YOLO Object Detection

Image by Jeffrey F Lin on Unsplash / Image by Author

The algorithm relies on the following four techniques:

  • Residual blocks
  • Bounding box regression
  • Intersection Over Union or IOU for short
  • Non-Maximum Suppression

Let’s have a closer look at each one of them.

1- Residual blocks

This first step starts by dividing the original image (A) into N×N grid cells of equal shape, where N is 4 in our case, as shown in the image on the right. Each cell in the grid is responsible for localizing and predicting the class of the object that it covers, along with the probability/confidence value. (Despite the heading, this step is a plain grid division of the image; it is unrelated to the residual connections used in ResNet-style networks.)

Application of grid cells to the original image

Image by Author

2- Bounding box regression

The next step is to determine the bounding boxes which correspond to rectangles highlighting all the objects in the image. We can have as many bounding boxes as there are objects within a given image.

YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where Y is the final vector representation for each bounding box.

Y = [pc, bx, by, bh, bw, c1, c2]

This is especially important during the training phase of the model.

  • pc corresponds to the probability score of the grid cell containing an object. For instance, all the grid cells in red will have a probability score higher than zero. The image on the right is the simplified version, since the probability of each yellow cell is zero (insignificant).

Identification of significant and insignificant grids

Image by Author

  • bx, by are the x and y coordinates of the center of the bounding box with respect to the enveloping grid cell.
  • bh, bw correspond to the height and the width of the bounding box with respect to the enveloping grid cell.
  • c1 and c2 correspond to the two classes Player and Ball. You can have as many classes as your use case requires.
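
As a rough illustration of the vector Y described above, the sketch below encodes one labelled box into [pc, bx, by, bh, bw, c1, c2] following the conventions of this section: the center is expressed relative to the enveloping grid cell, and the height and width are expressed in units of one cell. The function name, the pixel-based box format, and the example numbers are assumptions made for this illustration.

```python
def encode_box(box, class_index, image_size, grid_size, num_classes=2):
    """Return (row, col, Y) for one ground-truth box.

    box: (x_center, y_center, width, height) in pixels (assumed format).
    class_index: 0 for Player, 1 for Ball in the two-class example.
    image_size: (image_width, image_height) in pixels.
    grid_size: N for an N x N grid.
    """
    x_center, y_center, width, height = box
    image_width, image_height = image_size
    cell_width = image_width / grid_size
    cell_height = image_height / grid_size

    # The cell that contains the box center is responsible for the object.
    col = min(int(x_center // cell_width), grid_size - 1)
    row = min(int(y_center // cell_height), grid_size - 1)

    # Center offsets inside that cell, each in [0, 1).
    bx = x_center / cell_width - col
    by = y_center / cell_height - row

    # Height and width in units of one grid cell (can exceed 1).
    bh = height / cell_height
    bw = width / cell_width

    one_hot = [1.0 if k == class_index else 0.0 for k in range(num_classes)]
    return row, col, [1.0, bx, by, bh, bw] + one_hot
```

For a 400×400 image with a 4×4 grid, a player centered at (350, 330) that is 80 pixels wide and 150 pixels tall lands in the bottom-right cell (row 3, column 3), with offsets of about 0.5 and 0.3, a height of 1.5 cells, and a width of 0.8 cells.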

To understand, let’s pay closer attention to the player on the bottom right.

Bounding box regression identification

Image by Author

3- Intersection Over Union or IOU

Most of the time, a single object in an image has multiple candidate boxes predicted for it, even though not all of them are relevant. The IOU (a value between 0 and 1) measures how much two boxes overlap, and its goal here is to discard the irrelevant candidates and keep only the relevant ones. Here is the logic behind it:

  • The user defines an IOU selection threshold, which can be, for instance, 0.5.
  • Then YOLO computes the IOU of each candidate, which is the intersection area divided by the union area of the two boxes being compared.
  • Finally, it ignores the predictions with an IOU ≤ threshold and keeps those with an IOU > threshold.
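
The IOU computation itself is short. The sketch below is one way to write it, assuming boxes are given as corner coordinates (x_min, y_min, x_max, y_max); the function name and box format are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for boxes that do not overlap.
    inter_width = max(0.0, inter_x_max - inter_x_min)
    inter_height = max(0.0, inter_y_max - inter_y_min)
    intersection = inter_width * inter_height

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, so their IOU is 1 / (4 + 4 - 1) = 1/7, well below a 0.5 threshold.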

Below is an illustration of applying the grid selection process to the bottom left object. We can observe that the object originally had two grid candidates, then only “Grid 2” was selected at the end.

Process of selecting the best grids for prediction

Image by Author

4- Non-Max Suppression or NMS

Setting a threshold for the IOU is not always enough, because an object can have multiple boxes with an IOU beyond the threshold, and keeping all of them may add noise. This is where NMS comes in: among boxes that overlap each other, it keeps only the one with the highest detection probability score and suppresses the rest.
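
A greedy version of NMS can be sketched in a few lines. This is a generic illustration rather than the code of any particular YOLO release; the helper names, the corner-based box format, and the default threshold of 0.5 are assumptions.

```python
def _iou(a, b):
    """IOU of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    # Process boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order
                 if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Given two heavily overlapping boxes around the same player and a third box elsewhere, only the higher-scoring box of the pair and the third box are kept. In practice NMS is usually applied per class, which this sketch leaves out.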

YOLO Applications

YOLO object detection has different applications in our day-to-day life. In this section, we will cover some of them in the following domains: healthcare, agriculture, security surveillance, and self-driving cars.

1- Application in industries

Object detection has been introduced in many practical industries such as healthcare and agriculture. Let’s understand each one with specific examples.

Healthcare


Specifically in surgery, it can be challenging to localize organs in real-time, due to biological diversity from one patient to another. Kidney Recognition in CT used YOLOv3 to facilitate localizing kidneys in 2D and 3D from computerized tomography (CT) scans.

The Biomedical Image Analysis in Python course can help you learn the fundamentals of exploring, manipulating, and measuring biomedical image data using Python.

2D Kidney detection by YOLOv3 (Image from Kidney Recognition in CT using YOLOv3)

Agriculture


Artificial Intelligence and robotics are playing a major role in modern agriculture. Harvesting robots are vision-based robots that were introduced to replace the manual picking of fruits and vegetables. One of the best models in this field uses YOLO. In Tomato detection based on modified YOLOv3 framework, the authors describe how they used YOLO to identify the types of fruits and vegetables for efficient harvest.

Comparison of YOLO-tomato models

Image from Tomato detection based on modified YOLOv3 framework (source)

2- Security surveillance

Even though object detection is mostly used in security surveillance, this is not the only application. YOLOv3 was also used during the COVID-19 pandemic to estimate social distance violations between people.

You can further your reading on this topic from A deep-learning-based social distance monitoring framework for COVID-19.

3- Self-driving cars

Real-time object detection is part of the DNA of autonomous vehicle systems. This integration is vital for autonomous vehicles because they need to properly identify the correct lanes and all the surrounding objects and pedestrians to increase road safety. The real-time aspect of YOLO makes it a better candidate compared to simple image segmentation approaches.

YOLO, YOLOv2, YOLO9000, YOLOv3, YOLOv4, YOLOR, YOLOX, YOLOv5, YOLOv6, YOLOv7 and Differences

Since the first release of YOLO in 2015, it has evolved a lot with different versions. In this section, we will understand the differences between each of these versions.

YOLO Timeframe 2015 to 2022

YOLO or YOLOv1, the starting point

This first version of YOLO was a game changer for object detection, because of its ability to quickly and efficiently recognize objects.

However, like many other solutions, the first version of YOLO has its own limitations:

  • It struggles to detect smaller objects that appear in groups, such as a crowd of people in a stadium. This is because each grid cell in the YOLO architecture is designed to detect a single object.
  • YOLO also struggles to detect objects with new or unusual shapes.
  • Finally, the loss function used to approximate the detection performance treats errors the same for both small and large bounding boxes, which leads to incorrect localizations.

YOLOv2 or YOLO9000

YOLOv2 was created in 2016 with the idea of making the YOLO model better, faster and stronger.

The improvements include, but are not limited to, the use of Darknet-19 as the new architecture, batch normalization, higher input resolution, convolution layers with anchors, dimensionality clustering, and fine-grained features.

1- Batch normalization

Adding batch normalization improved the performance by 2% mAP. It also had a regularizing effect, which helps prevent overfitting.

2- Higher input resolution

YOLOv2 uses a higher-resolution 448×448 input instead of 224×224: its classification network is fine-tuned at that resolution for 10 epochs on ImageNet, which gives the filters time to adjust to higher-resolution images. This approach increased the accuracy by almost 4% mAP.

3- Convolution layers using anchor boxes

Instead of predicting the exact coordinates of the bounding boxes directly, as YOLOv1 does, YOLOv2 simplifies the problem by removing the fully connected layers and predicting boxes relative to anchor boxes. This approach slightly decreased the accuracy but improved the model recall by 7%, which gives more room for improvement.

4- Dimensionality clustering

The previously mentioned anchor boxes are automatically found by YOLOv2 using k-means dimensionality clustering with k=5 instead of performing a manual selection. This novel approach provided a good tradeoff between the recall and the precision of the model.
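
The clustering idea can be sketched as follows. Box shapes, given as (width, height) pairs, are grouped with k-means using the distance 1 - IOU, so that boxes of similar shape end up together regardless of their absolute size. This is a simplified illustration: the function names are hypothetical, the first k boxes are used as starting centroids so that the result is reproducible, and each centroid is updated with a plain mean, which may differ from what a given implementation does.

```python
def wh_iou(a, b):
    """IOU of two (width, height) shapes aligned at a common corner."""
    intersection = min(a[0], b[0]) * min(a[1], b[1])
    return intersection / (a[0] * a[1] + b[0] * b[1] - intersection)


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchor shapes."""
    # Assumption for this sketch: deterministic start from the first k boxes.
    centroids = [boxes[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest (1 - IOU) distance means largest IOU.
            nearest = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[nearest].append(box)

        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])

        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)
```

Called with k=5 on the box shapes of a training set, this returns five anchor shapes of the kind YOLOv2 uses in place of hand-picked ones.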

For a better understanding of the k-means dimensionality clustering, take a look at our K-Means Clustering in Python with scikit-learn and K-Means Clustering in R tutorials. They dive into the concept of k-means clustering using Python and R.

5- Fine-grained features

YOLOv2 makes its predictions on a 13×13 feature map, which is sufficient for detecting large objects. To detect much finer objects, the architecture adds a passthrough step that turns the earlier 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map and concatenates it with the original features. This approach improved the model performance by 1%.
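
The reshaping from 26 × 26 × 512 to 13 × 13 × 2048 keeps every value and only moves each 2×2 spatial block into the channel dimension. The sketch below shows the idea on nested Python lists indexed as [row][column][channel]. The function name is hypothetical and the order in which the four neighbours are stacked is an assumption of this example; an actual implementation may order the channels differently.

```python
def space_to_depth(feature_map, block=2):
    """Move each block x block spatial patch into the channel dimension.

    feature_map: nested lists indexed as [row][col][channel].
    Returns a map with height/block rows, width/block columns and
    block * block times as many channels.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    output = []
    for i in range(0, height, block):
        output_row = []
        for j in range(0, width, block):
            stacked = []
            # Concatenate the channels of the neighbouring positions.
            for di in range(block):
                for dj in range(block):
                    stacked.extend(feature_map[i + di][j + dj])
            output_row.append(stacked)
        output.append(output_row)
    return output
```

Applied to a 26 × 26 map with 512 channels, the result has 13 rows, 13 columns and 2048 channels, so it can be concatenated with the 13 × 13 features along the channel axis.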

YOLOv3 — An incremental improvement

YOLOv3 is an incremental improvement on YOLOv2.

The change mainly consists of a new network architecture: Darknet-53. The backbone has 53 convolutional layers, and the full detection network built on it, with upsampling layers and residual blocks, is 106 layers deep. It is much bigger and more accurate than Darknet-19, the backbone of YOLOv2, although it is slower. This new architecture has been beneficial on many levels:

1- Better bounding box prediction

A logistic regression model is used by YOLOv3 to predict the objectness score for each bounding box.

2- More accurate class predictions

Instead of using softmax as in YOLOv2, independent logistic classifiers have been introduced to predict the class of the bounding boxes. This is especially useful in more complex domains with overlapping labels (e.g. Person → Soccer Player). Using a softmax would constrain each box to have only one class, which is not always true.
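
The difference is easy to see on a toy example. In the sketch below, the class names, the raw scores (logits) and the 0.5 threshold are made up for illustration. Softmax forces the probabilities to sum to one, so the classes compete with each other, whereas independent sigmoids score each class on its own and allow several labels at once.

```python
import math


def softmax(logits):
    """Probabilities that sum to 1, so the classes compete with each other."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single class."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_classes(logits, names, threshold=0.5):
    """Keep every class whose own sigmoid score clears the threshold."""
    return [name for name, z in zip(names, logits) if sigmoid(z) > threshold]


names = ["Person", "Soccer Player", "Ball"]
logits = [3.0, 2.5, -4.0]  # hypothetical raw scores for one box
single_label = names[softmax(logits).index(max(softmax(logits)))]
multi_label = multilabel_classes(logits, names)
```

Here single_label is "Person" only, while multi_label contains both "Person" and "Soccer Player".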

3- More accurate prediction at different scales

YOLOv3 makes predictions at three different scales for each location within the input image, using features upsampled from the previous layers. This strategy yields fine-grained and more meaningful semantic information, which improves the quality of the detections.

YOLOv4 — Optimal Speed and Accuracy of Object Detection

This version of YOLO offers a better trade-off between speed and accuracy than all the previous versions and other state-of-the-art object detectors of its time.

The image below shows YOLOv4 improving on YOLOv3’s AP and FPS by 10% and 12% respectively.

YOLOv4 Speed compared to YOLOv3 and other state-of-the-art object detectors (source)

YOLOv4 is specifically designed for production systems and optimized for parallel computations.

The backbone of YOLOv4’s architecture is CSPDarknet53, a network containing 29 convolution layers with 3 × 3 filters and approximately 27.6 million parameters.

This architecture, compared to YOLOv3, adds the following components for better object detection:

  • A Spatial Pyramid Pooling (SPP) block significantly increases the receptive field, separates out the most relevant context features, and does not affect the network speed.
  • Instead of the Feature Pyramid Network (FPN) used in YOLOv3, YOLOv4 uses PANet for parameter aggregation from different detection levels.
  • Data augmentation uses the mosaic technique, which combines four training images, in addition to a self-adversarial training approach.
  • Hyper-parameters are selected using genetic algorithms.

YOLOR — You Only Learn One Representation

As a Unified Network for Multiple Tasks, YOLOR is based on the unified network which is a combination of explicit and implicit knowledge approaches.

YOLOR unified network architecture

Unified network architecture (source)

Explicit knowledge is acquired through normal, conscious learning. Implicit knowledge, on the other hand, is acquired subconsciously (from experience).

Combining these two techniques, YOLOR is able to create a more robust architecture based on three processes: (1) feature alignment, (2) prediction refinement for object detection, and (3) canonical representation for multi-task learning.

1- Feature alignment

This approach introduces an implicit representation into the feature map of every feature pyramid network (FPN), which improves the precision by about 0.5%.

2- Prediction refinement for object detection

The model predictions are refined by adding implicit representation to the output layers of the network.

3- Canonical representation for multi-task learning

Performing multi-task training requires the execution of the joint optimization on the loss function shared across all the tasks. This process can decrease the overall performance of the model, and this issue can be mitigated with the integration of the canonical representation during the model training.

From the following graphic, we can observe that YOLOR achieved state-of-the-art inference speed on the MS COCO data compared to other models.


YOLOR performance vs. YOLOv4 and other models (source)

YOLOX — Exceeding YOLO Series in 2021

YOLOX uses a baseline that is a modified version of YOLOv3, with Darknet-53 as its backbone.

Published in the paper Exceeding YOLO Series in 2021, YOLOX brings to the table the following four key characteristics to create a better model compared to the older versions.

1- An efficient decoupled head

The coupled head used in the previous YOLO versions is shown to reduce the models’ performance. YOLOX uses a decoupled head instead, which separates the classification and localization tasks and thus increases the performance of the model.

2- Robust data augmentation

Integration of Mosaic and MixUp into the data augmentation approach considerably increased YOLOX’s performance.

3- An anchor-free system

Anchor-based detectors rely on anchor boxes that are chosen in advance (typically by clustering) and predict several boxes per location, which adds to the inference cost. Removing the anchor mechanism in YOLOX reduced the number of predictions per image and significantly improved inference time.
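
To get a feel for the reduction, the sketch below counts the raw predictions a detector makes per image. The 640×640 input size and the three output strides of 8, 16 and 32 are assumptions chosen for illustration; the comparison is between three boxes per location for an anchor-based head and one per location for an anchor-free head.

```python
def predictions_per_image(input_size, strides, boxes_per_location):
    """Count raw box predictions over all output scales."""
    # Each stride s yields an (input_size // s) x (input_size // s) grid.
    locations = sum((input_size // s) ** 2 for s in strides)
    return locations * boxes_per_location


# Assumed setup: 640x640 input, output strides of 8, 16 and 32.
anchor_based = predictions_per_image(640, (8, 16, 32), boxes_per_location=3)
anchor_free = predictions_per_image(640, (8, 16, 32), boxes_per_location=1)
```

Under these assumptions the anchor-based head produces 25,200 candidate boxes per image and the anchor-free head 8,400, a threefold reduction in the boxes that later stages have to score and filter.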

4- SimOTA for label assignment

Instead of using the intersection over union (IoU) approach, the authors introduced SimOTA, a more robust label assignment strategy that achieves state-of-the-art results by not only reducing the training time but also avoiding extra hyperparameter issues. In addition, it improved the detection mAP by 3%.

YOLOv5


Unlike the other versions, YOLOv5 does not have a published research paper, and it is the first version of YOLO to be implemented in PyTorch rather than Darknet.

Released by Glenn Jocher in June 2020, YOLOv5, similarly to YOLOv4, uses CSPDarknet53 as the backbone of its architecture. The release includes four different model sizes: YOLOv5s (smallest), YOLOv5m, YOLOv5l, and YOLOv5x (largest).

One of the major improvements in YOLOv5 architecture is the integration of the Focus layer, represented by a single layer, which is created by replacing the first three layers of YOLOv3. This integration reduced the number of layers, and number of parameters and also increased both forward and backward speed without any major impact on the mAP.

The following illustration compares the training time between YOLOv4 and YOLOv5s.

YOLOv4 vs YOLOv5 Training Time

Training time comparison between YOLOv4 and YOLOv5 (source)

YOLOv6 — A Single-Stage Object Detection Framework for Industrial Applications

Dedicated to industrial applications with hardware-friendly efficient design and high performance, the YOLOv6 (MT-YOLOv6) framework was released by Meituan, a Chinese e-commerce company.

Written in Pytorch, this new version was not part of the official YOLO but still got the name YOLOv6 because its backbone was inspired by the original one-stage YOLO architecture.

YOLOv6 introduced three significant improvements to the previous YOLOv5: a hardware-friendly backbone and neck design, an efficient decoupled head, and a more effective training strategy.

YOLOv6 provides outstanding results compared to the previous YOLO versions in terms of accuracy and speed on the COCO dataset as illustrated below.

YOLO Model Comparison

Comparison of state-of-the-art efficient object detectors. All models are tested with TensorRT 7 except that the quantized model is with TensorRT 8 (source)

  • YOLOv6-N achieved 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU.
  • YOLOv6-S reached a new state-of-the-art 43.3% AP at 869 FPS.
  • YOLOv6-M and YOLOv6-L also achieved better accuracy, at 49.5% and 52.3% AP respectively, at a similar inference speed.

All these characteristics make YOLOv6 the right algorithm for industrial applications.

YOLOv7 — Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

YOLOv7 was released in July 2022 in the paper Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. This version marked a significant step in the field of object detection: according to the paper, it surpassed the previous real-time detectors in both accuracy and speed.

YOLOV7 VS Competitors

Comparison of YOLOv7 inference time with other real-time object detectors (source)

YOLOv7 made major changes at two levels: (1) its architecture and (2) its trainable bag-of-freebies:

1- Architectural level

YOLOv7 reformed its architecture by integrating the Extended Efficient Layer Aggregation Network (E-ELAN), which allows the model to learn more diverse features.

In addition, YOLOv7 uses compound scaling for concatenation-based models, scaling depth and width together, and builds on the models it is derived from, such as YOLOv4, Scaled-YOLOv4, and YOLOR. This allows the model to meet the needs of different inference speeds.

YOLO Compound Scaling Depth

Compound scaling up depth and width for concatenation-based model (source)

2- Trainable bag-of-freebies

The term bag-of-freebies refers to methods that improve the model’s accuracy by changing only the training procedure, so the training cost may go up but the inference cost does not. This is why YOLOv7 could raise detection accuracy without giving up inference speed.

Conclusion


This article has covered the benefits of YOLO compared to other state-of-the-art object detection algorithms, and its evolution from 2015 to 2022.

Given the rapid advancement of YOLO, it is likely to remain a leading approach in the field of object detection for a long time.

The next step of this article will be the application of the YOLO algorithm to real-world cases. Until then, our Introduction to Deep Learning in Python course can help you learn the fundamentals of neural networks and how to build deep learning models using Keras 2.0 in Python.


Why the YOLO algorithm is important

Object detection seeks to answer two basic questions:

  1. What is the object? This question seeks to identify the object in a specific image.
  2. Where is it? This question seeks to establish the exact location of the object within the image.

YOLO algorithm is important because of the following reasons:

  • Speed: This algorithm improves the speed of detection because it can predict objects in real-time.
  • High accuracy: YOLO is a predictive technique that provides accurate results with minimal background errors.
  • Learning capabilities: The algorithm has excellent learning capabilities that enable it to learn the representations of objects and apply them in object detection.

How the YOLO algorithm works

YOLO algorithm works using the following three techniques:

  • Residual blocks
  • Bounding box regression
  • Intersection Over Union (IOU)

Residual blocks

First, the image is divided into an S × S grid of cells. The following image shows how an input image is divided into grid cells.


Image Source

In the image above, there are many grid cells of equal dimension. Every grid cell detects the objects that appear within it. For example, if an object center appears within a certain grid cell, then this cell will be responsible for detecting it.

Bounding box regression

A bounding box is an outline that highlights an object in an image.

Every bounding box in the image consists of the following attributes:

  • Width (bw)
  • Height (bh)
  • Class (for example, person, car, traffic light, etc.)- This is represented by the letter c.
  • Bounding box center (bx,by)

The following image shows an example of a bounding box. The bounding box has been represented by a yellow outline.

Bounding Box

Image Source

YOLO uses a single bounding box regression to predict the height, width, center, and class of objects. In the image above, pc represents the probability of an object appearing in the bounding box.

Intersection over union (IOU)

Intersection over union (IOU) is a metric in object detection that describes how much two boxes overlap. YOLO uses IOU to select an output box that surrounds the object as closely as possible.

Each grid cell is responsible for predicting the bounding boxes and their confidence scores. The IOU is equal to 1 if the predicted bounding box is the same as the real box. This mechanism eliminates predicted boxes that overlap poorly with the real box.

The following image provides a simple example of how IOU works.


Image Source

In the image above, there are two bounding boxes, one in green and the other one in blue. The blue box is the predicted box while the green box is the real box. YOLO aims to make the two bounding boxes match as closely as possible.

Combination of the three techniques

The following image shows how the three techniques are applied to produce the final detection results.

How YOLO Algorithm Works

Image Source

First, the image is divided into grid cells. Each grid cell predicts B bounding boxes and provides their confidence scores. The cells also predict the class probabilities to establish the class of each object.

For example, we can notice at least three classes of objects: a car, a dog, and a bicycle. All the predictions are made simultaneously using a single convolutional neural network.
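
The size of that single prediction follows directly from the grid size S, the number of boxes per cell B, and the number of classes C. The sketch below works it out, using the setup from the original YOLO paper as the example (S = 7, B = 2, C = 20 for PASCAL VOC); the function name is made up for this illustration.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the prediction tensor produced in one forward pass."""
    # Each box carries 5 numbers: x, y, width, height and a confidence score.
    depth = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, depth)


# Setup of the original YOLO paper on PASCAL VOC: S=7, B=2, C=20.
shape = yolo_output_shape(7, 2, 20)
```

This gives a 7 × 7 × 30 tensor: 49 cells, each predicting two boxes plus one set of 20 class probabilities.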

Intersection over union is used to keep the predicted bounding boxes that best match the real boxes of the objects. It eliminates unnecessary bounding boxes that do not fit the objects well (for example in height and width). The final detection consists of unique bounding boxes that fit the objects closely.

For example, the car is surrounded by the pink bounding box while the bicycle is surrounded by the yellow bounding box. The dog has been highlighted using the blue bounding box.

Applications of YOLO

YOLO algorithm can be applied in the following fields:

  • Autonomous driving: YOLO algorithm can be used in autonomous cars to detect objects around cars such as vehicles, people, and parking signals. Object detection in autonomous cars is done to avoid collision since no human driver is controlling the car.
  • Wildlife: This algorithm is used to detect various types of animals in forests. This type of detection is used by wildlife rangers and journalists to identify animals in videos (both recorded and real-time) and images. Some of the animals that can be detected include giraffes, elephants, and bears.
  • Security: YOLO can also be used in security systems to enforce security in an area. Let’s assume that people have been restricted from passing through a certain area for security reasons. If someone passes through the restricted area, the YOLO algorithm will detect him/her, which will require the security personnel to take further action.

YOLO: Real-Time Object Detection

You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev.

Comparison to Other Detectors

YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU, YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model, no retraining required!

Performance on the COCO Dataset

Model              Train          Test      mAP   FLOPS      FPS  Cfg  Weights
SSD300             COCO trainval  test-dev  41.2  -          46   -    link
SSD500             COCO trainval  test-dev  46.5  -          19   -    link
YOLOv2 608×608     COCO trainval  test-dev  48.1  62.94 Bn   40   cfg  weights
Tiny YOLO          COCO trainval  test-dev  23.7  5.41 Bn    244  cfg  weights
SSD321             COCO trainval  test-dev  45.4  -          16   -    link
DSSD321            COCO trainval  test-dev  46.1  -          12   -    link
R-FCN              COCO trainval  test-dev  51.9  -          12   -    link
SSD513             COCO trainval  test-dev  50.4  -          8    -    link
DSSD513            COCO trainval  test-dev  53.3  -          6    -    link
FPN FRCN           COCO trainval  test-dev  59.1  -          6    -    link
Retinanet-50-500   COCO trainval  test-dev  50.9  -          14   -    link
Retinanet-101-500  COCO trainval  test-dev  53.1  -          11   -    link
Retinanet-101-800  COCO trainval  test-dev  57.5  -          5    -    link
YOLOv3-320         COCO trainval  test-dev  51.5  38.97 Bn   45   cfg  weights
YOLOv3-416         COCO trainval  test-dev  55.3  65.86 Bn   35   cfg  weights
YOLOv3-608         COCO trainval  test-dev  57.9  140.69 Bn  20   cfg  weights
YOLOv3-tiny        COCO trainval  test-dev  33.1  5.56 Bn    220  cfg  weights
YOLOv3-spp         COCO trainval  test-dev  60.6  141.45 Bn  20   cfg  weights
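To make the speed/accuracy trade-off concrete, the sketch below picks the most accurate YOLOv3 variant that still meets a frame-rate budget. The mAP and FPS figures are copied from the table above. The helper name best_model_for_fps is hypothetical.

```python
# (mAP, FPS) for the YOLOv3 variants, copied from the table above.
YOLOV3_VARIANTS = {
    "YOLOv3-320": (51.5, 45),
    "YOLOv3-416": (55.3, 35),
    "YOLOv3-608": (57.9, 20),
    "YOLOv3-tiny": (33.1, 220),
    "YOLOv3-spp": (60.6, 20),
}


def best_model_for_fps(variants, min_fps):
    """Return the name of the highest-mAP variant that runs at min_fps or faster."""
    candidates = {name: stats for name, stats in variants.items() if stats[1] >= min_fps}
    if not candidates:
        return None  # no variant is fast enough
    return max(candidates, key=lambda name: candidates[name][0])


choice_30fps = best_model_for_fps(YOLOV3_VARIANTS, 30)
choice_100fps = best_model_for_fps(YOLOV3_VARIANTS, 100)
choice_300fps = best_model_for_fps(YOLOV3_VARIANTS, 300)
```

With a 30 FPS budget the 416 variant wins, while a 100 FPS budget leaves only the tiny model. Keep in mind that the FPS numbers depend on the hardware they were measured on.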

How It Works

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring regions of the image are considered detections.

We use a totally different approach. We apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

Our model has several advantages over classifier-based systems. It looks at the whole image at test time so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation unlike systems like R-CNN which require thousands for a single image. This makes it extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on the full system.
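The two ideas above, dividing the image into regions and weighting boxes by predicted probabilities, can be sketched in a few lines. This is an illustration under stated assumptions, not the Darknet implementation: the 13 x 13 grid, the function names, and all the numbers are made up, and the weighting shown is the common choice of multiplying a box's objectness confidence by its per-class probabilities.

```python
def grid_cell(x, y, image_w, image_h, grid_size):
    """Return the (row, col) of the grid cell that contains pixel (x, y)."""
    col = min(int(x / image_w * grid_size), grid_size - 1)
    row = min(int(y / image_h * grid_size), grid_size - 1)
    return row, col


def class_scores(objectness, class_probs):
    """Weight each class probability by the box's objectness confidence."""
    return {label: objectness * p for label, p in class_probs.items()}


def best_class(objectness, class_probs):
    """Return the (label, score) pair with the highest weighted score."""
    scores = class_scores(objectness, class_probs)
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical object centered at pixel (208, 104) in a 416 x 416 image,
# with an assumed 13 x 13 grid.
cell = grid_cell(208, 104, 416, 416, 13)

# Hypothetical prediction for one box in that cell.
label, score = best_class(0.9, {"dog": 0.8, "bicycle": 0.15, "truck": 0.05})
```

Everything here comes out of one pass over the image, which is why a single network evaluation is enough.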

What’s New in Version 3?

YOLOv3 uses a few tricks to improve training and increase performance, including: multi-scale predictions, a better backbone classifier, and more. The full details are in our paper!

Detection Using A Pre-Trained Model

This post will guide you through detecting objects with the YOLO system using a pre-trained model. If you don’t already have Darknet installed, you should do that first. Or instead of reading all that just run:

git clone
cd darknet


You already have the config file for YOLO in the cfg/ subdirectory. You will have to download the pre-trained weight file here (237 MB). Or just run this:


Then run the detector!

./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg

You will see some output like this:

layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  32  0.299 BFLOPs
    1 conv     64  3 x 3 / 2   416 x 416 x  32   ->   208 x 208 x  64  1.595 BFLOPs
  105 conv    255  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 255  0.353 BFLOPs
  106 detection
truth_thresh: Using default '1.000000'
Loading weights from yolov3.weights...Done!
data/dog.jpg: Predicted in 0.029329 seconds.
dog: 99%
truck: 93%
bicycle: 99%

Darknet prints out the objects it detected, its confidence, and how long it took to find them. We didn’t compile Darknet with OpenCV so it can’t display the detections directly. Instead, it saves them in predictions.png. You can open it to see the detected objects. Since we are using Darknet on the CPU it takes around 6-12 seconds per image. If we use the GPU version it would be much faster.
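If you want to use these results in a script rather than read them off the terminal, the printed lines are simple enough to parse. The sketch below assumes the output looks exactly like the sample above (a "Predicted in ... seconds." line followed by "label: NN%" lines). The function name parse_darknet_output and the regular expressions are my own, not something Darknet provides.

```python
import re

# Sample lines copied from the output shown above.
SAMPLE_OUTPUT = """\
data/dog.jpg: Predicted in 0.029329 seconds.
dog: 99%
truck: 93%
bicycle: 99%
"""

TIMING_RE = re.compile(r"^(?P<image>.+): Predicted in (?P<seconds>[\d.]+) seconds\.$")
DETECTION_RE = re.compile(r"^(?P<label>.+): (?P<percent>\d+)%$")


def parse_darknet_output(text):
    """Extract the inference time and (label, confidence) pairs from the printed output."""
    seconds = None
    detections = []
    for line in text.splitlines():
        timing = TIMING_RE.match(line)
        if timing:
            seconds = float(timing.group("seconds"))
            continue
        detection = DETECTION_RE.match(line)
        if detection:
            confidence = int(detection.group("percent")) / 100
            detections.append((detection.group("label"), confidence))
    return seconds, detections


seconds, detections = parse_darknet_output(SAMPLE_OUTPUT)
fps = 1.0 / seconds  # throughput implied by the per-image time
```

The per-image time in the sample works out to roughly 34 frames per second.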

I’ve included some example images to try in case you need inspiration. Try data/eagle.jpg, data/dog.jpg, data/person.jpg, or data/horses.jpg!

The detect command is shorthand for a more general version of the command. It is equivalent to the command:

./darknet detector test cfg/ cfg/yolov3.cfg yolov3.weights data/dog.jpg

You don’t need to know this if all you want to do is run detection on one image but it’s useful to know if you want to do other things like run on a webcam (which you will see later on).

Multiple Images

Instead of supplying an image on the command line, you can leave it blank to try multiple images in a row. You will then see a prompt once the config and weights have finished loading:

./darknet detect cfg/yolov3.cfg yolov3.weights
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  32  0.299 BFLOPs
    1 conv     64  3 x 3 / 2   416 x 416 x  32   ->   208 x 208 x  64  1.595 BFLOPs
  104 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256  1.595 BFLOPs
  105 conv    255  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 255  0.353 BFLOPs
  106 detection
Loading weights from yolov3.weights...Done!
Enter Image Path:

Enter an image path like data/horses.jpg to have it predict boxes for that image.

Once it is done it will prompt you for more paths to try different images. Use Ctrl-C to exit the program once you are done.

Changing The Detection Threshold

By default, YOLO only displays objects detected with a confidence of .25 or higher. You can change this by passing the -thresh <val> flag to the yolo command. For example, to display all detections you can set the threshold to 0:

./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg -thresh 0

This produces an output image with every candidate detection drawn on it.


So that’s obviously not super useful but you can set it to different values to control what gets thresholded by the model.
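Conceptually, the threshold is just a filter on the confidence of each detection. Here is a minimal sketch of that idea. The function name and the detections are hypothetical, and the comparison follows the ".25 or higher" wording above rather than Darknet's actual source.

```python
def filter_by_threshold(detections, thresh=0.25):
    """Keep only detections whose confidence is at least thresh."""
    return [(label, conf) for label, conf in detections if conf >= thresh]


# Hypothetical detections, including two low-confidence ones.
raw = [("dog", 0.99), ("truck", 0.93), ("bicycle", 0.99), ("cat", 0.10), ("person", 0.02)]

default_view = filter_by_threshold(raw)              # default threshold of .25
everything = filter_by_threshold(raw, thresh=0)      # like -thresh 0: nothing is dropped
strict_view = filter_by_threshold(raw, thresh=0.95)  # only very confident detections
```

Raising the threshold trades missed objects for fewer false positives, so the right value depends on which mistake costs more in your application.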

Tiny YOLOv3

We have a very small model as well for constrained environments, yolov3-tiny. To use this model, first download the weights:


Then run the detector with the tiny config file and weights:

./darknet detect cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg

Real-Time Detection on a Webcam

Running YOLO on test data isn’t very interesting if you can’t see the result. Instead of running it on a bunch of images let’s run it on the input from a webcam!

To run this demo you will need to compile Darknet with CUDA and OpenCV. Then run the command:

./darknet detector demo cfg/ cfg/yolov3.cfg yolov3.weights

YOLO will display the current FPS and predicted classes as well as the image with bounding boxes drawn on top of it.

You will need a webcam connected to the computer that OpenCV can connect to or it won’t work. If you have multiple webcams connected and want to select which one to use you can pass the flag -c <num> to pick (OpenCV uses webcam 0 by default).

You can also run it on a video file if OpenCV can read the video:

./darknet detector demo cfg/ cfg/yolov3.cfg yolov3.weights <video file>

That’s how we made the YouTube video above.

Training YOLO on VOC

You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here’s how to get it working on the Pascal VOC dataset.

Get The Pascal VOC Data

To train YOLO you will need all of the VOC data from 2007 to 2012. You can find links to the data here. To get all the data, make a directory to store it all and from that directory run:

tar xf VOCtrainval_11-May-2012.tar
tar xf VOCtrainval_06-Nov-2007.tar
tar xf VOCtest_06-Nov-2007.tar

There will now be a VOCdevkit/ subdirectory with all the VOC training data in it.

Generate Labels for VOC

Now we need to generate the label files that Darknet uses. Darknet wants a .txt file for each image with a line for each ground truth object in the image that looks like:

<object-class> <x> <y> <width> <height>

Where x, y, width, and height are relative to the image’s width and height. To generate these files we will run the script in Darknet’s scripts/ directory. Let’s just download it again because we are lazy.
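To make the label format concrete, here is a rough sketch of the conversion from a pixel-space box to one label line. It assumes that x and y denote the box center, which is the usual Darknet convention, and that the source annotation gives corner coordinates in pixels. The function names, the class id, and the example box are hypothetical, and this is not the script from scripts/.

```python
def voc_box_to_darknet(image_w, image_h, x_min, y_min, x_max, y_max):
    """Convert a pixel-space corner box to normalized (x, y, width, height)."""
    x_center = (x_min + x_max) / 2.0 / image_w
    y_center = (y_min + y_max) / 2.0 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    return x_center, y_center, width, height


def format_label_line(class_id, box):
    """Render one '<object-class> <x> <y> <width> <height>' line."""
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in box)


# Hypothetical object: class 11 in a 500 x 375 image.
box = voc_box_to_darknet(500, 375, 100, 75, 300, 225)
line = format_label_line(11, box)
```

Because every value is divided by the image size, the same label stays valid if the image is resized.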


After a few minutes, this script will generate all of the requisite files. Mostly it generates a lot of label files in VOCdevkit/VOC2007/labels/ and VOCdevkit/VOC2012/labels/. In your directory you should see:

2007_test.txt   VOCdevkit
2007_val.txt    VOCtest_06-Nov-2007.tar
2012_train.txt  VOCtrainval_06-Nov-2007.tar
2012_val.txt    VOCtrainval_11-May-2012.tar

The text files like 2007_train.txt list the image files for that year and image set. Darknet needs one text file with all of the images you want to train on. In this example, let’s train with everything except the 2007 test set so that we can test our model. Run:

cat 2007_train.txt 2007_val.txt 2012_*.txt > train.txt

Now we have all the 2007 trainval and the 2012 trainval set in one big list. That’s all we have to do for data setup!

Modify Cfg for Pascal Data

Now go to your Darknet directory. We have to change the cfg/ config file to point to your data:

classes= 20
train  = <path-to-voc>/train.txt
valid  = <path-to-voc>/2007_test.txt
names = data/voc.names
backup = backup

You should replace <path-to-voc> with the directory where you put the VOC data.
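The data config above is a plain list of key = value pairs, so it is easy to read from your own tooling, for example to check the paths before starting a long training run. Here is a minimal sketch. The parser is my own rather than Darknet's, and /data/voc is a made-up stand-in for <path-to-voc>.

```python
def parse_data_file(text):
    """Parse 'key = value' lines into a dict, skipping blank lines and # comments."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        options[key.strip()] = value.strip()
    return options


# Contents mirroring the config above, with a hypothetical dataset path.
SAMPLE = """
classes= 20
train  = /data/voc/train.txt
valid  = /data/voc/2007_test.txt
names = data/voc.names
backup = backup
"""

options = parse_data_file(SAMPLE)
num_classes = int(options["classes"])
```

From here you could, for instance, verify that the files named by options["train"] and options["valid"] exist before launching Darknet.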

Download Pretrained Convolutional Weights

For training we use convolutional weights that are pre-trained on Imagenet. We use weights from the darknet53 model. You can just download the weights for the convolutional layers here (76 MB).


Train The Model

Now we can train! Run the command:

./darknet detector train cfg/ cfg/yolov3-voc.cfg darknet53.conv.74

Training YOLO on COCO

You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here’s how to get it working on the COCO dataset.

Get The COCO Data

To train YOLO you will need all of the COCO data and labels. The script scripts/ will do this for you. Figure out where you want to put the COCO data and download it, for example:

cp scripts/ data
cd data

Now you should have all the data and the labels generated for Darknet.

Modify cfg for COCO

Now go to your Darknet directory. We have to change the cfg/ config file to point to your data:

classes= 80
train  = <path-to-coco>/trainvalno5k.txt
valid  = <path-to-coco>/5k.txt
names = data/coco.names
backup = backup

You should replace <path-to-coco> with the directory where you put the COCO data.

You should also modify your model cfg for training instead of testing. cfg/yolo.cfg should look like this:

# Testing
# batch=1
# subdivisions=1
# Training

Train The Model

Now we can train! Run the command:

./darknet detector train cfg/ cfg/yolov3.cfg darknet53.conv.74

If you want to use multiple GPUs, run:

./darknet detector train cfg/ cfg/yolov3.cfg darknet53.conv.74 -gpus 0,1,2,3

If you want to stop and restart training from a checkpoint:

./darknet detector train cfg/ cfg/yolov3.cfg backup/yolov3.backup -gpus 0,1,2,3

YOLOv3 on the Open Images dataset


./darknet detector test cfg/ cfg/yolov3-openimages.cfg yolov3-openimages.weights

What Happened to the Old YOLO Site?

If you are using YOLO version 2 you can still find the site here:


If you use YOLOv3 in your work please cite our paper!

@article{yolov3,
  title={YOLOv3: An Incremental Improvement},
  author={Redmon, Joseph and Farhadi, Ali},
  journal = {arXiv},
  year={2018}
}
