Grasping Computer Vision Fundamentals Using Python


Computer vision is a branch of artificial intelligence (AI) that empowers systems to interpret and identify objects, people, and scenes within images or videos. For instance, it can analyze visual data to detect human figures in photographs. Let’s embark on a hands-on project to test this capability: we’ll supply the system with a personal photograph to evaluate its recognition accuracy. Before diving in, let’s clarify two foundational concepts critical to this process.

Key Terminology

  1. Bounding Box:
    A rectangular frame that outlines detected objects or individuals in visual data. For example, when identifying people in a photo, the system encircles each detected individual with a bounding box. You’ll see this visually demonstrated later.

  2. Intersection over Union (IoU):
    A metric that evaluates how precisely a predicted bounding box aligns with the ground truth—the actual, labeled location of the object. IoU scores range from 0 (no overlap) to 1 (perfect alignment). It is calculated by dividing the overlapping area of the predicted and ground-truth boxes by their combined area:

IoU = Area of Overlap / Area of Union
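To make the formula concrete, here is a minimal Python sketch (separate from the main project) that computes IoU for two boxes given in (x1, y1, x2, y2) corner format, the same format YOLO uses later in this tutorial:

def iou(box_a, box_b):
    # Each box is (x1, y1, x2, y2), with (x1, y1) the top-left corner
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    # Clamp to zero so non-overlapping boxes give an empty intersection
    inter = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.14, a weak overlap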

Prerequisites
Before you begin, ensure you have the following:

  1. Python installed: This article uses Python 3.12.4; any later version should also work. You can check your Python version by running the command:

python --version

If you encounter an error, ensure Python is installed correctly. You can download Python from the official website.

  2. Text editor: This tutorial uses Visual Studio Code (VS Code) as the text editor. You can download VS Code here. However, feel free to use any text editor you prefer.

Before diving into our project, it’s essential to set up a clean working environment. Here’s how to do it step by step:

Create a project folder: First, choose a location for your project folder. For this tutorial, we will create it on the desktop.

On macOS:

  • Navigate to your desktop.
  • Create a new folder named, for example, “facial-recognition.”
  • Open Terminal and navigate into the folder, for example with cd ~/Desktop/facial-recognition.

On Windows:

  • Navigate to your desktop.
  • Create a new folder, for example, “Computer-Vision.”
  • Right-click the folder and select “Open in Terminal” or “Open PowerShell window here.”

Create and activate a virtual environment: This helps keep project dependencies isolated from the global Python installation.

Create a virtual environment:
In your terminal, run the following command to create a virtual environment named venv inside the project folder:

python -m venv venv

Activate the virtual environment:
To activate the virtual environment, use the following commands based on your operating system:

source venv/bin/activate    # activate the virtual environment on macOS/Linux

.\venv\Scripts\activate     # activate the virtual environment on Windows


Figure 1: Illustration of an activated virtual environment

Virtual Environment Setup
As illustrated in the screenshot above, I’m creating a Python 3 virtual environment to isolate dependencies. This step is essential because my system has multiple Python versions installed, and specifying python3 ensures compatibility.
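If your machine is in the same situation, the only change is to invoke the interpreter explicitly when creating the environment (assuming python3 points to a Python 3.x install on your PATH):

python3 -m venv venv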

Project Structure
With the environment ready, we’ll:

  • Create a main.py file for our code.
  • Add an images folder in the project directory to store the reference photo (used later to identify me via the webcam).
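Once that is done, the project layout should look roughly like this:

Computer-Vision/
├── venv/        # virtual environment (created above)
├── images/      # reference photo(s)
└── main.py      # detection script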

Installation of Critical Dependencies

We'll now install three critical packages that work synergistically to streamline computer vision development:

  • ultralytics - Provides state-of-the-art YOLO (You Only Look Once) models for real-time object detection.
  • opencv-python (OpenCV) - Offers optimized low-level computer vision operations for image/video processing.
  • cvzone - Simplifies OpenCV workflows with prebuilt utilities for annotations and GUI elements.

pip install ultralytics opencv-python cvzone
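Once the installation finishes, a quick sanity check (optional) confirms all three packages import cleanly:

python -c "import cv2, cvzone, ultralytics; print(cv2.__version__)"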

The script we are about to write demonstrates:

  • YOLOv8 for object detection
  • OpenCV for camera/webcam handling
  • CVZone for simplified annotations
  • Math for coordinate/confidence calculations

Import required libraries

import math # For mathematical operations (e.g., confidence rounding)
from ultralytics import YOLO # YOLOv8 object detection model
import cv2 # OpenCV for camera handling and image processing
import cvzone # CVZone for simplified annotations and UI elements


# Initialize webcam capture with HD resolution (1280x720)
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)   # Set frame width
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)   # Set frame height

# Load YOLOv8 nano model (pretrained on COCO dataset).
# Ultralytics downloads yolov8n.pt automatically on first run if it is not found locally.
model = YOLO("yolov8n.pt")

# COCO dataset class names (80 total classes for YOLOv8)
classNames = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck", "boat",
              "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
              "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
              "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat",
              "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
              "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli",
              "carrot", "hot dog", "pizza", "donut", "cake", "chair", "sofa", "pottedplant", "bed",
              "diningtable", "toilet", "tvmonitor", "laptop", "mouse", "remote", "keyboard", "cell phone",
              "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors",
              "teddy bear", "hair drier", "toothbrush"
              ]

# Main detection loop
while True:
    success, img = cap.read()  # Capture frame from webcam
    if not success:
        break  # Stop if no frame could be read from the camera

    # Run YOLO inference on the captured frame
    result = model(img, stream=True)  # 'stream=True' for generator output

    # Process detection results
    for r in result:
        boxes = r.boxes  # Get bounding boxes from results
        for box in boxes:
            # Extract and convert bounding box coordinates to integers
            x1, y1, x2, y2 = box.xyxy[0]
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)

            # Calculate width and height for CVZone's corner rectangle
            w, h = x2 - x1, y2 - y1
            cvzone.cornerRect(img, (x1, y1, w, h))  # Draw enhanced bounding box

            # Calculate and print confidence score (rounded to 2 decimals)
            conf = math.ceil((box.conf[0] * 100)) / 100
            print(conf)  # Output confidence to console for debugging

            # Get class ID and display class name with confidence
            cls = int(box.cls[0])
            cvzone.putTextRect(img, f'{classNames[cls]} {conf}',
                               (max(0, x1), max(35, y1)),  # Prevent text going off-screen
                               scale=0.7, thickness=1)

    # Display processed frame in window
    cv2.imshow("Image", img)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # Press 'q' to quit
        break

# Release the camera and close the display window
cap.release()
cv2.destroyAllWindows()

Copy and paste the code above into your main.py file. Now, let’s break down the code to understand what’s happening.

The script begins by importing the essential libraries: math for rounding confidence scores, cv2 (OpenCV) for camera and image operations, cvzone for simplified annotations, and YOLO from Ultralytics to load the YOLOv8 model. The webcam is initialized at a resolution of 1280x720 pixels using cv2.VideoCapture(0), providing high-definition input for better detection accuracy.

The YOLOv8 nano model (yolov8n.pt), pretrained on the COCO dataset, is loaded to detect 80 common object classes (e.g., "person," "car," "laptop"). These class names are stored in the classNames list for later labeling.
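Note that the model itself returns only a numeric class ID per detection; the classNames list is what maps that ID back to a human-readable label. A quick illustration:

cls = 0
print(classNames[cls])  # 'person' is index 0 in the COCO list above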

In the main loop, the script continuously captures frames from the webcam. Each frame is passed to the YOLO model with stream=True, which returns a generator and enables efficient processing of video streams. For each detected object, the bounding box coordinates are extracted in the xyxy format (top-left and bottom-right corners) and converted to integers. The width and height of the box are calculated so that cvzone.cornerRect() can draw a corner-highlighted rectangle around the object, enhancing visual clarity.

The detection confidence score is rounded up to two decimal places using math.ceil() (strictly a ceiling rather than a true round) and printed to the console for debugging. The class name and confidence are displayed above the bounding box with cvzone.putTextRect(); the max() calls clamp the text coordinates so the label cannot drift off the top or left edge of the frame.
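To see what that confidence step does with a concrete (hypothetical) value:

import math

raw_conf = 0.8763                       # hypothetical raw model confidence
conf = math.ceil(raw_conf * 100) / 100  # ceil(87.63) / 100 -> 0.88
print(conf)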

Processed frames are displayed in a window titled "Image," with a 1ms delay between iterations to keep the interface responsive. The loop runs until the user presses the q key, after which the camera is released and the window is closed. This pipeline demonstrates real-time object detection with minimal code, leveraging pretrained models and helper libraries for streamlined development.

In this live implementation, you’ll observe the system detecting objects from the COCO dataset (e.g., "person," "laptop," "cell phone") in real time. Detected objects are highlighted with bounding boxes labeled with their class name and confidence score (e.g., "person 0.92").

Figure 2: Illustration of the YOLO model detecting a person (92% confidence) and a banana (86% confidence).

Figure 3: Illustration of the YOLO model detecting multiple people, a book, and a cup with different confidence levels.

Figure 4: Illustration of the YOLO model detecting a person (79% confidence) and a bottle (93% confidence).

Figure 5: Illustration of the YOLO model detecting a person (90% confidence) and a cell phone (46% confidence).

Computer vision stands at the forefront of technological innovation, transforming raw visual data into actionable intelligence across industries. Its applications are redefining boundaries, from enabling life-saving medical diagnostics through precise tumor detection to empowering sustainable agriculture with crop-health monitoring systems. In security, facial recognition systems now authenticate identities with near-human accuracy, while autonomous vehicles leverage real-time object detection to navigate complex environments safely.

The field’s rapid evolution is driven by advancements in edge computing, which brings real-time analysis to devices like drones and IoT sensors, and ethical AI frameworks that address privacy concerns in public surveillance. As generative models and 3D vision push the limits of what machines can "see," industries from retail to disaster response are discovering unprecedented efficiencies.

To aspiring innovators: dive into open-source frameworks like OpenCV or PyTorch, experiment with custom object detection models, or contribute to projects tackling bias mitigation in training datasets. Computer vision isn’t just a tool; it’s a bridge between the physical and digital worlds, inviting collaborative solutions to global challenges. The next frontier? Systems that don’t just interpret visuals but contextualise them with human-like reasoning. Start building, and you’ll shape how tomorrow’s machines perceive reality.

Contact me for tailored AI strategy workshops.

☕ Support My Efforts:

If you enjoy this guide, consider buying me a coffee to help me create more content like this!