\[ \newcommand{\lbrac}{\left(} \newcommand{\rbrac}{\right)} \]
Topics: C++ / OpenCV / CUDA / MPI / OpenMP / Image Stitching

Accelerating Feature Extraction and Image Stitching Algorithm Using Nvidia CUDA

“The images attempt to capture scientific thought. They represent the physical manifestation of the thought process. Everything in the laboratory is a product of a stream of conscious or unconscious thought.”
- Peter Fraser

Feature Extraction from Images and the ORB Feature Detector

Introduction

Imagine that you are in a region ravaged by natural disasters. The landscape can change, buildings can collapse, and outdated maps require immediate, detailed, and accurate updates. This problem can be addressed by using Unmanned Aerial Vehicles (UAVs) to capture images of the area and then stitching them together with existing computer-vision techniques. The real issue is how fast we can do this: accurate updates mean processing many high-quality images, and the underlying computations are expensive, which slows the process down. One option is multi-core, multi-threaded processing. But can we do better? The answer is yes, through the modern architecture of the Graphics Processing Unit (GPU).

Image stitching requires feature detection algorithms. The first widely used feature detector was the Harris Corner Detector, introduced in 1988 by Chris Harris and Mike Stephens. Later, in 1999, the Scale-Invariant Feature Transform (SIFT) algorithm by David G. Lowe revolutionized the field of computer vision by providing a reliable method for feature detection and description. It builds upon the Harris Corner Detector but adds several key innovations that improve robustness and functionality. SIFT is still widely used today, and most papers on new feature-detection methods, whether traditional or neural-network-based, use SIFT as their benchmark.

Although SIFT is reliable for such tasks, it is also very computationally expensive. Many methods have since been developed to address that issue, and in 2011, Oriented FAST and Rotated BRIEF (ORB) was introduced. Across multiple benchmark tests, ORB has been shown to deliver strong overall performance with fast computation, and to be robust to changes in illumination and rotation.

Capitol Building
The United States Capitol Building. Feature detection using the Oriented FAST and Rotated BRIEF (ORB) algorithm. The colored circular regions are called key points of the image.

Project's Objective

This project mainly focuses on images captured by Unmanned Aerial Vehicles (UAVs); each image patch is sent to a central computer with an Nvidia graphics card for image processing. The map-generation process works on pairs of images: the algorithm detects key points in each image and matches them from one image to the other. The existing work on this project is an algorithm that processes the image patch using multi-core CPUs, which are available on most modern computers.

The original author of the project, my mentor, also addressed the growing data overhead by employing distributed-memory and shared-memory architectures, that is, message passing and multi-threading, to reduce computation time. For this project, we will employ all of the techniques above, but we will also integrate the Graphics Processing Unit through the use of CUDA. The goal is to generate reliable maps and accelerate the image stitching algorithm even further by combining multi-core, multi-threaded, and GPU computation. Performance is a major emphasis of this project; therefore, C++ will be prioritized over Python.

30 images
Patch of 30 images captured by the UAV
Final Map
Final Map Obtained from Stitching Process

Literature Review and Technologies

The SIFT and ORB Feature Detectors

Scale-Invariant Feature Transform. SIFT is one of the most well-known and reliable feature detectors in computer vision for extracting invariant features from an image. It primarily works on the Difference of Gaussians (D.o.G), defined as

\begin{equation} D(\mathbf{x},\sigma) = \frac{1}{2\pi\sigma}\left[\frac{1}{k}\exp\lbrac-\frac{\lVert\mathbf{x}\rVert^2}{2(k\sigma)^2}\rbrac - \exp\lbrac-\frac{\lVert\mathbf{x}\rVert^2}{2\sigma^2}\rbrac \right]. \end{equation}

Since

\begin{equation} D(\mathbf{0},\sigma) = \frac{1}{2\pi k \sigma} < \frac{1}{\sqrt{2}}D_{max}\quad\text{and}\quad D(\mathbf{\infty},\sigma)=0, \end{equation}

the D.o.G is a bandpass filter: it passes only a selected band of frequencies. One characteristic of this filter is that it keeps responses greater than $\frac{1}{\sqrt{2}}D_{max}$ and suppresses low frequencies. The frequencies it passes therefore tend to correspond to areas of high contrast in the image, which are usually edges (in computer vision, an edge is an abrupt change in intensity, a discontinuity in image brightness or contrast).
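
To make the D.o.G concrete, here is a minimal sketch of computing it as the difference of two Gaussian-blurred images with OpenCV; the file name, $\sigma$, and $k$ are illustrative assumptions, not values from the project:

    // Minimal sketch: Difference-of-Gaussian response with OpenCV.
    // "frame.png", sigma, and k are illustrative assumptions.
    #include <cmath>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>

    int main()
    {
        cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
        img.convertTo(img, CV_32F, 1.0 / 255.0);

        const double sigma = 1.6;         // base scale
        const double k = std::sqrt(2.0);  // ratio between the two scales

        cv::Mat g1, g2;
        cv::GaussianBlur(img, g1, cv::Size(), sigma);      // G(x, sigma)
        cv::GaussianBlur(img, g2, cv::Size(), k * sigma);  // G(x, k*sigma)

        cv::Mat dog = g2 - g1;  // band-pass response, strongest near edges
        return 0;
    }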

The algorithm then uses a Gaussian pyramid, comparing each pixel with its 26 neighbors (8 in the same scale and 9 in each of the two adjacent scales) to locate the extrema of the image, and refines their positions with a Taylor-series expansion. SIFT then uses a straightforward approach for assigning orientations. For each image sample $L(x,y)$, we compute the gradient magnitude $M(x,y)$ and orientation angle $\theta(x,y)$ using differences of pixels:

\begin{equation} M(x,y) = \sqrt{(L(x+1, y) - L(x-1,y))^2 + (L(x,y+1) - L(x,y-1))^2} \end{equation}

and,

\begin{equation} \theta(x,y) = \tan^{-1}\lbrac\frac{L(x,y+1) - L(x,y-1)}{L(x+1, y) - L(x-1,y)}\rbrac. \end{equation}

The final output is a set of $n$-dimensional feature vectors (128-dimensional in Lowe's original formulation) whose elements are invariant feature descriptors. In 2018, Griwodz, Calvet, and Halvorsen released a library called PopSift, a CUDA implementation of the SIFT algorithm.
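
As a concrete reading of the magnitude and orientation formulas above, here is a small sketch that computes both from raw pixel differences (the row-major float buffer and the function names are my own, not from SIFT's or the project's code):

    // Sketch: gradient magnitude M and orientation theta via pixel
    // differences, following the SIFT formulas above. L is assumed to be
    // a row-major float image of width w; (x, y) must not lie on the border.
    #include <cmath>

    float gradientMagnitude(const float* L, int w, int x, int y)
    {
        float dx = L[y * w + (x + 1)] - L[y * w + (x - 1)];
        float dy = L[(y + 1) * w + x] - L[(y - 1) * w + x];
        return std::sqrt(dx * dx + dy * dy);
    }

    float gradientOrientation(const float* L, int w, int x, int y)
    {
        float dx = L[y * w + (x + 1)] - L[y * w + (x - 1)];
        float dy = L[(y + 1) * w + x] - L[(y - 1) * w + x];
        return std::atan2(dy, dx);  // atan2 handles dx == 0, unlike tan^{-1}
    }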

Oriented FAST and Rotated BRIEF. FAST, short for “Features from Accelerated Segment Test,” is a widely recognized corner detection algorithm introduced by Edward Rosten and Tom Drummond in their paper “Machine Learning for High-Speed Corner Detection.” FAST identifies corners by examining a circular region around a pixel to determine whether the pixel qualifies as a corner. The ORB (Oriented FAST and Rotated BRIEF) algorithm uses FAST-9, the variant whose segment test requires 9 contiguous pixels on the circle, chosen for its good balance of speed and repeatability. However, because the original FAST algorithm does not account for orientation, ORB enhances it by assigning an orientation to each detected keypoint (similar to SIFT), thereby making the algorithm rotation-invariant. This orientation is calculated using the intensity centroid method, which leverages the moments of the patch around the keypoint, defined as:

\begin{equation} m_{pq} = \sum_{x,y}x^py^qI(x,y) \end{equation}

These moments are used to determine the dominant orientation of the keypoint. The centroid can then be computed as

\begin{equation} C = \lbrac\frac{m_{10}}{m_{00}},\frac{m_{01}}{m_{00}}\rbrac, \end{equation}

and the keypoint's orientation is the angle of the vector from the patch center to this centroid, $\theta = \operatorname{atan2}(m_{01}, m_{10})$. ORB then builds on the BRIEF (Binary Robust Independent Elementary Features) algorithm: binary descriptors are generated by simple pixel-intensity comparisons, and rotation invariance is achieved by steering the comparison pattern to the keypoint's dominant orientation.
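
For reference, this is roughly what ORB detection and description look like with OpenCV's CPU API; the file name and the 2000-keypoint budget are placeholder choices:

    // Sketch: ORB keypoints and binary descriptors with OpenCV (CPU).
    #include <vector>
    #include <opencv2/features2d.hpp>
    #include <opencv2/imgcodecs.hpp>

    int main()
    {
        cv::Mat img = cv::imread("uav_frame.png", cv::IMREAD_GRAYSCALE);

        cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);  // keypoint budget
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat descriptors;  // one 32-byte binary row per keypoint
        orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
        return 0;
    }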

Overview of General GPU Architecture and Nvidia’s Compute Unified Device Architecture (CUDA)

Unlike a Central Processing Unit (CPU), which typically has only a few physical cores, a Graphics Processing Unit (GPU) can contain thousands or even tens of thousands of cores. Although GPU cores are designed for simple calculations and are less versatile than CPU cores, the GPU's key advantage lies in its ability to perform large-scale parallel computations. Take a look at an image, for instance:

Zooming into pixels
Zooming into an image. Image by Julie Waterhouse Photography

Images are composed of millions of pixels. When processing an image, the GPU should not handle pixels one by one, as doing so would leave most cores idle and waste resources. Instead, a key principle of GPU operation is parallelism: whichever core is free first takes on the next computation, so all cores are utilized efficiently. The per-pixel computations are independent of one another, and in the end we only need to combine their results to produce the final output image.

Let's say we have 100 tasks for a GPU to compute, with each task represented as a thread. GPUs are designed to handle thousands or even millions of threads simultaneously, but they don't assign one thread to each core directly. Instead, threads are organized into groups called "warps," typically consisting of 32 threads that execute the same instruction simultaneously. These warps are managed by units within the GPU called Streaming Multiprocessors (SMs).

If the number of tasks increases to 1 million, the GPU doesn't need 1 million cores to process them. Instead, it groups the threads into blocks, and these blocks are distributed among the SMs. Each SM contains multiple cores that work together to execute the warps of threads. When an SM processes a block, it schedules the warps dynamically, keeping all its cores busy and switching between warps to hide any delays, such as waiting for memory access. For example, with 1,000 cores organized into several SMs, each SM might handle many blocks of threads. Even if a single block contains 1,000 threads, the SM can manage them efficiently by processing them in warps and switching between them as needed. This hierarchical structure allows the GPU to handle a vast number of threads, ensuring that even with millions of tasks, the cores are utilized efficiently to maximize parallel computation.
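
To illustrate this hierarchy, here is a toy CUDA kernel (not from the project's code base) that assigns one thread per pixel and launches enough blocks to cover the whole image:

    // Toy CUDA kernel: one thread per pixel, for illustration only.
    __global__ void invertPixels(unsigned char* img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)  // the grid may overshoot the image
            img[y * width + x] = 255 - img[y * width + x];
    }

    void launchInvert(unsigned char* d_img, int width, int height)
    {
        // 16x16 = 256 threads per block, i.e., 8 warps of 32 threads.
        // The hardware distributes the blocks across SMs and switches
        // between warps to hide memory latency.
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        invertPixels<<<grid, block>>>(d_img, width, height);
    }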

Message Passing Interfaces (MPI)

What is MPI? MPI is a standardized interface for communication between processes running on one or more CPU cores. It defines the syntax and semantics of library routines that allow programs to communicate data by passing messages between processes. For this research project, we will be using the Microsoft Message Passing Interface (MS-MPI).
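
A minimal, MS-MPI-compatible sketch of message passing between two processes (the byte buffer is a stand-in for real image or descriptor data):

    // Minimal MPI sketch: rank 1 sends a byte buffer to rank 0.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<unsigned char> buf(1024);  // stand-in for image data
        if (rank == 1) {
            MPI_Send(buf.data(), (int)buf.size(), MPI_UNSIGNED_CHAR,
                     0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(buf.data(), (int)buf.size(), MPI_UNSIGNED_CHAR,
                     1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }

Launched with, for example, mpiexec -n 2, every process runs the same program and branches on its rank.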

Open Multi-Processing (OpenMP)

OpenMP is an Application Program Interface (API) that may be used to direct multi-threaded, shared-memory parallelism in C, C++, or Fortran programs. Both g++ and the MSVC compiler expose OpenMP through the header #include <omp.h>. To enable OpenMP in Microsoft Visual Studio, go to Project > Properties > C/C++ > Language > OpenMP Support and select Yes. For g++, you must add -fopenmp to the compilation of any module that uses OpenMP:

    g++ -c test.cpp -o test.o -fopenmp
    g++ test.o -o test -fopenmp -lpthread    
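
As a quick sanity check that OpenMP is enabled, a minimal parallel loop (think of each iteration as one image's processing):

    // Minimal OpenMP sketch: 64 iterations divided among the threads.
    #include <cstdio>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < 64; ++i) {
            std::printf("image %d handled by thread %d\n",
                        i, omp_get_thread_num());
        }
        return 0;
    }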

Implementations

Image Stitching Workflow

The code base of this project belongs to a private repository, so I will not post the full code here. It wouldn't make much sense anyway, because the whole process spans thousands of lines of code. Instead, I will summarize the image stitching process using the flow chart below:

Final Map
Image stitching workflow. Green marks the steps that can be further accelerated with CUDA, and blue marks the steps that can be further accelerated with OpenMP.

To further improve the speed of image processing, the author of the CPU implementation used a distributed-memory design to parallelize it. Distributed-memory designs transfer information between CPU cores through MPI, with one or more CPU cores forming a node that processes a pair of images. The more CPU cores available, the more images can be processed simultaneously in parallel, without later images having to wait for earlier images in the list to finish. For example, to stitch four images, one can use two CPU cores for parallel processing.

When two nodes finish processing, their results are sent via MPI to a new node, which stitches together the two partial results. This structure resembles a tree and has been shown to increase computation speed. However, note that even an unlimited amount of resources does not let the program speed up indefinitely; see Amdahl's law, which states that the potential speedup from parallelization is limited by the portion of the process that cannot be parallelized, and can be formalized by the equation

\begin{equation} S = \frac{1}{(1-P) + \frac{P}{N}}, \end{equation}

where $S$ is the speedup ratio, $P$ is the proportion of the task that can be parallelized, $N$ is the number of processors or cores, and $(1-P)$ is the proportion of the task that must be performed sequentially.
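
To make the bound concrete (the numbers here are illustrative, not measured): if 95% of the stitching pipeline parallelizes, $P = 0.95$, then with $N = 32$ cores

\begin{equation} S = \frac{1}{0.05 + \frac{0.95}{32}} \approx 12.5, \end{equation}

and even with unlimited cores the speedup can never exceed $\lim_{N\to\infty}S = \frac{1}{1-P} = 20$.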

Optimization

The main change from the original implementation is moving image feature extraction from the CPU to the GPU. We add a function, built on the original code, that detects image features on the GPU: the feature-detection finder is switched to cuda::ORB, and, following OpenCV's open-source CUDA modules, the CUDA ORB feature detector achieves the same purpose as its regular version. We call this new function

    void computeImageFeatures(Ptr<cuda::ORB> orb, const cuda::GpuMat& img, ImageFeatures& features)

to compute features for GPU-resident images. It is analogous to the detectAndCompute function used with the regular ORB feature detector. The key step in this function is the computation of descriptors on the GPU: the image is uploaded from the CPU to the GPU, and the computed descriptors are downloaded back to the CPU.
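
A minimal sketch of what such a function can look like, assuming the input GpuMat already holds a grayscale image and using OpenCV's cv::detail::ImageFeatures; the project's actual implementation may differ:

    // Sketch of a GPU analogue of cv::detail::computeImageFeatures.
    // Assumes img has already been uploaded to the GPU as a
    // single-channel (grayscale) image.
    #include <opencv2/core/cuda.hpp>
    #include <opencv2/cudafeatures2d.hpp>
    #include <opencv2/stitching/detail/matchers.hpp>

    using namespace cv;
    using namespace cv::detail;

    void computeImageFeatures(Ptr<cuda::ORB> orb, const cuda::GpuMat& img,
                              ImageFeatures& features)
    {
        features.img_size = Size(img.cols, img.rows);

        cuda::GpuMat descriptors_gpu;
        orb->detectAndCompute(img, noArray(), features.keypoints,
                              descriptors_gpu);

        // The CPU stitching pipeline expects host-side descriptors,
        // so download them from GPU memory.
        descriptors_gpu.download(features.descriptors);
    }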

Data Collection and Benchmarks

The data set for this project is a set of sixty-four 4K images, each with a resolution of 3840 $\times$ 2160 px. We will utilize CUDA to accelerate the processing of these high-resolution images.

CPU Implementation. We will evaluate the algorithm's performance under different conditions, with and without CUDA acceleration. Without CUDA acceleration, we will record the impact of the distributed-memory design (using MPI) and the shared-memory design (using OpenMP) on the algorithm's speed. Given our limited resources, we will test the processing speed using 1, 2, 4, 8, 16, and 32 CPU cores, as well as 1, 2, 4, 8, 16, 24, and 32 threads.

GPU Implementation. For the CUDA-accelerated version of the algorithm, the tests will concentrate exclusively on the distributed-memory design. This focus is due to the current limitations of the experiment, which prevent us from using the GPU implementation across multiple threads. In simple words, OpenMP is not available for our CUDA path yet; I am still working on this problem.

Performance Testing without CUDA Acceleration
  • Benchmark Test: Record the processing speed of 64 images without any optimization, which will serve as the initial benchmark for the experiment.
  • Single-Core Execution (Serial Processing): Test the processing speed of 64 images using a single CPU core.
  • Distributed Memory Design (Parallel Processing): Test the processing speed for 1, 2, 4, 8, 16, and 32 CPU cores.
  • Single-Core Execution and Shared Memory Design (Serial Threading): Test the processing speed for 1, 2, 4, 8, 16, 24, and 32 threads with 1 core.
  • Distributed Memory and Shared Memory Design (Parallel Processing and Multi-threading): Test the processing speed using combinations such as 2 cores 2 threads, 2 cores 4 threads, 4 cores 2 threads, 4 cores 4 threads, etc.
Performance Testing with CUDA Acceleration
  • Benchmark Test: Record the processing speed of 64 images under the optimized design of 1 core and 1 thread with CUDA; this serves as the initial benchmark for the second part of the experiment.
  • Single-Core Execution (Serial Processing): Test the processing speed of 64 images using a single CPU core with GPU acceleration.
  • Distributed Memory Design (Parallel Processing and Serial-threading): Test the processing speed using 1, 2, 4, 8, 16, and 24 CPU cores with GPU acceleration.

Experimental Setup: The data collection process will be conducted on two different computers:

                    First Setup              Second Setup
    CPU             Intel Core i7-8750H      Intel Core i9-13900KF
    GPU             NVIDIA GTX 1070 Max-Q    NVIDIA RTX 4060 Ti
    OpenCV Version  OpenCV 4.10.0            OpenCV 4.10.0
    RAM             16 GB                    32 GB
    GPU Memory      8 GB                     16 GB

There will be some differences in the data obtained from the two setups. The CPU used in setup 1 has only 6 physical cores, so the data obtained with setup 1 covers only 1, 2, 4, and at most 8 cores (beyond 6 cores, the processes are oversubscribed, adding overhead). We also use different amounts of RAM to compare whether the algorithm's results are affected by the available memory.

Results

Performance Metrics

The serial and parallel runtime of a program are defined as follows:
  1. The Serial runtime of a program is the time between the beginning and the end of its execution on a sequential processor. We denote this by $T_S$.
  2. Parallel runtime is the time from the moment the first processor begins its execution to the moment the last processor ends its execution. We denote this by $T_P$.
Speedup is defined as the ratio of the serial runtime to the parallel runtime,
\begin{equation} S=\frac{T_S}{T_P} \end{equation}
The speedup of a parallel algorithm measures how much faster an algorithm is than its sequential counterpart.
The cost of processing a program on a parallel system is defined as the product of run time and the number of processors,
\begin{equation} C = T_P\cdot p, \end{equation}
where $p$ denotes the number of processors.
The efficiency of a parallel program that uses $p$ processors is defined by
\begin{equation} E = \frac{T_S}{p\cdot T_P}=\frac{S}{p} \end{equation}
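
For example (illustrative numbers, not measurements): if stitching a patch takes $T_S = 120$ s serially and $T_P = 20$ s on $p = 8$ cores, then $S = 6$, $C = 160$ core-seconds, and $E = S/p = 0.75$.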

Runtime Comparison

As mentioned, I will compare the runtimes of the setup with and without multi-threading.

Final Map
Without OpenMP for CPU vs GPU (1 thread)
Final Map
With OpenMP for CPU vs GPU (1 thread)

I initially ran our code on Setup 1. The original goal of this research is to stitch 64 4K images, but due to the limited resources of the first setup, I had to downscale the images by 75% from their original quality, to 960 $\times$ 540 px.

The GPU implementation significantly outperformed the original CPU-based image stitching, reducing the runtime by a factor of three to four. However, once the number of cores used exceeded the number of available physical cores, the application began to slow down. All components of the system were operating at their limits: the CPU was running at 100%, the RAM was fully utilized at 16GB, and the GPU memory was maxed out at 8GB. This is likely due to the context switching the system had to perform when it ran out of physical cores.

One interesting observation from the first setup is that the GPU implementation was able to run with up to 8 CPU cores, while the original implementation could only run with up to 4. One possible reason is that the GPU shared the workload with the CPU while performing the algorithm, which put less strain on the CPU and pushed the usable number of cores higher.

Speedup

Even though the runtime decreased massively with CUDA, the speedup ratio does not stand out against the CPU-with-OpenMP dataset. It did, however, achieve speedup at every core count against the CPU with 1 thread, which is reasonable since our GPU setup also uses only one CPU thread (no OpenMP available for the GPU version). One possible reason the GPU does not speed up as much is the uploading and downloading of images to and from the GPU, which costs considerable resources due to the high quality of the image patch.

Speedup comparison

Cost

In parallel computing, resource inefficiency occurs when parts of a process must wait for others to complete, reducing overall performance. As observed, the cost of the GPU implementation is much lower than that of every previous CPU implementation. This is no surprise: GPUs, optimized for massively parallel tasks, often provide significantly better performance for such workloads than traditional CPU implementations, resulting in lower operational costs for these specific workloads.

Cost comparison

Efficiency

For efficiency, the CUDA-enhanced algorithm is less efficient than the CPU-with-1-thread setup, but it is overall more efficient than the other CPU setups as the number of cores increases. The efficiency of the GPU implementation eventually matches that of the CPU-with-1-thread implementation at $p = 32$.

Efficiency comparison

Ending Remarks and Future Work

One significant drawback of the CUDA-accelerated algorithm is that it is memory-costly. However, if the user lacks a high-quality camera or needs to create a map quickly without fine details, the image patch can be downscaled to meet those requirements. This approach was validated on the first setup, where the 4K patch downscaled by 75% worked perfectly. It only requires modifying a small section of the code, but that is not the main discussion of this article. Another issue with the 4K patch is that parts of the map involving rotations could not be stitched. After several hours of careful inspection, I concluded that this is not a problem with the code but a memory issue, since the same patch at downscaled quality stitched successfully without any problem.

Integrating a shared-memory, multi-threaded design with the GPU is still an open problem, and I plan to address it in future work to accelerate the algorithm even further. The ultimate goal of this project is to process 64 frames of 4K images in one second. Another objective is to revisit memory management in the code base so that the program does not crash when the input is too large, and to produce more consistent and reliable results.
