Why Deep Learning Models Run Faster on GPUs: A Brief Introduction to CUDA Programming

For those who want to understand what .to("cuda") does

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

Nowadays, when we talk about deep learning, it is very common to associate its implementation with utilizing GPUs in order to improve performance.

GPUs (Graphics Processing Units) were originally designed to accelerate the rendering of 2D and 3D graphics. However, because they are capable of performing many operations in parallel, their usefulness extends beyond that to applications such as deep learning.

The use of GPUs for deep learning models started around the mid-to-late 2000s and became very popular around 2012 with the emergence of AlexNet. AlexNet, a convolutional neural network designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. This victory marked a milestone, as it demonstrated the effectiveness of deep neural networks for image classification and the use of GPUs for training large models.

Following this breakthrough, the use of GPUs for deep learning models became increasingly popular, which contributed to the creation of frameworks like PyTorch and TensorFlow.

Nowadays, we just write .to("cuda") in PyTorch to send data to the GPU and expect training to be accelerated. But how do deep learning algorithms take advantage of the GPU's computing power in practice? Let's find out!

Deep learning architectures like neural networks, CNNs, RNNs and transformers are basically constructed from mathematical operations such as matrix addition, matrix multiplication and applying a function to a matrix. Thus, if we find a way to optimize these operations, we can improve the performance of deep learning models.

So, let’s start simple. Imagine you want to add two vectors C = A + B.

A simple implementation of this in C would be:

void AddTwoVectors(float A[], float B[], float C[], int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

As you can notice, the computer must iterate over the vectors, adding each pair of elements sequentially, one iteration at a time. But these operations are independent of each other: the addition of the i-th pair of elements does not rely on any other pair. So, what if we could execute these operations concurrently, adding all of the pairs of elements in parallel?

A straightforward approach would be to use CPU multithreading to run parts of the computation in parallel. However, when it comes to deep learning models, we are dealing with massive vectors, with millions of elements, and a common CPU can only handle around a dozen threads simultaneously. That's when GPUs come into action! Modern GPUs are built to run thousands of threads concurrently and to schedule millions of them efficiently, greatly enhancing the performance of these mathematical operations on massive vectors.
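
To build intuition, here is a minimal sketch of what that CPU multithreading approach could look like, using C++ std::thread (this helper is illustrative and not part of the article's examples): the vector is split into chunks and each chunk is added by a separate CPU thread.

#include <algorithm>
#include <thread>
#include <vector>

// Illustrative sketch: split the vectors into chunks and add each chunk
// on a separate CPU thread.
void AddTwoVectorsThreaded(const float A[], const float B[], float C[],
                           int N, int num_threads) {
    std::vector<std::thread> workers;
    int chunk = (N + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; t++) {
        int start = t * chunk;
        int end = std::min(N, start + chunk);
        workers.emplace_back([=]() {
            for (int i = start; i < end; i++)
                C[i] = A[i] + B[i];
        });
    }
    for (auto &w : workers) w.join();
}

Even with all cores busy, a typical CPU only gives us a handful of such workers, and that is exactly the limitation the GPU removes.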

GPU vs. CPU comparison

Although CPU computations can be faster than GPU computations for a single operation, the advantage of GPUs lies in their parallelization capabilities. The reason for this is that they are designed with different goals: while the CPU is designed to execute a sequence of operations (a thread) as fast as possible (and can only execute a few dozen of them simultaneously), the GPU is designed to execute thousands of them in parallel (while sacrificing the speed of individual threads).

To illustrate, imagine that the CPU is a Ferrari and the GPU is a bus. If your task is to move one person, the Ferrari (CPU) is the better choice. However, if you are moving several people, even though the Ferrari (CPU) is faster per trip, the bus (GPU) can transport everyone in one go, moving all of them at once faster than the Ferrari can cover the route several times. So CPUs are better suited for sequential operations and GPUs for parallel operations.

Image by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

In order to provide these parallel capabilities, GPU designs devote more transistors to data processing than to data caching and flow control. CPUs, in contrast, dedicate a significant portion of their transistors to caching and flow control in order to optimize single-threaded performance and complex instruction execution.

The figure below illustrates the distribution of chip resources for CPU vs GPU.

Image by the author with inspiration from CUDA C++ Programming Guide

CPUs have powerful cores and a more complex cache hierarchy (allocating a significant number of transistors to it). This design enables faster handling of sequential operations. GPUs, on the other hand, prioritize having a large number of cores to achieve a higher level of parallelism.

Now that we understand these basic concepts, how can we take advantage of these parallel computing capabilities in practice?

Introduction to CUDA

When you run a deep learning model, you probably use a popular Python library such as PyTorch or TensorFlow. However, it is well known that the core of these libraries runs C/C++ code underneath. Also, as we mentioned before, you might use GPUs to speed up processing. That's where CUDA comes in! CUDA stands for Compute Unified Device Architecture, and it is a platform developed by NVIDIA for general-purpose computing on their GPUs. Thus, while DirectX is used by game engines to handle graphical computation, CUDA enables developers to integrate NVIDIA's GPU computational power into their general-purpose software applications, extending beyond just graphics rendering.

In order to implement this, CUDA provides a simple C/C++-based interface (CUDA C/C++) that grants access to the GPU's virtual instruction set and to specific operations (such as moving data between the CPU and the GPU).

Before we go further, let’s understand some basic CUDA Programming concepts and terminology:

  • host: refers to the CPU and its memory;
  • device: refers to the GPU and its memory;
  • kernel: refers to a function that is executed on the device (GPU);

So, in a basic program written using CUDA, the code runs on the host (CPU), sends data to the device (GPU) and launches kernels (functions) to be executed on the device (GPU). These kernels are executed by several threads in parallel. After the execution, the results are transferred back from the device (GPU) to the host (CPU).

So let’s go back to our problem of adding two vectors:

#include <stdio.h>

void AddTwoVectors(float A[], float B[], float C[], int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    ...
    AddTwoVectors(A, B, C, N);
    ...
}

In CUDA C/C++, programmers can define C/C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads.

To define a kernel, you use the __global__ declaration specifier, and the number of CUDA threads that execute the kernel is specified using the <<<...>>> notation:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(A, B, C);
    ...
}

Each thread executes the kernel and is given a unique thread ID, threadIdx, accessible within the kernel through built-in variables. The code above adds two vectors, A and B, of size N and stores the result in vector C. As you can notice, instead of a loop that executes each pair-wise addition sequentially, CUDA allows us to perform all of these operations simultaneously, using N threads in parallel.

But before we can run this code, we need to make another modification. It is important to remember that the kernel function runs on the device (GPU), so all of its data needs to be stored in device memory. You can do this by using the following CUDA built-in functions:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    ...

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

}

Instead of directly passing variables A, B and C to the kernel, we need to use pointers. In CUDA programming, you can’t directly use host arrays (like A, B, and C in the example) within kernel launches (<<<...>>>). CUDA kernels operate on device memory, so you need to pass device pointers (d_A, d_B, and d_C) to the kernel for it to operate on.

Beyond that, we need to allocate memory on the device by using cudaMalloc, and copy data between host and device using cudaMemcpy.

Now we can add the initialization of vectors A and B, and free the device memory at the end of the code.

#include <stdio.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Also, we need to add a call to cudaDeviceSynchronize() after invoking the kernel. This function synchronizes the host thread with the device: when it is called, the host thread waits until all previously issued CUDA commands on the device have completed before continuing execution.

Beyond that, it is important to add some CUDA error checking so that we can identify bugs on the GPU. Without this checking, the host thread (CPU) simply continues execution and CUDA-related errors become difficult to track down.

The implementation of both techniques is shown below:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {

    int N = 1000; // Size of the vectors
    float A[N], B[N], C[N]; // Arrays for vectors A, B, and C

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
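
In larger programs, it is common to wrap this boilerplate in a small helper macro so that every CUDA API call and kernel launch is checked and the failing location is reported. The CUDA_CHECK macro below is a minimal illustrative sketch (a common pattern, not part of the CUDA API itself):

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper: check the result of a CUDA call and abort with the
// file and line where the error occurred.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__,   \
                   cudaGetErrorString(err));                          \
            exit(-1);                                                 \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_A, N * sizeof(float)));
// AddTwoVectors<<<1, N>>>(d_A, d_B, d_C);
// CUDA_CHECK(cudaGetLastError());
// CUDA_CHECK(cudaDeviceSynchronize());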

To compile and run the CUDA code, you'll need to ensure that the CUDA toolkit is installed on your system. Then you can compile the code using nvcc, the NVIDIA CUDA Compiler. If you don't have a GPU on your machine, you can use Google Colab: just select a GPU under Runtime → Notebook settings, save the code in an example.cu file and run:

%%shell
nvcc example.cu -o compiled_example # compile
./compiled_example # run

# you can also run the code with bug detection sanitizer
compute-sanitizer --tool memcheck ./compiled_example

However, our code is still not fully optimized. The example above uses a vector of size N = 1000, which is too small to demonstrate the GPU's parallelization capability. Also, when dealing with deep learning problems, we often handle massive vectors with millions of parameters. However, if we try setting, for example, N = 500000 and run the kernel with <<<1, 500000>>> using the example above, it will throw an error. Thus, to improve the code and perform such an operation, we first need to understand an important concept of CUDA programming: the thread hierarchy.

Thread hierarchy

Kernel functions are launched using the notation <<<number_of_blocks, threads_per_block>>>. So, in our example above, we run 1 block with N CUDA threads. However, each block has a limit on the number of threads it can support. This happens because every thread within a block is required to be located on the same streaming multiprocessor core and must share that core's memory resources.

You can get this limit using the following snippet of code:

int device;
cudaDeviceProp props;
cudaGetDevice(&device);
cudaGetDeviceProperties(&props, device);
printf("Maximum threads per block: %d\n", props.maxThreadsPerBlock);

On current Colab GPUs, a thread block may contain up to 1024 threads. Thus, to process a massive vector, we need more blocks in order to execute many more threads. Blocks are in turn organized into grids, as illustrated below:

https://handwiki.org/wiki/index.php?curid=1157670 (CC BY-SA 3.0)

Now, the thread ID can be accessed using:

int i = blockIdx.x * blockDim.x + threadIdx.x;
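
For example, with blockDim.x = 1024 threads per block, the thread with threadIdx.x = 5 inside block blockIdx.x = 2 computes element i = 2 × 1024 + 5 = 2053 of the vector.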

So, our script becomes:

#include <stdio.h>
#include <stdlib.h>

// Kernel definition
__global__ void AddTwoVectors(float A[], float B[], float C[], int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) // To avoid exceeding the array limit
        C[i] = A[i] + B[i];
}

int main() {
    int N = 500000; // Size of the vectors
    int threads_per_block;
    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    threads_per_block = props.maxThreadsPerBlock;
    printf("Maximum threads per block: %d\n", threads_per_block); // 1024

    // Arrays for vectors A, B, and C
    // (for very large N, heap allocation with malloc is safer than stack arrays)
    float A[N], B[N], C[N];

    // Initialize vectors A and B
    for (int i = 0; i < N; ++i) {
        A[i] = 1;
        B[i] = 3;
    }

    float *d_A, *d_B, *d_C; // Device pointers for vectors A, B, and C

    // Allocate memory on the device for vectors A, B, and C
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMalloc((void **)&d_B, N * sizeof(float));
    cudaMalloc((void **)&d_C, N * sizeof(float));

    // Copy vectors A and B from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with multiple blocks and threads_per_block threads per block
    int number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
    AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);

    // Check for errors
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    // Wait until all CUDA threads are executed
    cudaDeviceSynchronize();

    // Copy vector C from device to host
    cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

}

Performance comparison

Below is a comparison of CPU and GPU computation times for this vector addition, for different vector sizes.

Image by the author

As one can see, the advantage of GPU processing only becomes apparent with a large vector size N. Also, remember that this time comparison only considers the execution of the kernel/function. It does not take into account the time to copy data between host and device which, although often not significant, is relatively considerable in our case because we perform only a simple addition. Therefore, it is important to remember that GPU computation only demonstrates its advantage when the computation is both highly compute-intensive and highly parallelizable.
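
If you want to reproduce this kind of measurement, CUDA events are the standard way to time kernel execution on the device. Below is a minimal sketch, assuming the kernel and device pointers from the previous example:

// Time the kernel with CUDA events (device-side timing, excludes memcpy)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
AddTwoVectors<<<number_of_blocks, threads_per_block>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);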

Multidimensional threads

Okay, now we know how to improve the performance of a simple array operation. But when dealing with deep learning models, we need to handle matrix and tensor operations. In our previous example, we only used one-dimensional blocks with N threads. However, it is also possible to execute multidimensional thread blocks (up to 3 dimensions). So, for convenience, you can run a thread block of N x M threads if you need to run matrix operations. In this case, you can obtain the matrix row and column indices as row = threadIdx.x, col = threadIdx.y. Also, for convenience, you can use the dim3 type to define number_of_blocks and threads_per_block.

The example below illustrates how to add two matrices.

#include <stdio.h>

// N is assumed to be a compile-time constant, e.g. #define N 32

// Kernel definition
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    ...
    // Kernel invocation with 1 block of N x N threads
    dim3 threads_per_block(N, N);
    AddTwoMatrices<<<1, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to handle multiple blocks:

#include <stdio.h>

// Kernel definition
__global__ void AddTwoMatrices(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        C[i][j] = A[i][j] + B[i][j];
    }
}

int main() {
    ...
    // Kernel invocation with multiple blocks of 32 x 32 threads
    dim3 threads_per_block(32, 32);
    dim3 number_of_blocks((N + threads_per_block.x - 1) / threads_per_block.x,
                          (N + threads_per_block.y - 1) / threads_per_block.y);
    AddTwoMatrices<<<number_of_blocks, threads_per_block>>>(A, B, C);
    ...
}

You can also extend this example to process 3-dimensional operations using the same idea.
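
For instance, a minimal sketch of such a 3-dimensional extension could look like the kernel below (this AddTwoTensors kernel is illustrative, with N again assumed to be a compile-time constant):

// Illustrative 3D extension: element-wise addition of two N x N x N tensors
__global__ void AddTwoTensors(float A[N][N][N], float B[N][N][N], float C[N][N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < N && j < N && k < N) {
        C[i][j][k] = A[i][j][k] + B[i][j][k];
    }
}

// Launched with a 3D grid of 3D blocks, for example:
// dim3 threads_per_block(8, 8, 8); // 512 threads per block
// dim3 number_of_blocks((N + 7) / 8, (N + 7) / 8, (N + 7) / 8);
// AddTwoTensors<<<number_of_blocks, threads_per_block>>>(A, B, C);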

Now that you know how to operate on multidimensional data, there is another important and simple concept to learn: how to call functions from within a kernel. This is done using the __device__ declaration specifier, which defines functions that can be called by the device (GPU) directly. Thus, they can only be called from __global__ functions or from other __device__ functions. The example below applies a sigmoid operation to a vector (a very common operation in deep learning models).

#include <math.h>

// Sigmoid function
__device__ float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Kernel definition for applying the sigmoid function to a vector
__global__ void sigmoidActivation(float input[], float output[]) {
    int i = threadIdx.x;
    output[i] = sigmoid(input[i]);
}

So, now that you know these basic but important concepts of CUDA programming, you can start creating your own CUDA kernels. Deep learning models are basically a bunch of matrix and tensor operations such as addition, multiplication, convolution, normalization and others. For instance, a naive matrix multiplication algorithm can be parallelized as follows:

// GPU version
// (M, N and P are assumed to be compile-time constants)

__global__ void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < M && col < P) {
        float C_value = 0;
        for (int i = 0; i < N; i++) {
            C_value += A[row][i] * B[i][col];
        }
        C[row][col] = C_value;
    }
}

Now compare this with the usual CPU implementation of matrix multiplication below:

// CPU version

void matMul(float A[M][N], float B[N][P], float C[M][P]) {
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < P; col++) {
            float C_value = 0;
            for (int i = 0; i < N; i++) {
                C_value += A[row][i] * B[i][col];
            }
            C[row][col] = C_value;
        }
    }
}

You can notice that in the GPU version the two outer loops disappear: each thread computes a single element of C, so only the inner loop over N remains, resulting in much faster processing. Below is a comparison of CPU and GPU performance for N x N matrix multiplication:

Image by the author

As you may observe, the performance improvement of GPU processing is even higher for matrix multiplication operations as the matrix size increases.

Now, consider a basic neural network, which mostly involves y = σ(Wx + b) operations, as shown below:

Image by the author

These operations primarily comprise matrix multiplication, matrix addition, and applying a function to an array, all of which you already know how to parallelize. Thus, you are now capable of implementing, from scratch, your own neural network that runs on GPUs!
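
To make this concrete, here is a minimal sketch of a kernel for a single dense layer forward pass, y = σ(Wx + b), combining the matrix-vector product with the __device__ sigmoid function shown earlier (the kernel name and memory layout here are illustrative assumptions, not code from any library):

// Illustrative dense layer forward pass: y = sigmoid(W * x + b)
// W is out_features x in_features (row-major), x has in_features elements,
// b and y have out_features elements. One thread computes one output element.
__global__ void DenseForward(const float *W, const float *x, const float *b,
                             float *y, int in_features, int out_features) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < out_features) {
        float acc = b[row];
        for (int i = 0; i < in_features; i++) {
            acc += W[row * in_features + i] * x[i];
        }
        y[row] = sigmoid(acc); // __device__ function defined earlier
    }
}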

Conclusion

In this post, we covered introductory concepts of GPU processing for improving the performance of deep learning models. However, it is important to mention that the concepts you have seen here are only the basics and there is a lot more to learn. Libraries like PyTorch and TensorFlow implement optimization techniques that involve other, more complex concepts such as optimized memory access, batched operations and so on (they harness libraries built on top of CUDA, such as cuBLAS and cuDNN). I hope this post helps clear up what goes on behind the scenes when you write .to("cuda") and execute deep learning models on GPUs.

In future posts, I will try to bring more complex concepts regarding CUDA Programming. Please let me know what you think or what you would like me to write about next in the comments! Thanks so much for reading! 😊

Further reading

CUDA Programming Guide — NVIDIA CUDA Programming documentation.

CUDA Documentation — NVIDIA complete CUDA documentation.

CUDA Neural Network training implementation — Pure CUDA C++ implementation of a neural network training.

CUDA LLM training implementation — Training implementation of LLM with pure CUDA C.