GPU Programming Duy-Phuc Pham Sep. 2021







# Whoami

- Start Sta
- Machine/Deep-Learning, Malware analysis, Side-channel.



# Outline

- 1. Introduction to GPU
- 2. Introduction to CUDA C/C++
- 3. Rainbow table/hash generator on GPU





# 1. Introduction to GPU

- Graphics processing unit
  - creation of images for output to a display device.



# A electronic circuit designed to rapidly use memory to accelerate the

Gaming software companies, movie companies and medical research



# CPU vs. GPU

# CPU

- General-purpose capabilities, mostly sequential operations
- Established technology
- Usually equipped with 8 or fewer, powerful cores
- Optimal for some types of concurrent processes but not large scale parallel computations

# **GPU/DSP/FPGA**:

•

•

- Initially created specifically for graphics
- Became more capable of general computations
- Very fast and powerful, computationally
- Uses lots of electrical power
- This allows us to run many threads simultaneously with virtually no context switches



# Data parallelism



Ref: Programming Massively Parallel Processors A Hands-On Approach by David B. Kirk et al



# GPU considerations

- Not all problems are parallelizable
- GPU performance drops dramatically if the code branches •
- CPU and GPU code can overlap execution •
- GPUs do not make function calls efficiently, and cannot recurse •
- 0 modulus, 64bit integer division etc.)
- GPUs have a limited number of registers available to use. 0
- Global memory accesses are very slow
- Local shared memory is very fast



Certain operators are very fast on GPUs, and other functions are not. (e.g Integer division,



# GPU & Global memory





# CUDA vs OpenCL

|  | Comparison               | CUDA                                                                                 |  |  |  |  |  |  |  |
|--|--------------------------|--------------------------------------------------------------------------------------|--|--|--|--|--|--|--|
|  | Performance              | _                                                                                    |  |  |  |  |  |  |  |
|  | Vendor<br>Implementation | Implemented by only NVIDIA device                                                    |  |  |  |  |  |  |  |
|  | OSS vs Commercial        | Proprietary framework of NVIDIA                                                      |  |  |  |  |  |  |  |
|  | OS Support               | Supported on the leading Operating the only distinction of NVIDIA hardward used      |  |  |  |  |  |  |  |
|  | Libraries                | Has extensive high performance librar                                                |  |  |  |  |  |  |  |
|  | Community                | Has a larger community                                                               |  |  |  |  |  |  |  |
|  | Technicalities           | Not a language but a platform and promodel that achieves parallelization us keywords |  |  |  |  |  |  |  |

# Ref: incredibuild.com

# **OpenCL**

Implemented by TONS of vendors including AMD, NVIDIA, Intel, Apple, Radeon etc.

Open Source standard

| g systems with<br>ware must be | Supported on various Operating Systems                                                                               |
|--------------------------------|----------------------------------------------------------------------------------------------------------------------|
| oraries                        | Has a good number of libraries which can be used<br>on all OpenCL compliant hardware but not as<br>extensive as CUDA |
|                                | Has a growing community not as large as CUDA                                                                         |
| programming<br>using CUDA      | Does not enable for writing code in C++ but works<br>in a C programming language resembling<br>environment           |





# 2. CUDA

- CUDA platform
  Export GPU APIs for general purpose
  CUDA C/C++
  Based on C/C++ standard
  - APIs to manage GPU devices, memory etc.





# Goals

- Write and execute starter C code on GPU
- Manage GPU memory
- Communication and synchronization





# Requirements

- Virtual (remote) / Physical access to GPU devices
- C/C++ experience
- NO graphic rendering, GPU/parallel computing experience needed



# Recommendation

- CUDA C Programming Guide v9.1 | April 2018
- Programming Massively Parallel Processors A Hands-On Approach by David B. Kirk, Wen-Mei W Hwu
- CUDA C/C++ Basics (nVidia Corp.)
- GPU Programming CS179 Caltech
- CUDA Thread Indexing Cheatsheet



THIRD EDITION Programming Massively Parallel Processors A Hands on Approach



# Heterogeneous computing



Ref: business insider



# Host: CPU and host memory (RAM) Device: GPU and device memory (VRAM)



# Typical GPU computing workflow

- Setup inputs on the host (CPU-accessible memory)
- Allocate memory for outputs on the host CPU
- Allocate memory for inputs on the GPU
- Allocate memory for outputs on the GPU
- Copy inputs from host to GPU (*slow*)
- Start GPU **kernel** (function that executes on gpu fast!)
- Copy output from GPU to host (slow)



# Simple Processing Flow



# 1. Copy input data from CPU memory to GPU memory

# Ref: CUDA C/C++ Basics NVIDIA

© NVIDIA Corporation 2011







# Simple Processing Flow



- Copy input data from CPU memory to GPU memory
- Load GPU code and execute it, caching data on chip for performance

# Ref: CUDA C/C++ Basics NVIDIA

© NVIDIA Corporation 2011







# Simple Processing Flow



- Copy input data from CPU memory to GPU memory
- Load GPU program and execute, caching data on chip for performance
- Copy results from GPU memory to CPU memory

Ref: CUDA C/C++ Basics NVIDIA

© NVIDIA Corporation 2011





# Helloworld nvcc hello.cu -o helloC

#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
/\*\*
0.00s user 0.00s system 77%
cpu 0.001 total
\*\*/

printf("Hello, World\n");
return 0;

}

/\*\* 0.02s user 0.10s system 55% cpu 0.223 total \*\*/



# Syntax

# global

compiled to run on a device (GPU) instead of host (CPU).

Function mykernel() is called from host code.

Compile: nvcc hello.cu nvcc separates source code into host and device components

Device functions (e.g. mykernel()) processed by NVIDIA compiler Host functions (e.g. main()) processed by standard host compiler like gcc

# A qualifier added to standard C. This alerts the compiler that a function should be



# Simple addition (1) nvcc add.cu -o add \_\_\_global\_\_\_ void add(int \*a, int \*b, int \*c) \*c = \*a+\*b; }

Device (/Host) pointer point to Device(/Host) memory
Device (/Host) pointer maybe passed from to Host(/Device) memory

Whet INCE In Incencery rest in menory



# Simple addition (2) nvcc add.cu -o add \_\_global\_\_ void add(int \*a, int \*b, int \*c) { \*c = \*a+\*b;

- cudaMalloc(), cudaFree(), cudaMemcpy()
- Similar to C: malloc(), free(), memcpy()

Whet IVe In menory rest in menory a,b,c point to device memory -> need to allocate memory on GPU



# Simple addition (3) nvcc add.cu -o add int main(void) {



# Simple addition (4)

# nvcc add.cu -o add

// Copy inputs to device  $cudaMemcpy(d_a, \&a, size, cudaMemcpyHostToDevice);$ cudaMemcpy(d\_b, &b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<1,1>>>(d a, d b, d c);

// Copy result back to host  $cudaMemcpy(\&c, d_c, size, cudaMemcpyDeviceToHost);$ printf("%d+%d = %d\n", a, b, c);

// Cleanup cudaFree(d\_a); cudaFree(d\_b); cudaFree(d\_c); return 0; }



# Simple Parallelization (1)

- Instead of executing add () once, execute N times in parallel

# add<<<1,1>>>(d\_a, d\_b, d\_c); add <<<N, 1>>>(d a, d b, d c);

- Each parallel invocation of add() is referred to as a block •
- The set of blocks is referred to as a grid •
- Each invocation can refer to its block index using blockIdx.x



# Simple Parallelization (2)

\_global\_\_ void add(int \*a, int \*b, int \*c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

element of the array



# By using blockldx.x to index into the array, each block handles a different



# Simple Parallelization (3)

#define N 512 int main(void){ int \*a, \*b, \*c; // host copies of a, b, c int \*d\_a, \*d\_b, \*d\_c; // device copies of a, b, c <u>int size = N\*sizeof(int);</u>// Allocate space for device of a, b, c cudaMalloc((void \*\*)&d\_a, size); cudaMalloc((void \*\*)&d\_b, size); cudaMalloc((void \*\*)&d\_c, size); a = (int \*)malloc(size); b = (int \*)malloc(size); c = (int \*)malloc(size); <u>for(int i=0; i<N; i++)</u> a[i] = -i;b[i] = i \* i \* i;



# Simple Parallelization (4)

// Copy inputs to device cudaMemcpy(d\_a, <u>a</u>, size, cudaMemcpyHostToDevice); cudaMemcpy(d\_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add <<< N, 1 >>> (d a, d b, d c);// Copy result back to host cudaMemcpy(c, d\_c, size, cudaMemcpyDeviceToHost); for(int i=0;i<N;i++) printf("%d+%d = %d\n", a[i], b[i], c[i]);</pre> // Cleanup free(a); free(b); free(c); cudaFree(d a); cudaFree(d b); cudaFree(d c); return 0; }



# GPU Grids, blocks, threads (1)

\_\_global\_\_ void add(int \*a, int \*b, int \*c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }



# GPU Grids, blocks, threads (2)

# \_\_global\_\_ void add(int \*a, int \*b, int \*c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }

• Terminology: a block can be split into parallel threads

We use threadIdx.x instead of blockIdx.x  $\bullet$ 





# GPU Grids, blocks, threads (3)

\_\_global\_\_ void add(int \*a, int \*b, int \*c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; add<<<N,1>>>(d\_a, d\_b, d\_c); add<<<1,N>>>(d\_a, d\_b, d\_c);



- Limit of the number of blocks in a single launch: 65,535.
- Limit of the number of threads per block. 0 - For many GPUs, maxThreadsPerBlock = 512 (or 1024, ..).
- Blocks and threads are often combined. 0 kernel access: Array[block\_index+thread\_index]

# Combining GPU Blocks & Threads (1)



# Consider indexing an array with one element per thread (8 threads/block)



blockIdx.x = 1blockIdx.x = 0

# int index = threadIdx.x + blockIdx.x \* M;

# Combining GPU Blocks & Threads (2)

blockIdx.x = 2blockIdx.x = 3

With M threads per block, a unique index for each thread is given by:







int index = threadIdx.x + blockIdx.x \* M; 5 + \* 8; 2 = 21;

# Combining GPU Blocks & Threads (3)

| 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|

blockIdx.x = 2



Use the built-in variable blockDim.x for threads per block int index = threadIdx.x + blockIdx.x \* blockDim.x;

\_\_global\_\_ void add(int \*a, int \*b, int \*c) { int index = threadIdx.x + blockIdx.x \* blockDim.x; c[index] = a[index] + b[index];

#define N (2048\*2048) // 2\*\*22 #define THREADS\_PER\_BLOCK 512

# Combining GPU Blocks & Threads (4)

# add<<<N/THREADS\_PER\_BLOCK,THREADS\_PER\_BLOCK>>>(d\_a, d\_b, d\_c);



Avoid access beyond the array dimension. \_\_global\_\_ void add(int \*a, int \*b, int \*c, int n) int index = threadIdx.x + blockIdx.x \* blockDim.x; if (index < n) c[index] = a[index] + b[index];</pre> }

#define N (2048\*2048) #define TPB 512 add<<<(N+TPB-1)/TPB,TPB>>>(d\_a, d\_b, d\_c, N);

# Combining GPU Blocks & Threads (5)



### Unlike parallel blocks, threads have mechanisms to efficiently:

- Communicate -
- Synchronize

### Reasons of GPU Blocks & Threads

### Sharing between threads



# Sharing between threads (1)

### Example: 1D stencil calculation





# Sharing between threads (2)

### Example: 1D stencil calculation



With radius 3, each input element is read seven times





### Sharing between threads (3)

- Within a block, threads share data via shared memory
- Extremely fast on-chip memory: a user-managed cache
- Declare using <u>shared</u>, allocated per block
- Not visible to threads in other blocks



### Sharing between threads (4)

\_\_global\_\_ void stencil\_1d(int \*in, int \*out) { \_\_\_\_shared\_\_\_\_int temp[BLOCK\_SIZE + 2 \* RADIUS]; int gindex = threadIdx.x + blockIdx.x \* blockDim.x; int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) {</pre> temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK\_SIZE] = in[gindex + BLOCK\_SIZE];





### Sharing between threads (5)

// Apply the stencil int result = 0;for (int offset = -RADIUS ; offset <= RADIUS ; offset++)</pre> result += temp[lindex + offset];

// Store the result out[gindex] = result;





# Sharing between threads (6)

// Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) {</pre> temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK SIZE] = in[gindex + BLOCK SIZE]; 

// Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++)</pre> result += temp[lindex + offset];

BUGS

DATA RACE

DATA ACCESS VIOLATION



### Sharing between threads (7)

void syncthreads(); Similar to barrier() in C/C++: https://en.wikipedia.org/wiki/ Barrier\_(computer\_science)

WAW hazards

All threads must reach the barrier: In conditional code, the condition must be uniform across the block

### Synchronizes all threads within a block: Used to prevent RAW / WAR /



# Sharing between threads (6)

// Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) {</pre> temp[lindex - RADIUS] = temp[lindex + BLOCK\_SIZE] = ((gindex + BLOCK\_SIZE)<N)?in[gindex +</pre> BLOCK SIZE]:0; }

\_\_syncthreads(); // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++)</pre> result += temp[lindex + offset];

(gindex >= RADIUS)?in[gindex - RADIUS]:0;



### Performance

- Example is used for its simplicity. •
- code
- to the amount of data processed.

 the overhead of allocating device memory, input data transfer from host to device, output data transfer from device to host, and de-allocating device memory will likely make the resulting code slower than the original sequential

This is because the amount of calculation done by the kernel is small relative



### Takeaways

Launching parallel threads
 Launch N blocks with M threads per

 Allocate elements to threads: int index = threadIdx.x + blockIdx.x \* blockDim.x;
 Use <u>shared</u> to declare a variable/array in shared memory: Data is shared between threads in a block & Not visible to threads in other blocks
 Use <u>syncthreads()</u> as a barrier: Use to prevent data hazards

### Launch N blocks with M threads per block with kernel<<<N, M>>>(...);



# Skipped topics

- Multi-dimensional indexing : •
  - A kernel is launched as a grid of blocks of threads
    - blockIdx and threadIdx are 3D •
    - We showed only one dimension (x) •
- Compute capability



# CUDA ERROR HANDLING

cudaError\_t err=cudaMalloc((void \*\*) &d\_A, size); if (error !=cudaSuccess) { printf("%s in %s at line %d\n", cudaGetErrorString(err), \_\_FILE\_\_, \_\_LINE\_\_); exit(EXIT\_FAILURE); }

Consider to use MACRO version: static void HandleError( cudaError\_t err,const char \*file, int line ) {...}

#define HANDLE ERROR(err) (HandleError( err, \_\_FILE\_, LINE\_\_))



# CUDA DEBUG

nvcc -g -G add.cu -o add cuda-gdb ./add > p blockDim.x > info cuda threads > cuda thread 1 > cuda block 1

cuda-memcheck

nvprof

Event Timers

Ref: Introduction to CUDA Utilities <u>https://cis.gvsu.edu/~wolffe/courses/cs677/projects/tutorial\_CUDA-</u> <u>utilities.html</u>





### CUDA Best Practices

- Find ways to parallelize sequential code,
- Minimize data transfers between the host and the device,
- Adjust kernel launch configuration to maximize device utilization,
- Ensure global memory accesses are coalesced,
- Minimize redundant accesses to global memory whenever possible,
- Ref: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Avoid long sequences of diverged execution by threads within the same warp.



# Rainbow table/hash generator on GPU(1)

Name Advanced RT Gen

md5

Hash

Cryptohaze GPU **Rainbow Cracker**  MD4, MD5, NTL M,SHA1

Rainbow Crackalack

NTLM

md5

Open

OSS

Open

Closed

md5-rainbow-tablegen-opencl

RainbowCrack

LM, NTLM, MD5, SHA1, SHA256 ...

Open

Closed







### Notes

GPU is only beta tested

Full GPU acceleration for table generation (incl. reduction function) and cracking

Well documented

**Functionalities** 

RT generator

RT generator, merge, indexing, lookup

RT generator, lookup

Only PoC

**RT** generator

GPU supported in rainbow table RT generator, merge, lookup but not generation indexing, lookup



### Rainbow table/hash generator on GPU(2)

Recommended Reading

- Steven Meyer EPFL, Breaking 53 bits passwords with Rainbow tables using GPUs https://docplayer.net/50461249-Breaking-53-bits-passwords-withrainbow-tables-using-gpus.html
- Russell Edward Graves, Iowa State University, High performance password cracking by implementing rainbow tables https://lib.dr.iastate.edu/cgi/ viewcontent.cgi?article=2860&context=etd



### Rainbow table/hash generator on GPU(2)

Recommended references

- Rainbow2 Rainbow tables using Nvidia GPU: https://github.com/voyager23/ Rainbow2
- NTHashTickler\_CUDA. NT hash bruteforcer on CUDA. https://github.com/ ryanries/NTHashTickler\_CUDA



# Assignments Installation of CUDA Toolkit Parallelise hash computing



References
- CUDA C/C++ Basics NVIDIA
- ACMS 40212/60212: Advance

### - ACMS 40212/60212: Advanced Scientific Computing, U.ND.

