by: Valentin Churavy (Github and Twitter: @vchuravy)
A GPU (Graphics Processing Unit) was initially developed to accelerate the rendering of graphically intensive applications (e.g. games and CAD). GPUs are massively parallel computing devices, they sit in nearly everyone's computer, and they are sold by Intel, Nvidia, and AMD. Their architecture is optimised to perform pipelined per-pixel operations in a SIMD model.
There exist specialised programming languages for programming GPUs. For graphics these programs are called shaders; for general-purpose computation they are called kernels.
There are two major APIs for accessing device resources and launching kernels: the open standard OpenCL and Nvidia's proprietary CUDA. In Julia both APIs are supported through packages in the JuliaGPU organisation.
I am going to use primarily OpenCL in this notebook, since OpenCL can also run on CPUs.
using OpenCL
device, ctx, queue = cl.create_compute_context();
We will need to define an OpenCL C kernel to execute on the compute device.
kernel = """
__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
""";
p = cl.Program(ctx, source=kernel) |> cl.build!
k = cl.Kernel(p, "sum")
OpenCL.Kernel("sum" nargs=3)
N = 500_000
500000
a = rand(Float32, N);
b = rand(Float32, N);
c = zeros(Float32, N);
# Move data to device
a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buff = cl.Buffer(Float32, ctx, :w, length(a))
Buffer{Float32}(@0x000000000471e3e0)
# execute kernel
queue(k, size(a), nothing, a_buff, b_buff, c_buff)
r = cl.read(queue, c_buff);
norm(r - (a+b)) ≈ zero(Float32)
true
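For reference, the per-work-item body of the kernel corresponds to an explicit loop in plain Julia. This is just a CPU sketch for comparison; the name `vadd!` is chosen here and is not part of OpenCL.jl:

```julia
# CPU equivalent of the OpenCL `sum` kernel: what one work-item does for a
# single global id, this loop does for every index.
function vadd!(c, a, b)
    @inbounds for gid in eachindex(a, b, c)
        c[gid] = a[gid] + b[gid]
    end
    return c
end

a = rand(Float32, 8); b = rand(Float32, 8); c = zeros(Float32, 8)
vadd!(c, a, b) == a .+ b  # true
```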
Most GPU code is executed asynchronously, so benchmarking takes a bit of care. When you see benchmarks, make sure that they account for the overhead of copying data back and forth as well as of actually running the code. OpenCL.jl currently blocks on many high-level operations.
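The pitfall is not specific to GPUs; plain Julia tasks show the same effect. The sketch below uses `sleep` as a stand-in for an asynchronously launched kernel:

```julia
# Timing only the launch undercounts badly: @async returns immediately,
# so the elapsed time excludes the work itself.
t_launch = @elapsed (task = @async sleep(0.1))

# Waiting for completion inside the timed region measures the real cost.
t_total = @elapsed begin
    task = @async sleep(0.1)
    wait(task)
end
# t_launch is far smaller than t_total, which is roughly 0.1 s.
```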
@benchmark begin
evt = queue($k, size(a), nothing, $a_buff, $b_buff, $c_buff)
end
BenchmarkTools.Trial:
  memory estimate:  1.00 KiB
  allocs estimate:  34
  --------------
  minimum time:     708.500 μs (0.00% GC)
  median time:      784.849 μs (0.00% GC)
  mean time:        804.709 μs (0.00% GC)
  maximum time:     3.166 ms (0.00% GC)
  --------------
  samples:          6045
  evals/sample:     1
@benchmark broadcast!(+, $c, $a, $b)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     579.994 μs (0.00% GC)
  median time:      722.409 μs (0.00% GC)
  mean time:        778.742 μs (0.00% GC)
  maximum time:     2.523 ms (0.00% GC)
  --------------
  samples:          6113
  evals/sample:     1
If you don't already have code in CUDA C or OpenCL C, you probably don't need or want to use these APIs directly.
Libraries for BLAS and FFT are available at https://github.com/JuliaGPU, based upon CUFFT, CUBLAS, CLFFT, and CLBLAS. These libraries have varying levels of support.
Let's calculate an FFT!
import CLFFT
const clfft = CLFFT;
N = 100
X = ones(Complex64, N)
bufX = cl.Buffer(Complex64, ctx, :copy, hostbuf=X)
p = clfft.Plan(Complex64, ctx, size(X))
clfft.set_layout!(p, :interleaved, :interleaved)
clfft.set_result!(p, :inplace)
clfft.bake!(p, queue)
clfft.enqueue_transform(p, :forward, [queue], bufX, nothing)
result = cl.read(queue, bufX)
@assert isapprox(norm(result - fft(X)), zero(Float32))
Other packages in the JuliaGPU ecosystem (in no particular order):
GPUArrays.jl solves the problem that, depending on your GPU, you might want to use different libraries; it abstracts away some of that complexity.
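As a sketch of what that looks like (the exact API varies across GPUArrays.jl versions and a supported device is required, so treat this as illustrative rather than definitive):

```julia
using GPUArrays

# Upload host arrays to the device; broadcasting then compiles to a GPU
# kernel automatically, so no hand-written OpenCL C is needed.
a = GPUArray(rand(Float32, 500_000))
b = GPUArray(rand(Float32, 500_000))
c = a .+ b        # executed on the device
result = Array(c) # copy back to the host
```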