by: Valentin Churavy (Github and Twitter: @vchuravy)
A GPU (Graphics Processing Unit) was initially developed to accelerate the rendering of graphically intensive applications (e.g. games and CAD). GPUs are massively parallel computing devices, they sit in nearly everyone's computer, and they are sold by Intel, Nvidia, and AMD. Their architecture is optimised to perform pipelined per-pixel operations in a SIMD model.
There exist specialised programming languages for programming GPUs. For graphics these programs are called shaders; for general-purpose computation they are called kernels.
There are two major APIs for accessing device resources and launching kernels: the open standard OpenCL and Nvidia's proprietary CUDA. In Julia both APIs are supported through packages in the JuliaGPU organisation.
I am going to use primarily OpenCL in this notebook, since OpenCL can also run on CPUs.
using OpenCL
device, ctx, queue = cl.create_compute_context();
We will need to define an OpenCL C kernel to execute on the compute device.
kernel = """
__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
""";
p = cl.Program(ctx, source=kernel) |> cl.build!
k = cl.Kernel(p, "sum")
OpenCL.Kernel("sum" nargs=3)
N = 500_000
500000
a = rand(Float32, N);
b = rand(Float32, N);
c = zeros(Float32, N);
# Move data to device
a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buff = cl.Buffer(Float32, ctx, :w, length(a))
Buffer{Float32}(@0x000000000471e3e0)
# execute kernel
queue(k, size(a), nothing, a_buff, b_buff, c_buff)
r = cl.read(queue, c_buff);
norm(r - (a+b)) ≈ zero(Float32)
true
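For reference, the per-work-item body of the kernel corresponds to an explicit loop in plain Julia. This is just a CPU sketch for comparison; the name `vadd!` is chosen here and is not part of OpenCL.jl:

```julia
# CPU equivalent of the OpenCL `sum` kernel: what one work-item does for a
# single global id, this loop does for every index.
function vadd!(c, a, b)
    @inbounds for gid in eachindex(a, b, c)
        c[gid] = a[gid] + b[gid]
    end
    return c
end

a = rand(Float32, 8); b = rand(Float32, 8); c = zeros(Float32, 8)
vadd!(c, a, b) == a .+ b  # true
```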
Most GPU code is executed asynchronously, so benchmarking takes a bit of care. When you see benchmarks, make sure that they account for the overhead of copying data back and forth as well as of actually running the code. OpenCL.jl currently blocks on many high-level operations.
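The pitfall is not specific to GPUs; plain Julia tasks show the same effect. The sketch below uses `sleep` as a stand-in for an asynchronously launched kernel:

```julia
# Timing only the launch undercounts badly: @async returns immediately,
# so the elapsed time excludes the work itself.
t_launch = @elapsed (task = @async sleep(0.1))

# Waiting for completion inside the timed region measures the real cost.
t_total = @elapsed begin
    task = @async sleep(0.1)
    wait(task)
end
# t_launch is far smaller than t_total, which is roughly 0.1 s.
```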
@benchmark begin
evt = queue($k, size(a), nothing, $a_buff, $b_buff, $c_buff)
end
BenchmarkTools.Trial:
  memory estimate:  1.00 KiB
  allocs estimate:  34
  --------------
  minimum time:     708.500 μs (0.00% GC)
  median time:      784.849 μs (0.00% GC)
  mean time:        804.709 μs (0.00% GC)
  maximum time:     3.166 ms (0.00% GC)
  --------------
  samples:          6045
  evals/sample:     1
@benchmark broadcast!(+, $c, $a, $b)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     579.994 μs (0.00% GC)
  median time:      722.409 μs (0.00% GC)
  mean time:        778.742 μs (0.00% GC)
  maximum time:     2.523 ms (0.00% GC)
  --------------
  samples:          6113
  evals/sample:     1
If you don't already have code in CUDA C or OpenCL C, you probably don't need or want to use these APIs directly.
Libraries for BLAS and FFT are available at https://github.com/JuliaGPU, based upon CUFFT, CUBLAS, CLFFT, and CLBLAS. These libraries have varying levels of support.
Let's calculate an FFT!
import CLFFT
const clfft = CLFFT;
N = 100
X = ones(Complex64, N)
bufX = cl.Buffer(Complex64, ctx, :copy, hostbuf=X)
p = clfft.Plan(Complex64, ctx, size(X))
clfft.set_layout!(p, :interleaved, :interleaved)
clfft.set_result!(p, :inplace)
clfft.bake!(p, queue)
clfft.enqueue_transform(p, :forward, [queue], bufX, nothing)
result = cl.read(queue, bufX)
@assert isapprox(norm(result - fft(X)), zero(Float32))
Other packages in the JuliaGPU ecosystem (in no particular order):
GPUArrays.jl solves the problem that, depending on your GPU, you might want to use different libraries; it abstracts away some of that complexity.
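As a sketch of what that looks like (the exact API varies across GPUArrays.jl versions and a supported device is required, so treat this as illustrative rather than definitive):

```julia
using GPUArrays

# Upload host arrays to the device; broadcasting then compiles to a GPU
# kernel automatically, so no hand-written OpenCL C is needed.
a = GPUArray(rand(Float32, 500_000))
b = GPUArray(rand(Float32, 500_000))
c = a .+ b        # executed on the device
result = Array(c) # copy back to the host
```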