Open standard for parallel programming


For reasons of time and expertise, we will:

  • skip a few things, especially related to images
  • avoid using OpenCL for host allocations
  • only cover OpenCL 1.2

OpenCL implementations

An API for programming multiple devices

  • Intel CPUs
  • AMD CPUs and GPUs
  • Intel Ivy Bridge GPUs
  • Altera FPGAs

It is useful to keep the GPU example in mind


An OpenCL application can make use of many devices at once. Each vendor provides a platform.

A main dynamic library (the “ICD loader”) acts as a dispatcher

The files under /etc/OpenCL/vendors/*.icd list the various ICDs (installable client drivers)

Each ICD file contains the path to the actual vendor-supplied library
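For instance, on a machine with an NVIDIA driver installed the ICD file might look like this (filename and library name are illustrative):

```
$ cat /etc/OpenCL/vendors/nvidia.icd
libnvidia-opencl.so.1
```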

Platform model

For each platform one can have multiple devices (e.g. 2 different NVIDIA GPUs).

Each device, in turn, has many cores (processing elements) distributed in compute units.

See info example

Compilation model

Compilation is done programmatically

Each platform has its own compiler

The developer box may not have the platform at all!

Hence compilation is done on demand via API

The result is a program object

From that, one extracts kernels by name

Compilation workflow

Platform 🡒 Devices 🡒 Context 🡒 Program 🡒 Kernels

(buildProgram, createKernel)


See compile example
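A minimal sketch of this workflow in host-side C, assuming the kernel source string `src` is given; it picks the first platform and GPU device, and omits error handling and resource release:

```c
#include <CL/cl.h>

cl_kernel build_add_vector(const char *src) {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);          /* pick the first platform */

  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

  /* compile on demand: source -> program object -> kernel extracted by name */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  return clCreateKernel(prog, "add_vector", NULL);
}
```

In real code one would check every returned error code, and query clGetProgramBuildInfo for the compiler log when clBuildProgram fails.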

The device side

OpenCL C

Kernels are written in a dialect of C

__kernel void add_vector(__global float* a, __global float* b,
  __global float* c, int num_els) {
  int idx = get_global_id(0);
  if (idx < num_els) {
    c[idx] = a[idx] + b[idx];
  }
}

OpenCL C limitations

  • No recursion
  • No VLA or variadic functions
  • Kernel function args are restricted (e.g. no double pointers)
  • Limited library functions

OpenCL C additions

  • Memory is in a hierarchy (private, local, global, constant)
  • Additional data types (images, events)
  • Additional functions to detect the kernel position in a hierarchy

Execution model

Kernels are executed on compute units.

Think of compute units as cores, each one executing multiple work items.

Work-items executed on the same compute unit belong to a work group, and support some forms of communication and synchronization.
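For a 1-d launch, OpenCL guarantees that get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0). A plain-C stand-in for that relation:

```c
/* Plain-C stand-in for the OpenCL index relation:
   get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0) */
static int global_id(int group_id, int local_size, int local_id) {
  return group_id * local_size + local_id;
}
```

For example, with work-groups of size 4, the work-item with local id 2 in group 1 has global id 6.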

Memory model

Device memory forms a hierarchy:

  • Global memory is shared on the device (think RAM)
  • Local memory is shared in a work-group (think cache)
  • Private memory is for a single work-item (think registers)
  • Constant memory is a read-only (in kernels) area of the global memory

Constant and global memory can be read and written by the host, while local and private memory are not visible to it.

Example revisited

The check on get_global_id(0) plays the role of the termination condition of a loop

__kernel void add_vector(__global float* a, __global float* b,
  __global float* c, int num_els) {
  int idx = get_global_id(0);
  if (idx < num_els) {
    c[idx] = a[idx] + b[idx];
  }
}
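In plain C the same computation is the loop below; the kernel turns each iteration into one work-item, with the loop bound becoming the `if` guard:

```c
/* serial version of the vector addition: one iteration per work-item */
void add_vector_serial(const float *a, const float *b, float *c, int num_els) {
  for (int idx = 0; idx < num_els; idx++)
    c[idx] = a[idx] + b[idx];
}
```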

The host side

Buffers and memory

All memory on the device is manually managed.

Buffers on the device can be created using createBuffer.

It supports various flags to:

  • mark the memory as read-only or write-only (from kernels)
  • fill the memory by copying from the host
  • reuse host memory that is visible from the device
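As a sketch, creating a read-only device buffer initialized from a host array might look like this (context `ctx` and array `host_a` assumed given; error handling omitted):

```c
#include <CL/cl.h>

cl_mem make_input_buffer(cl_context ctx, const float *host_a, size_t n) {
  /* read-only from kernels, initialized by copying from host memory */
  return clCreateBuffer(ctx,
                        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                        n * sizeof(float), (void *)host_a, NULL);
}
```

Passing CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR asks the implementation to reuse the host memory rather than copy it.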

Commands and queues

All commands are submitted to the device using command queues, which can be in-order or out-of-order (for async computations).

Out of order queues require synchronization through events which we will see later.

Queues can be used to:

  • copy memory back and forth
  • submit kernels
  • insert barriers between kernels


Kernel submission works in two steps:

  • Bind the kernel to the arguments using clSetKernelArg
  • Submit the kernel to a queue using clEnqueueNDRangeKernel

The work space is assumed to be divided into 1-d, 2-d or 3-d regions; work-groups, work-items and buffers are arranged along the corresponding dimensions

See vector add example
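The two steps might look like this for the vector-add kernel (queue, kernel and buffers assumed already created; the work-group size of 64 is an arbitrary choice; error handling omitted):

```c
#include <CL/cl.h>

void launch_add(cl_command_queue queue, cl_kernel kernel,
                cl_mem a, cl_mem b, cl_mem c, cl_int num_els) {
  /* step 1: bind the arguments by position */
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
  clSetKernelArg(kernel, 3, sizeof(cl_int), &num_els);

  /* step 2: submit over a 1-d work space, rounded up to the group size */
  size_t local = 64;
  size_t global = ((size_t)num_els + local - 1) / local * local;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                         0, NULL, NULL);
}
```

The rounding up is why the kernel guards with `if (idx < num_els)`: the last work-group may contain work-items past the end of the vector.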

Barriers and fences

Using the barrier function one can synchronize work-items belonging to the same work-group.

There are some flags that determine whether memory modifications have been flushed:

  • CLK_LOCAL_MEM_FENCE means modifications to local memory are visible
  • CLK_GLOBAL_MEM_FENCE means modifications to global memory are visible

There is no synchronization among work-groups in the same kernel.


Reductions

Operations that fold over a sequence are naturally serial

If the operation is associative, one can

  • parallelize it over chunks of the sequence, then
  • put together the partial results in a single thread
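For addition, a plain-C sketch of this two-phase strategy (each chunk could be reduced by one work-group; the combining step runs in a single thread):

```c
#include <stddef.h>

float chunked_sum(const float *v, size_t n, size_t chunk) {
  float total = 0.0f;
  for (size_t start = 0; start < n; start += chunk) {
    /* partial sum over one chunk: done in parallel on the device */
    size_t end = start + chunk < n ? start + chunk : n;
    float partial = 0.0f;
    for (size_t i = start; i < end; i++)
      partial += v[i];
    total += partial;  /* combining step: done serially */
  }
  return total;
}
```

This relies on associativity: regrouping the additions per chunk must not change the result.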

Reducing each chunk in a work-group makes it possible to use barriers

Parallel add example

Implementing this for addition on a vector gives the following reduction strategy

Exploiting commutativity

When the operation is commutative, the reduction can be grouped in a different way

This yields accesses to consecutive memory locations, hence more efficient use of SIMD operations.

Vector reduction example

// `vec` is a chunk of a vector in local memory
// local size should be a power of 2
int idx = get_local_id(0);
for (int s = get_local_size(0) / 2; s > 0; s = s / 2) {
  if (idx < s) {
    vec[idx] += vec[idx + s];
  }
  barrier(CLK_LOCAL_MEM_FENCE);  // make this stride visible before the next
}
// vec[0] now holds the reduction of the chunk
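A plain-C simulation of the stride-halving loop, where the inner loop stands in for the work-items that run in parallel; it checks that the sum ends up in element 0:

```c
/* simulate the work-group reduction for a local size that is a power of 2 */
float stride_reduce(float *vec, int local_size) {
  for (int s = local_size / 2; s > 0; s /= 2)
    for (int idx = 0; idx < s; idx++)  /* in OpenCL these run in parallel */
      vec[idx] += vec[idx + s];
  return vec[0];
}
```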

See K-Means example


Events

Events are objects that go through various states: queued, submitted, ready, running, ended and complete.

Every operation submitted on a queue can

  • complete events, or
  • block on events

This makes it possible to synchronize operations on different queues, or on out-of-order queues

Creating events

Events can be generated:

  • by the host using clCreateUserEvent
  • by the device, providing an event pointer to clEnqueueWriteBuffer, clEnqueueNDRangeKernel and similar
  • by queues using clEnqueueMarkerWithWaitList

Waiting for events

Various objects can block and wait for event completion:

  • host, using clWaitForEvents
  • kernels, providing event lists to clEnqueueNDRangeKernel
  • queues, using clEnqueueBarrierWithWaitList

Also, one can attach host-side callbacks to event completion

This makes it possible to express various synchronization patterns
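A sketch of attaching a host callback to an event (the event is assumed to come from an enqueue call; the kernel name passed as user data is illustrative):

```c
#include <CL/cl.h>
#include <stdio.h>

static void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data) {
  printf("finished: %s\n", (const char *)user_data);
}

void watch(cl_event ev) {
  /* fires when the event reaches the CL_COMPLETE state */
  clSetEventCallback(ev, CL_COMPLETE, on_done, (void *)"add_vector");
}
```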

More topics

You may also want to look at

  • image buffers
  • kernel pipes (only OpenCL 2)
  • atomic operations
  • device-side queues (only OpenCL 2)
  • shared virtual memory (only OpenCL 2)
  • host pinned memory