Graphics have always been a complete black box to me, and I would venture to guess that most programmers feel the same way. I attribute this to the fact that graphics are often only accessible through several layers of (usually leaky) abstraction. For example, most web developers have optimized an application by avoiding repaints or utilizing hardware acceleration during animations. But what does this actually mean, under the hood? How do we interface with graphics hardware as programmers? How do pixels end up on the screen?

I doubt that I can even scratch the surface on these questions, not to mention the questions they lead to, in this post. So, my goal is only to build a simplified mental model of graphics programming and processing to contextualize future learnings.

OpenGL and Vulkan

OpenGL and Vulkan are APIs that abstract the GPU from the application programmer. OpenGL is no longer in active development, as it has been succeeded by Vulkan. Neither imposes a specific implementation; they’re just APIs, after all. Some high-level differences between the two include:

  • As the successor, Vulkan supports newer hardware capabilities, such as raytracing.
  • OpenGL is older, so it is more widely supported. At the time of this writing, Vulkan is not directly implemented on macOS because Apple unfortunately chose to invest in its own low level graphics API, Metal. There is an open source project called MoltenVK that translates Vulkan to Metal, and it supports an impressive surface area of the Vulkan API. But it isn’t perfect or vendor supported, which doesn’t feel great for graphics programming.
  • Vulkan is much lower level than OpenGL. The best example I could find of this difference is that Vulkan features explicit memory management, while OpenGL does not.
  • OpenGL holds a singleton state machine, where each call can mutate some global state. The Vulkan API is object-based, although the available “classes” for those objects are already well defined by the spec (see the sketch after this list).
  • Vulkan supports multithreading, including synchronization semantics. OpenGL is designed to be controlled by a single CPU thread, and, conceptually, operations are meant to execute in order.
  • OpenGL shaders are written in GLSL, while Vulkan consumes shaders in a pre-compiled binary format called SPIR-V. Technically, this means Vulkan supports any shading language, as long as a compiler to SPIR-V exists.
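
To make the object-based style concrete, here is a minimal sketch of creating a Vulkan instance (it assumes the Vulkan headers are available; error handling and cleanup are omitted). Rather than mutating a global context, the application fills out descriptor structs and receives explicit handles:

// Describe the application in a struct rather than via global state
VkApplicationInfo appInfo = {
    .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
    .pApplicationName = "hello-vulkan",
    .apiVersion = VK_API_VERSION_1_0,
};

// Describe the instance we want, referencing the struct above
VkInstanceCreateInfo createInfo = {
    .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
    .pApplicationInfo = &appInfo,
};

// Receive an explicit object handle; later calls take it as a parameter
VkInstance instance;
vkCreateInstance(&createInfo, NULL, &instance);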

I will probably experiment with both OpenGL and Vulkan, but I’m starting with OpenGL simply because a) there are some incredible resources available for OpenGL and b) it is simpler.

For reference, OpenGL code looks like this:

// Vertex data: three points in 3D space forming a triangle
float data[] = {
    -0.5f, -0.5f, 0.0f,
     0.5f, -0.5f, 0.0f,
     0.0f,  0.5f, 0.0f
};

unsigned int id;

// Allocate buffer
glGenBuffers(1, &id);

// Subsequent operations will reference the bound buffer
glBindBuffer(GL_ARRAY_BUFFER, id);

// Load data into buffer (the size argument is in bytes)
glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_STATIC_DRAW);

// Specify shape of buffered data: attribute 0 is three floats per vertex
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);

// Draw buffered data (the count is in vertices, not floats)
glDrawArrays(GL_TRIANGLES, 0, sizeof(data) / (3 * sizeof(float)));

Shaders

Most gamers and 3D artists have encountered the word “shader”, perhaps in a loading screen or modeling software menu. A shader is a program that runs on the GPU. Shaders are written using languages specified by the graphics API and run in a limited execution environment.

OpenGL defines a language called GLSL that is used to write shaders. Here’s an example of a simple GLSL shader:

// Language version specification
#version 330 core

// Input variable and type, unique to each parallel execution
layout (location = 0) in vec3 coord;

// Input variable shared by parallel executions
uniform mat4 transform;

// Program body that executes on each core, in parallel
void main()
{
    gl_Position = transform * vec4(coord, 1.0);
}

In OpenGL, shader programs are compiled and linked at runtime by the graphics driver. The shader program can then be “used” by application code, which sets the current shader program on the OpenGL context. Subsequent draw commands will be processed by this shader program until a different one is used.
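
As a rough sketch of what that looks like in application code (ignoring error checking, and assuming vertex_src and fragment_src are C strings holding GLSL source):

// Compile the vertex shader from source at runtime
unsigned int vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 1, &vertex_src, NULL);
glCompileShader(vs);

// Compile the fragment shader the same way
unsigned int fs = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(fs, 1, &fragment_src, NULL);
glCompileShader(fs);

// Link the compiled shaders into a single program
unsigned int program = glCreateProgram();
glAttachShader(program, vs);
glAttachShader(program, fs);
glLinkProgram(program);

// Set the current program on the context; subsequent draws use it
glUseProgram(program);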

The Graphics Pipeline

Conceptually, graphics are processed in a pipeline, where each step produces a result from the output of the previous step. There’s a seriously fantastic blog series from 2011 on this topic, but the gist is that commands and data from application API calls are sent to the GPU, then processed in a series of steps to produce the pixel values that appear on the screen. Some steps of this pipeline are programmable; that’s where shaders come in. To grossly oversimplify how graphics are rendered:

  • Applications call into a graphics API to issue draw commands, load vertex data, or compile shaders (OpenGL shaders are compiled and optimized by the graphics driver).
  • API calls are sent to the driver, which encodes them in memory as commands in a command buffer. This buffer is submitted to the GPU, which executes the commands in order.
  • The vertex shader receives vertex data (points in space and associated metadata) and has the opportunity to transform each vertex individually, usually by converting it to the desired coordinate space.
  • Primitive assembly groups individual vertices into primitives, such as points, lines, and triangles. (An optional tessellation stage can also subdivide these primitives into finer geometry.)
  • The geometry shader has the opportunity to transform these primitives individually, e.g. translating triangles outward to explode an object.
  • The rasterization stage converts primitives, which are conceptually groups of points in space, into a two-dimensional buffer of pixel values.
  • The fragment, or pixel, shader then determines the color for each pixel, which can be useful for implementing lighting, mapping textures, etc.
  • Results are written to the frame buffer, which is a reserved portion of video memory. When the frame has finished rendering, the frame buffer can be swapped to the display device.

This process is implemented using a combination of hardware and software, and the details are obviously specific to the design of the graphics card. The pipeline also doesn’t execute one stage at a time; it’s massively parallelized in hardware to achieve the necessary throughput.
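
To tie those stages back to application code: once buffers and shaders are set up as in the earlier snippets, a single draw call is what kicks the pipeline off. As a sketch (assuming program is the linked shader program from above and matrix is an array of 16 floats):

// Look up the "transform" uniform declared in the vertex shader
int location = glGetUniformLocation(program, "transform");

// Upload the matrix shared by every vertex shader invocation
glUseProgram(program);
glUniformMatrix4fv(location, 1, GL_FALSE, matrix);

// Issue the draw; vertex shading, primitive assembly, rasterization,
// and fragment shading all happen on the GPU as a result of this call
glDrawArrays(GL_TRIANGLES, 0, 3);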

Frame Buffer Techniques

As the graphics pipeline is producing pixel values and writing them to video memory, the frame buffer is concurrently being read by the display device. Therefore, we have a synchronization problem — we need a way to ensure that the frames we are rendering are displayed in a complete state, one at a time, to prevent visual glitches.

Rendering to a frame that is staged in memory, then swapped to the display, is called “double buffering”, and it is used to prevent partial frames from making it to the screen. There is also a higher-level abstraction known as a swap chain, in which more than one staged frame (known as a “back buffer”) can be used. A swap chain is simply a generalization of double buffering: double buffering is a swap chain with exactly two buffers.
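
As a purely conceptual sketch (this is not a real API; the names are made up for illustration), a swap chain is just a small ring of buffers where presenting a frame advances an index:

typedef unsigned int Pixel; // e.g. a packed RGBA value

typedef struct {
    Pixel *buffers[3]; // three buffers here; two would be classic double buffering
    int front;         // index of the buffer currently being displayed
} SwapChain;

// Rendering always targets a buffer that is not being displayed...
Pixel *back_buffer(SwapChain *sc) {
    return sc->buffers[(sc->front + 1) % 3];
}

// ...and presenting a finished frame just rotates the ring
void present(SwapChain *sc) {
    sc->front = (sc->front + 1) % 3;
}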

In the simplest case, the back buffer is swapped to the front greedily, as soon as each frame finishes rendering. However, this does not synchronize with the display device itself; if the buffer is swapped while the display is mid-refresh, a visual artifact called “screen tearing” can appear, where the top and bottom of the screen are painted from two different frames. To prevent this, the swap can be delayed until the display finishes painting a frame, which eliminates tearing but caps the frame rate at the display’s refresh rate. In game settings, this rendering mode is commonly known as “V-Sync”.
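
In practice, the windowing layer usually owns the swap. For example, using GLFW (assuming a window and OpenGL context have already been created), a render loop with V-Sync enabled might look like this sketch:

// Ask the driver to wait for the display's vertical blank before swapping
glfwSwapInterval(1);

while (!glfwWindowShouldClose(window))
{
    // Render the next frame into the back buffer
    glClear(GL_COLOR_BUFFER_BIT);
    glDrawArrays(GL_TRIANGLES, 0, 3);

    // Swap the completed back buffer to the front; with the swap interval
    // above, this waits for the next refresh
    glfwSwapBuffers(window);
    glfwPollEvents();
}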

Graphics Cards and GPUs

A graphics card is basically a miniature computer designed for high throughput computation. It comes complete with its own circuit board, onboard memory, and a processor (the GPU). Anyone who has built a PC probably knows that the graphics card fits into a PCIe slot on the main system motherboard. Graphics cards use this interface to communicate bi-directionally with the rest of the system, which allows the GPU to access some reserved system memory and the CPU to access GPU registers and video memory. Onboard memory access is high bandwidth, but also high latency. Said differently, the memory architectures used to build graphics cards are designed for bulk transfers of large blocks of data rather than many small, latency-sensitive accesses.

GPUs themselves are essentially giant arrays of low-powered (and low clock speed) CPU-like cores. It’s as if someone wrote a program, spun up a few thousand threads to process some data, then said, “Hey, instead of scheduling these threads in software and contending for CPU resources, what if we implemented them in hardware?” Each core is complete with an ALU capable of floating point and integer arithmetic, pipelined execution, and processor caches. GPUs are designed around the SIMD processing model, which stands for “Single Instruction, Multiple Data”. In other words, all cores run the same program while processing different streams of data.
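
As a loose CPU-side analogy (this is ordinary C, not GPU code), the loop body below is the single instruction stream, and each index is the separate data stream a GPU would hand to its own core:

// On a CPU, one thread walks this loop sequentially; on a GPU, each index
// would be processed by a separate core, all running the same "kernel"
void scale_kernel(float *out, const float *in, float factor, int i) {
    out[i] = in[i] * factor;
}

void scale_all(float *out, const float *in, float factor, int n) {
    for (int i = 0; i < n; i++)
        scale_kernel(out, in, factor, i);
}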

This architecture allows graphics cards to efficiently process high throughput, embarrassingly parallel workloads. And, as long as pipeline stalls and slow data accesses are reduced (i.e. the GPU stays busy), each frame can be processed within a predictable window of time, which is exactly what you want if you’re trying to achieve a certain frame rate.