Homogeneous Coordinates and Trivial Matrix Transformations
Imagine a 5-second clip from your favorite game. Maybe you’re holding a corner to block enemy players from their objective. Or maybe you’re locked onto an enemy and dodge-rolling away from an attack. Maybe you’re spinning the camera around your character to admire some cool-looking armor you spent the last few hours grinding for.
In all of these cases, for every visible object in the scene, your console or PC is deciding where to position every vertex to accurately simulate 3D space, in real time. For example, consider a vertex along the branch of a tree that your character is walking past. The game engine loads a 3D model file that defines the tree’s shape in the form of vertices, relative to the center of the model itself. Then, the engine decides where to place that vertex, relative to the origin of the scene, based on the tree’s position in the world. Finally, the engine computes how to position the vertex relative to the camera, or the player’s view into the world. As the camera moves and the branch sways in the wind, these calculations are performed many times per second to give the illusion of movement.
Clearly, transforming the position of a vertex is a fundamental operation in 3D graphics. Suppose we had a unified mathematical framework that we could use to perform these transformations. If we also found a way to execute this math really efficiently on hardware, then we would have everything we need to simulate a 3D scene in real time.
Linear Algebra
This is where linear algebra comes in. Matrices and vectors are the bread and butter of linear algebra, which describes how they can be used to perform transformations. The two transformations that I consider to be “trivial” are scaling and translation; other transformations, such as rotation, can also be described trivially, but their implementations are non-trivial in practice. Therefore, I’ll focus on just scaling and translation in this post.
Before actually transforming a vertex, we need to model it in terms that linear algebra understands. We are already familiar with Euclidean space and Cartesian coordinates — the typical \((x,y,z)\) representation of a point in 3D space. As a vector:
\[\begin{bmatrix} x\\ y\\ z\\ \end{bmatrix}\]

Aside from introducing some new notation, by representing vertices as vectors, we have also introduced a subtle change in the way we think about them. By definition, a vector is a magnitude and a direction. So, instead of thinking about vertices as points in 3D space, we can now think about them as movements from an origin along each axis.
Scaling and Translation
Let’s start with scaling. What does it mean to scale a vector? Scaling increases the magnitude of a vector without changing its direction. Or, we can think about moving a vertex closer or further away from the origin along a direct path. This isn’t too hard to figure out — we can simply multiply each component by the same scalar value. The vector components maintain their proportion to one another, so the direction stays the same, but the magnitude (or distance from the origin) grows or shrinks. For example:
\[2* \begin{bmatrix} 3\\ 1\\ 2\\ \end{bmatrix} = \begin{bmatrix} 6\\ 2\\ 4\\ \end{bmatrix}\]

Now that we have a way to scale a vector, how would we translate it? This is also pretty easy to figure out. Instead of multiplying by a scalar value, we could just add different values to each vector component to move it up, down, left, right, backward and forward. For example:
\[\begin{bmatrix} 3\\ 1\\ 2\\ \end{bmatrix} + \begin{bmatrix} 1\\ 0\\ 3\\ \end{bmatrix} = \begin{bmatrix} 4\\ 1\\ 5\\ \end{bmatrix}\]

A Better Way
In practice, we almost never apply just a single transformation to a vertex. Instead, we apply a series of transformations to compute the position of each vertex in various coordinate spaces. For example, displaying a vertex from the camera’s perspective involves transforming it from local space to world space, and finally to view space.
A naive implementation of scaling and translation as scalar multiplication and vector addition would require a sequential operation for each transformation. Therefore, the number of computations we need to perform for each vertex would grow with the number of transformations, which sets an upper bound on the number of transformations we can execute performantly. Ideally, we would be able to transform a vertex in constant time, regardless of the number of transformations we need to apply.
Luckily, linear algebra tells us we can combine a series of transformations into a single matrix using matrix multiplication. Let’s say we want to perform a series of 10 transformations on every vertex. Using matrix multiplication, we can construct a single matrix representing those 10 transformations, pass it to the GPU, and multiply each vertex by the transformation matrix once. Sure, we still need to construct the transformation matrix in the first place, which will take time, but we only need to do this once per frame on the CPU. The GPU doesn’t need to know how to perform specific transformations or the order in which to apply them. It just needs to be really good at parallelizing matrix multiplication.
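To make this concrete, here is a minimal sketch in plain Python (the `mat_mul` and `mat_vec` helpers are purely illustrative, not any particular engine’s API). Applying two transformation matrices one at a time gives the same result as applying their combined product once:

```python
def mat_mul(a, b):
    """Multiply two square matrices, each given as a list of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(m, v):
    """Apply a matrix (list of rows) to a column vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

scale_by_2 = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
scale_by_3 = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
v = [3, 1, 2]

# Applying the transformations one at a time...
step_by_step = mat_vec(scale_by_3, mat_vec(scale_by_2, v))

# ...matches applying the single combined matrix once.
combined = mat_mul(scale_by_3, scale_by_2)
once = mat_vec(combined, v)

print(step_by_step)  # [18, 6, 12]
print(once)          # [18, 6, 12]
```

In a real engine, the combined matrix would be built once per frame on the CPU, and the GPU would only ever see a single matrix per vertex.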
Scaling and Translation, Revisited
How would we perform the same scaling and translation calculations using matrix multiplication? Scaling is pretty straightforward:
\[2* \begin{bmatrix} 3\\ 1\\ 2\\ \end{bmatrix} = \begin{bmatrix} 3*2\\ 1*2\\ 2*2\\ \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0\\ 0 & 2 & 0\\ 0 & 0 & 2\\ \end{bmatrix} * \begin{bmatrix} 3\\ 1\\ 2\\ \end{bmatrix}\]

We’ve simply taken the identity matrix and multiplied it by a scaling constant to form a scaling transformation matrix. That matrix can be used to scale any 3D vector by a factor of 2.
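Worked out in code, this is just a matrix-vector multiply. A quick sketch in plain Python (the `mat_vec` helper is illustrative, not a real graphics API):

```python
def mat_vec(m, v):
    """Apply a matrix (list of rows) to a column vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

# Identity matrix times 2: a uniform scaling matrix.
scale_by_2 = [[2, 0, 0],
              [0, 2, 0],
              [0, 0, 2]]

print(mat_vec(scale_by_2, [3, 1, 2]))  # [6, 2, 4]
```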
Translation is a bit more difficult to figure out, but still fairly easy to follow once you know the trick:
\[\begin{bmatrix} 3\\ 1\\ 2\\ \end{bmatrix} + \begin{bmatrix} 1\\ 0\\ 3\\ \end{bmatrix} = \begin{bmatrix} 3+1\\ 1+0\\ 2+3\\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 3\\ 0 & 0 & 0 & 1\\ \end{bmatrix} * \begin{bmatrix} 3\\ 1\\ 2\\ 1\\ \end{bmatrix}\]

Wait, what just happened? We are suddenly dealing with 4-dimensional matrices and vectors, and the extra dimension exists only to store a 1 as the last element. Meanwhile, the rightmost column of the matrix now holds the translation values. The reason for this becomes obvious if you work the example out on paper: we need the extra 1 to represent vector addition as matrix multiplication. Put differently, we need a higher-dimensional vector space to represent translation as matrix multiplication.
This is why we use 4-dimensional vectors and matrices for 3D graphics processing! It’s important to note that we aren’t talking about the 4th spatial dimension. The final element is just some metadata used to describe the vector, which helps us with some math (like translation).
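Here is the same trick as a sketch in plain Python (again using an illustrative `mat_vec` helper): the translation offsets sit in the rightmost column, and the trailing 1 in the vector is what picks them up.

```python
def mat_vec(m, v):
    """Apply a matrix (list of rows) to a column vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

# Translate by (1, 0, 3): the offsets live in the last column.
translate = [[1, 0, 0, 1],
             [0, 1, 0, 0],
             [0, 0, 1, 3],
             [0, 0, 0, 1]]

# The trailing 1 is the extra component that makes translation work:
# each offset gets multiplied by it and added to the result.
print(mat_vec(translate, [3, 1, 2, 1]))  # [4, 1, 5, 1]
```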
NOTE: This also applies to the scaling matrix. Although we can scale using 3D matrices only, as shown above, we actually use 4D matrices so that they can be multiplied with other transformations, like this:
\[\begin{bmatrix} 2 & 0 & 0 & 0\\ 0 & 2 & 0 & 0\\ 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 1\\ \end{bmatrix}\]

Homogeneous Coordinates
When we add this 4th dimension to 3D vectors, we call the result homogeneous coordinates. The general idea is that the X, Y, and Z coordinates of a vector are interpreted by dividing by the final component, W. This extra information gives us a couple of useful properties:
- If you multiply a homogeneous coordinate by a scalar constant, the point it represents remains unchanged. In other words, \((1,4,2,1)\) and \((2,8,4,2)\) represent the same point.
- We can now represent infinity. If we set W to zero, we get a vector that points in some direction, but has infinite (or undefined) magnitude.
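The divide-by-W rule is tiny enough to sketch directly (the `to_cartesian` helper name is hypothetical, not from any library):

```python
def to_cartesian(v):
    """Recover the 3D point from a homogeneous coordinate by dividing by W."""
    x, y, z, w = v
    return (x / w, y / w, z / w)

# Scalar multiples of a homogeneous coordinate name the same 3D point.
print(to_cartesian((1, 4, 2, 1)))  # (1.0, 4.0, 2.0)
print(to_cartesian((2, 8, 4, 2)))  # (1.0, 4.0, 2.0)
```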
For vectors representing positions, we use 1 as the final vector component. For directions, we use 0. This is useful for a few reasons, but here is the example I found easiest to understand.
Position vs. Direction
An example of a positional vector is the location of a vertex, a finite point in space. An example of a directional vector is the normal of a face in a 3D model, or the direction perpendicular to the face. The normal could be accurately placed at any point on that face — it conceptually doesn’t have a position. When expressing the normal in homogeneous coordinates, we set W to 0. Why is this important?
Let’s think about diffuse lighting from a light source. To calculate the diffuse brightness, we need the angle at which light hits a surface, which we can find using the dot product between:
- The direction from the surface to the light source
- The direction of the surface’s normal
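As a rough sketch of that calculation in plain Python (the helper names are illustrative; a real engine would do this per-fragment on the GPU), the cosine of the angle falls out of the dot product of the two normalized directions:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def diffuse_cos(surface_pos, light_pos, normal):
    """Cosine of the angle between the surface normal and the light direction."""
    to_light = [l - s for l, s in zip(light_pos, surface_pos)]
    return dot(normalize(to_light), normalize(normal))

# Fragment at (1,2,1), light at (5,5,5), normal pointing along Y.
print(diffuse_cos([1, 2, 1], [5, 5, 5], [0, 1, 0]))  # ≈ 0.469
```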
But first, let’s assume our camera has been translated 10 units in the negative X direction, which means we first need to translate all of our vectors 10 units in the positive X direction. Now, imagine we mistakenly use 1 instead of 0 as the W component of the normal vector, even though it represents a direction, not a position. What will happen?
Let’s say the normal vector is \((0,1,0)\), pointing directly along the Y-axis. We will arbitrarily put the light source at \((5,5,5)\) and the surface fragment at \((1,2,1)\). Let’s do the math to translate everything into view space, then try to reason about the results.
\[\displaylines{ \begin{bmatrix} 5\\ 5\\ 5\\ 1\\ \end{bmatrix} - \begin{bmatrix} 1\\ 2\\ 1\\ 1\\ \end{bmatrix} = \begin{bmatrix} 4\\ 3\\ 4\\ 0\\ \end{bmatrix} \\ \begin{bmatrix} 1 & 0 & 0 & 10\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ \end{bmatrix} * \begin{bmatrix} 0\\ 1\\ 0\\ 1\\ \end{bmatrix} = \begin{bmatrix} 10\\ 1\\ 0\\ 1\\ \end{bmatrix} \\ \begin{bmatrix} 4\\ 3\\ 4\\ 0\\ \end{bmatrix} \cdot \begin{bmatrix} 10\\ 1\\ 0\\ 1\\ \end{bmatrix} \propto \cos(\theta) }\]

We use homogeneous coordinates by adding a W component to each vector. Notice that we correctly set W to 1 for the light’s and the surface fragment’s positions, but we accidentally used 1 for the normal vector’s W component as well. Because of this, the X component of the normal vector is translated by 10, and the dot product produces the wrong angle.
How would this affect a 3D scene, visually? Remember that the translation by 10 comes from the camera’s location. Therefore, this mistake produces a visual bug where the surface gets brighter and darker as the camera moves around! As an example, I replicated this bug in my own engine:
Oof, that doesn’t look quite right — the diffuse brightness of the surface should remain the same regardless of the camera’s position. Let’s look at how using 0 for directional vectors changes things.
\[\displaylines{ \begin{bmatrix} 5\\ 5\\ 5\\ 1\\ \end{bmatrix} - \begin{bmatrix} 1\\ 2\\ 1\\ 1\\ \end{bmatrix} = \begin{bmatrix} 4\\ 3\\ 4\\ 0\\ \end{bmatrix} \\ \begin{bmatrix} 1 & 0 & 0 & 10\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ \end{bmatrix} * \begin{bmatrix} 0\\ 1\\ 0\\ 0\\ \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 0\\ 0\\ \end{bmatrix} \\ \begin{bmatrix} 4\\ 3\\ 4\\ 0\\ \end{bmatrix} \cdot \begin{bmatrix} 0\\ 1\\ 0\\ 0\\ \end{bmatrix} \propto \cos(\theta) }\]

We’ve fixed the bug! The normal vector is completely unaffected by the translation, so the brightness of the surface no longer depends on the camera’s position. This is why homogeneous coordinates are important: setting a vector’s W component to 0 gives it an undefined magnitude, effectively making it immune to translation.
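The whole worked example can be checked in a few lines of plain Python (the helpers here are illustrative, not a real engine’s API). The only difference between the buggy and fixed versions is the normal’s W component:

```python
def mat_vec(m, v):
    """Apply a matrix (list of rows) to a column vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Camera offset: translate everything 10 units along +X.
view_translate = [[1, 0, 0, 10],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]]

to_light = [4, 3, 4, 0]  # light position minus fragment position

buggy_normal = mat_vec(view_translate, [0, 1, 0, 1])  # W = 1: gets translated
fixed_normal = mat_vec(view_translate, [0, 1, 0, 0])  # W = 0: immune

print(dot(to_light, buggy_normal))  # 43 -- depends on the camera offset
print(dot(to_light, fixed_normal))  # 3  -- camera-independent
```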