Voxel Engine – More Speed!

Back in my last post I analysed in fairly decent depth the performance of a run based voxel engine compared to the more traditional approach. However, since then I have had a couple of ideas on how to optimise the traditional approach, and so I quickly set about implementing them. In this post I intend to describe what the changes were and discuss their performance implications.

Adjacency Table

The idea for this optimisation came to me while reading Michael Abrash’s incredible “Graphics Programming Black Book”, which can be found online here. The inspiration came from the end of chapter 17, which covers optimisations that can be made to the game of life.

For those who don’t know, the game of life is a simple cellular automaton consisting of a 2D orthogonal grid, in which each cell is either alive or dead. Each generation of the game, the following rules are applied to all the cells:

  1. Any live cell with fewer than 2 live cells in its 8-cell neighbourhood dies
  2. Any live cell with 2 or 3 live neighbours survives
  3. Any live cell with more than 3 live neighbours dies
  4. Any dead cell with exactly 3 live neighbours becomes alive

Counting the number of neighbours for every cell each generation is rather costly, but Michael proposes a solution to this. As cells change state relatively infrequently, it is advantageous for each cell to keep a running count of its live neighbours. Whenever a cell changes state, it updates the counts of the cells in its neighbourhood. As roughly only a tenth of the cells change each generation, we only do around a tenth of the work on average: the extra cost of updating a cell's neighbours is outweighed by the lower cost of iteration.
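As a rough sketch of the idea (a minimal illustration of the technique, not Abrash's actual implementation; the grid size and layout are my own assumptions):

#include <cstdint>

// One byte per cell (0 = dead, 1 = alive) plus a parallel array of
// live-neighbour counts that is only touched when a cell changes state.
constexpr int WIDTH = 256, HEIGHT = 256;
uint8_t cells[HEIGHT][WIDTH] = {};
uint8_t neighbourCounts[HEIGHT][WIDTH] = {};

// Called only when a cell actually flips state.
void setCell(int x, int y, bool alive) {
    if (cells[y][x] == (alive ? 1 : 0)) return;      // no change, nothing to do
    cells[y][x] = alive ? 1 : 0;
    int delta = alive ? 1 : -1;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + WIDTH) % WIDTH;       // wrap-around edges
            int ny = (y + dy + HEIGHT) % HEIGHT;
            neighbourCounts[ny][nx] += delta;
        }
    }
}

Only the cells that change pay this cost each generation; the generation loop itself just reads neighbourCounts.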

Meshing is actually a very similar situation: we have a large number of voxels, each of which must visit its 6 adjacent voxels to determine whether to add a face. The data updates relatively infrequently, and therefore a similar system can be used. The system I implemented gives every voxel an additional byte of data, of which 6 bits encode which of its faces need to be rendered. A 1 bit represents a face to be rendered, so voxels with no faces can be skipped over with a single comparison against 0, which is normally faster than a comparison against any other value. Using the same tests as described in the previous article, this method alone was enough to roughly halve the meshing time in all 3 cases.
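To illustrate, here is a minimal sketch of how such an adjacency byte could be computed. The bit layout, names and strides are my own assumptions rather than the engine's actual ones, and handling of chunk borders is omitted:

#include <cstdint>

// Hypothetical face bits, one per direction.
enum FaceBits : uint8_t {
    FACE_POS_X = 1 << 0, FACE_NEG_X = 1 << 1,
    FACE_POS_Y = 1 << 2, FACE_NEG_Y = 1 << 3,
    FACE_POS_Z = 1 << 4, FACE_NEG_Z = 1 << 5,
};

// Recompute one voxel's adjacency byte: a bit is set for every neighbouring
// voxel that is empty, i.e. for every face that must be rendered. When a voxel
// is edited, only it and its 6 neighbours need this call; the rest of the
// chunk's table is left untouched.
uint8_t computeAdjacency(const uint8_t* voxels, int index,
                         int strideX, int strideY, int strideZ) {
    uint8_t adjacency = 0;
    if (voxels[index + strideX] == 0) adjacency |= FACE_POS_X;
    if (voxels[index - strideX] == 0) adjacency |= FACE_NEG_X;
    if (voxels[index + strideY] == 0) adjacency |= FACE_POS_Y;
    if (voxels[index - strideY] == 0) adjacency |= FACE_NEG_Y;
    if (voxels[index + strideZ] == 0) adjacency |= FACE_POS_Z;
    if (voxels[index - strideZ] == 0) adjacency |= FACE_NEG_Z;
    return adjacency;
}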

The next optimisation was to make air voxels have an adjacency byte of 0, meaning we can quickly skip over both air voxels and voxels with no faces just by checking the adjacency byte. This change roughly halved the meshing times again, to 2.21 seconds for the Sin(Z) heightmap.

The final optimisation I made to this method was to batch retrievals. Whilst this is already done somewhat by the CPU cache, with most processors retrieving 64 bytes at once, doing it more explicitly allows for faster skipping. As I am using flat 1D arrays this was very easy to implement: I simply used a pointer to a larger datatype as the iterator. The algorithm checks that the value pointed to by the iterator is not 0 and, if so, processes each byte separately, checking for 0 equality before checking the set bits. Whilst I did try further subdivisions, e.g. one 32-bit check followed by two 16-bit checks, this performed really poorly. In order to decide on the correct iterator size I ran the 3 performance tests using a 16-bit, a 32-bit and a 64-bit iterator. The results are below:

HeightMap | 16-bit iterator | 32-bit iterator | 64-bit iterator
--------- | --------------- | --------------- | ---------------
Flat      | 1.281 seconds   | 0.906 seconds   | 0.703 seconds
Sin(Z)    | 2.016 seconds   | 1.359 seconds   | 1.235 seconds
Sin(X)    | 1.844 seconds   | 1.391 seconds   | 1.609 seconds

Based on these results I used the 32-bit iterator, as it gave the most consistently good results.
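For reference, here is a rough sketch of what the batched skipping loop looks like with a 32-bit iterator. The names and chunk layout are assumptions, the table is assumed to be 4-byte aligned with a size that is a multiple of 4, and the actual face emission is elided:

#include <cstdint>
#include <cstddef>

void meshChunk(const uint8_t* adjacency, size_t voxelCount) {
    const uint32_t* it  = reinterpret_cast<const uint32_t*>(adjacency);
    const uint32_t* end = reinterpret_cast<const uint32_t*>(adjacency + voxelCount);
    for (; it != end; ++it) {
        if (*it == 0) continue;                        // 4 voxels with no faces: skip
        const uint8_t* bytes = reinterpret_cast<const uint8_t*>(it);
        for (int i = 0; i < 4; ++i) {
            uint8_t faces = bytes[i];
            if (faces == 0) continue;                  // this voxel has no faces
            for (int face = 0; face < 6; ++face) {
                if (faces & (1u << face)) {
                    // emitFace(...) would add this face's vertices and indices
                }
            }
        }
    }
}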

The big problem with this approach, however, is its memory footprint: at 32KB per chunk it is hardly small. However, if the nearly 8 times improvement in meshing speed is not enough to justify it, there are numerous algorithms other than meshing that can make use of the table. Most important for me is its usefulness for cellular automata based simulation: fluids, for example, can quickly determine where to flow to.

The adjacency table also means each chunk has all the information it needs to generate its MegaVoxels, so we only need to load into memory the chunks that are visible. When chunks are loaded into memory they can quickly regenerate their adjacency tables to account for any changes in adjacent chunks. Without this system we would have to load a border of chunks that aren’t rendered.

One other great benefit of this adjacency table is that we can quickly count the number of set bits, and therefore the number of faces to render, using the following code:

uint32_t* it = (uint32_t*)&(voxelAdjacency[chunkY * 4096]);
uint32_t* end = (uint32_t*)&(voxelAdjacency[(chunkY + 1) * 4096]);
uint32_t v = 0;
int32_t c = 0;
for (; it != end; ++it) {
    // Sideways addition: counts the set bits of four adjacency bytes at once
    v = *it;
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    c += (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
return c;

Credit to this site where I found the sideways bit counting algorithm.

This is useful as it allows one to allocate hardware vertex and index buffers and then stream data directly to them during meshing, rather than copying the data across afterwards.
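As a small example of what this enables (the vertex format and helper below are placeholders, not the engine's actual ones): each rendered face needs 4 vertices and 6 indices, so the bit count gives the exact buffer sizes up front.

#include <cstdint>
#include <cstddef>

struct Vertex { float x, y, z; uint8_t face; };        // placeholder vertex format

void chunkBufferSizes(int32_t faceCount, size_t& vertexBytes, size_t& indexBytes) {
    vertexBytes = size_t(faceCount) * 4 * sizeof(Vertex);
    indexBytes  = size_t(faceCount) * 6 * sizeof(uint32_t);
    // Buffers of exactly these sizes can be created through the graphics API
    // and mapped, letting the mesher write vertices and indices straight into them.
}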

Heightmap

Whilst in a previous article I dismissed using a plain heightmap as having little performance benefit, I have since come up with a variation of it that gives a significant performance boost.

As terrain is generally fairly flat, often only a few of a Chunk’s MegaVoxels have faces, and even these only have faces between certain Y values. The optimisation is therefore simply to store these minimum and maximum Y values, restricting the bounds the meshing algorithm has to iterate over. As the map is 128 voxels high and each MegaVoxel is 16 voxels high, there are 8 MegaVoxels per chunk, so these values can all fit into a single 64-bit value, with each Y value encoded in 4 bits. A table of the results can be found below:

HeightMap | Original Algorithm | Adjacency Table | Adjacency Table + Heightmap
--------- | ------------------ | --------------- | ---------------------------
Flat      | 6.656 seconds      | 0.906 seconds   | 0.391 seconds
Sin(Z)    | 9.688 seconds      | 1.359 seconds   | 1.016 seconds
Sin(X)    | 9.812 seconds      | 1.391 seconds   | 1.094 seconds
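For illustration, here is a sketch of how the min/max Y values might be packed and unpacked; the field layout and helper names are my own assumptions:

#include <cstdint>

// Each chunk is 8 MegaVoxels tall; for each one we pack a 4-bit minimum and a
// 4-bit maximum Y (0-15, local to the MegaVoxel) into one 64-bit value.
uint64_t setYBounds(uint64_t bounds, int megaVoxel, uint32_t minY, uint32_t maxY) {
    int shift = megaVoxel * 8;                          // 8 bits per MegaVoxel
    bounds &= ~(uint64_t(0xFF) << shift);               // clear the old min/max
    bounds |= uint64_t((maxY << 4) | minY) << shift;    // low nibble = min, high = max
    return bounds;
}

void getYBounds(uint64_t bounds, int megaVoxel, uint32_t& minY, uint32_t& maxY) {
    uint32_t packed = uint32_t(bounds >> (megaVoxel * 8)) & 0xFF;
    minY = packed & 0xF;
    maxY = packed >> 4;
    // The meshing loop then only iterates y in [minY, maxY] for this MegaVoxel.
}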

Conclusion

So what does this mean for the run based system? Well, performance was the one area that set it apart from the traditional approach, with both memory usage and polygon count being very similar. However, with these optimisations its lead in this area has largely been eroded: in fact, with the Sin(X) heightmap the traditional approach is now 4 times faster!

I think this is probably the final nail in the coffin for the run based approach. Whilst it was a fun and interesting area to investigate, it isn’t a very viable engine, so don’t expect any further posts on it. If I find an application particularly well suited to it I will post about it, but otherwise I will be working with the approach described in this post. My next goal is to add raycasting to enable player interaction with the world, so expect a post on that at some point in the near future.
