Category: Programming


Apart from HMDs, the other main focus was changes to the way graphics are programmed. The last major release of DirectX, DirectX 11, was five years ago, and a lot has changed since then. Most notably, multicore CPUs are far more common. The changes from DirectX 11 to DirectX 12 seem to be focused on enabling programmers to get more out of the CPU rather than the GPU, as well as reducing the number and size of the calls that need to be sent to the graphics card.

Below is my summary of the big changes. Bear in mind, though, that DirectX 12 isn't scheduled to be released until around Christmas 2015 and I haven't played with it, so I have no first-hand experience. Take everything with a pinch of salt. First, though, here's a quick recap of how graphics hardware works.

So much of the process of moving model vertices in 3D space, squashing triangles onto the screen and shading pixels is completely independent of other vertices and pixels. That means the results don't depend on each other, and so they could be computed in any order. This type of problem is sometimes referred to as an "embarrassingly parallel" problem: the vast majority of the computation can be done at the same time. That's what your graphics card is for. It's a hugely powerful parallel computer, capable of running the same program on different bits of data at the same time. It's kind of like having a fleet of Ferraris sitting in your PC waiting to execute your code.

This supercomputer is controlled using instructions from the CPU running the application, and changing state each frame is what takes up most of the time. These instructions are sent over the data bus – visualised here as an actual bus. Each frame the bus carries instructions and data to the fleet of Ferraris, telling them what they should do. This bus is the bottleneck, so the less we have to use it the better. Another constraint of this model is that much of the state has to be sent from a main render thread. In DX11 something called Deferred Contexts tried to overcome part of this problem, allowing any thread to send data to the graphics card, but because of the relationships between data sent from different threads there still needed to be a lot of communication and synchronisation with the main thread. This synchronisation means that a lot of the time threads and CPU cores sit idle whilst they wait on results from other cores. Additionally, a lot of data associated with the Deferred Contexts had to be sent on the bus. These are the problems that DirectX 12 is trying to overcome.

In DirectX11 draw calls include a lot of data and state changes, so you have to take the bus!

The strategy is to remove dependencies between calls sent from different cores, and to reduce the amount of data that has to be sent on a frame-by-frame basis. It's a little bit like trying to replace the data bus with a smaller, more agile data Ducati. Here's how (I think) it works:

New data structures stored on the graphics card are better aligned with the hardware, removing the need to build up parts of the pipeline during the draw call and enabling more draw calls per frame. These objects, called Pipeline State Objects (PSOs), can still be swapped in and out at run time, but they are created and saved on the graphics card ahead of time.

Command lists are similar to what DX11 tried to achieve with Deferred Contexts. Commands can be compiled and sent to the graphics card from any thread, but with the introduction of PSOs these command lists are no longer so large, and because they can share PSOs with other draw calls they are less dependent on one another. They just store the information about which PSO they should use and send the calls off to the graphics card.

Bundles offer similar functionality, but allow some state to be inherited from other calls, and some state to be changed. This means that the instructions for a bundle are computed once and replayed with different variables. Whilst the intention appears to be that command lists are constructed every frame and then discarded, bundles seem to be a way of computing commands and saving them between frames to render with different data (both in the same frame, and in different frames).
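
Since I haven't written any DirectX 12 code myself, here's just a toy Python sketch of the general pattern as I understand it – definitely not the real API, which lives in C++, and all of the class and state names are made up. The point is simply that expensive state gets baked into an object once, and cheap per-frame commands merely reference it:

    # Toy illustration of the PSO / command list idea -- not the Direct3D 12 API.
    class PipelineStateObject:
        """A bundle of pipeline state, validated and compiled once, up front."""
        def __init__(self, name, **state):
            self.name = name
            self.state = state

    class CommandList:
        """A per-frame list of cheap commands that reference prebuilt PSOs."""
        def __init__(self):
            self.commands = []

        def set_pipeline_state(self, pso):
            self.commands.append(("set_pso", pso.name))

        def draw(self, mesh_id):
            self.commands.append(("draw", mesh_id))

    # Built ahead of time and kept for the lifetime of the application:
    opaque_pso = PipelineStateObject("opaque", vertex_shader="vs_basic", blend="off")

    # Built each frame, potentially on any thread, then handed to the GPU queue:
    commands = CommandList()
    commands.set_pipeline_state(opaque_pso)
    commands.draw("player_model")
    print(commands.commands)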

DirectX 12 allows more state to be stored in graphics memory, meaning smaller, faster draw calls.

Finally, Descriptor Heaps give programmers the power to build their own heap and table of resources in graphics memory. This means that state describing the resources currently in use no longer has to be set by the CPU. Instead, the GPU can fetch resources from a list held in graphics memory without the need for a call from the CPU to bind each resource.
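
Again, purely as a toy sketch of the concept rather than anything resembling the real API: the application fills a table of resources up front, and a draw then refers to a resource by index instead of the CPU issuing a separate bind call for it.

    # Toy illustration of the descriptor heap idea -- not the Direct3D 12 API.
    # Resources live in a table built ahead of time; draws reference an index.
    descriptor_heap = ["brick_texture", "grass_texture", "player_texture"]

    def draw(mesh_id, texture_index):
        texture = descriptor_heap[texture_index]  # GPU-side lookup, no CPU bind call
        print("drawing %s with %s" % (mesh_id, texture))

    draw("player_model", 2)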

All of these improvements mean that draw calls are smaller, and can be executed more quickly, which means that there can be more draw calls in any frame. It also means that there is less need for synchronisation between CPU threads, which means less time wasted waiting and frees the CPU to spend more time doing useful processing.

Some of the best news is that, unlike the change to DirectX 11, which required many people to buy new hardware, DirectX 12 will work on many existing graphics chips, including the chips in the Xbox One! Exciting times ahead!

Part of my job over the summer is to help prepare software to be deployed into the teaching labs. Mostly this involves harassing staff to find out what software packages they need, but occasionally the problems get a little more interesting. This year that was the case with XNA.


XNA is a great framework for creating games, and we use it as a tool to motivate students to learn how to program whilst creating great games at extra-curricular events such as Three Thing Game. For us, a tool like XNA is an invaluable intrinsic motivator – inspiring students to want to learn to code, as opposed to being motivated because we said so, or because they will get better grades.

According to the official documentation, XNA requires Visual Studio 2010. Now, clearly it's possible to install both Visual Studio 2010 and Visual Studio 2012 on the same machine, but that would have a big impact on the size of the image. We'd rather not install both if we don't have to, but if you try to install XNA on a machine that doesn't include Visual Studio 2010 the installation will fail.

We’re also keen to provide students with as seamless an experience as possible when moving from working at home to working in university – although in the university labs we do insist that students wear pants at all times. It’s with this in mind that I’m writing this blog post, so that students can use XNA at home with Visual Studio 2012 to allow them an easy transition to and from the machines at the University.

After some digging around, as well as a decent amount of experimentation, we found a solution. Although I'm not 100% clear on how it works, here's my interpretation. Normally when you install XNA there are a bunch of steps to go through. One of the steps copies some files to the Visual Studio directory, which it assumes is Visual Studio 2010. When that step fails the installer rolls back all the other changes, like any good installer should. The process below extracts each individual step and, when the time comes, has you copy the files to the correct directory yourself (if you'd rather script the MSI installs in steps 4 to 8, there's a rough sketch after the list).

You’ll need to download this zip file which contains the entire XNA setup and the folders that you’ll need to copy yourself.

  1. Download the zip file and unzip it somewhere. You should see an executable called XNAGS40_setup.exe and a folder called XNA Game Studio 4.0
  2. Open a command line and navigate to the folder that contains XNAGS40_setup.exe – then run XNAGS40_setup.exe /x . You’ll be asked to enter a folder. It’s probably easiest if you create a new empty folder. This folder is temporary and can be deleted after you are done.
  3. Go to the temporary folder and run redists.msi
  4. Run the MSI at %ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0\Setup\XLiveRedist.msi
  5. Run the MSI at %ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0\Redist\XNA FX Redist\xnafx40_redist.msi
  6. Run the MSI at %ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0\Setup\xnaliveproxy.msi
  7. Run the MSI at %ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0\Setup\xnags_platform_tools.msi
  8. Run the MSI at %ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0\Setup\xnags_shared.msi
  9. Copy the folder XNA Game Studio 4.0 provided in the zip file you downloaded at the start to C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft
  10. Go to the temporary folder you extracted to in step 2 and run the MSI named arpentry.msi
  11. Open a cmd window and run “C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\devenv.exe” /setup
  12. Delete the temporary folder you created, as well as the zip file and the folder you extracted that to. You don’t need those any more.
  13. Create some awesome games using XNA! 😀
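
If you'd rather not click through steps 4 to 8 by hand, here's a rough, untested Python sketch that runs the same five MSIs in order. It assumes an elevated (administrator) prompt and the default install locations quoted above – adjust the paths if yours differ.

    import os
    import subprocess

    # The five MSIs from steps 4-8, in the order listed above.
    # %ProgramFiles% is expanded exactly as written in the steps.
    base = os.path.expandvars(r"%ProgramFiles%\Microsoft XNA\XNA Game Studio\v4.0")

    msis = [
        os.path.join(base, "Setup", "XLiveRedist.msi"),
        os.path.join(base, "Redist", "XNA FX Redist", "xnafx40_redist.msi"),
        os.path.join(base, "Setup", "xnaliveproxy.msi"),
        os.path.join(base, "Setup", "xnags_platform_tools.msi"),
        os.path.join(base, "Setup", "xnags_shared.msi"),
    ]

    for msi in msis:
        # /passive shows a progress bar but asks no questions.
        subprocess.check_call(["msiexec", "/i", msi, "/passive"])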

That should have done the trick. Let me know if it works for you :D, or even if it doesn’t 😥

A month or so ago I said I would try to optimise Jason Milldrum's version of Conway's Game of Life running in Minecraft-pi. So now I've done that, and I've tried to keep some decent documentation of the process. This post is the summary of that documentation, but if you're just interested in the results, here they are:

The first step was to add checks to find out where things were running slowest. Python has a really useful module for dealing with time, so writing code for that was easy. The code Jason wrote has three main parts – the part where the world is displayed, the part where the world is updated, and the part where the updated world is copied to the displayed world. The big bottleneck was displaying the world, which took around 20 seconds each time. Updating took two to three seconds, and the copy was so fast as to be negligible.
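
For the curious, the timing checks amounted to little more than this – a minimal sketch using the time module, with render_world, update_world and copy_world standing in as hypothetical names for the three parts of the script:

    import time

    def timed(label, fn, *args):
        """Run fn, print how long it took, and return its result."""
        start = time.time()
        result = fn(*args)
        print("%s took %.2f seconds" % (label, time.time() - start))
        return result

    # Hypothetical usage inside the main loop:
    #   timed("render", render_world, world)
    #   timed("update", update_world, world)
    #   timed("copy", copy_world, world)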

The display was taking so long because the entire world was being reset every frame, instead of just changing the blocks that had actually changed. The first strategy for optimisation was to keep a list of cells changing from living to dying and a list of cells changing from dying to living. When the frame is rendered, only those cells that have changed state are set, reducing the amount of work to around a quarter of what it was.
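
Roughly speaking, the rendering step becomes something like the sketch below. This is my own simplified version rather than the exact code, and the block ids, coordinates and the way the births/deaths lists are built are all assumptions:

    from mcpi import minecraft  # on the Pi's bundled API this may just be 'import minecraft'

    mc = minecraft.Minecraft.create()

    AIR, STONE = 0, 1            # block ids: empty cell vs live cell
    ORIGIN_X, ORIGIN_Y, ORIGIN_Z = 0, 10, 0

    def render_changes(births, deaths):
        """births and deaths are lists of (x, z) grid cells that flipped state."""
        for (x, z) in births:
            mc.setBlock(ORIGIN_X + x, ORIGIN_Y, ORIGIN_Z + z, STONE)
        for (x, z) in deaths:
            mc.setBlock(ORIGIN_X + x, ORIGIN_Y, ORIGIN_Z + z, AIR)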

Now the rendering takes around five seconds initially, and as cells die out the rendering time decreases to around two or three seconds. That's about as fast as the rendering can really go. This is still going to be the bottleneck, and there's also an issue in that, because cells die and are born in separate phases, it's harder to comprehend what is going on – you may be looking at the world in the middle of a phase. I have an idea for that, but in the meantime let's see what we can do to optimise the update.

Initially I was going to reduce the number of loops by trying to calculate values for adjacent cells at the same time as the current cell, but instead I decided to keep track of the number of live neighbours. Now during each pass I check the cached number of neighbours, and if something lives or dies I increment or decrement the neighbour-count array for the next pass. Initially this didn't seem to make very much difference. Then I remembered that a lot of the calculations I was doing were pointless because I had shifted the whole world in an earlier step. It didn't seem likely to bear fruit, but I removed any calculation that was minus zero. Still no real speed increase. Then I realised that I could take away the edge effects and deal with them separately, but a quick test showed that didn't help either.

I was beginning to lose hope, and then I considered the trick I had already used for rendering: what if I returned to the counting approach but kept a list of cells with live neighbours? Then I would expect to see a similar speed-up to the one I saw before, with increasing speeds as the number of cells with neighbours decreases. By taking this approach, and keeping a rolling track of how many neighbours each cell has, the update was reduced from two or three seconds to a tenth of a second. An added bonus is that running the simulation for longer means fewer cells are involved, so rendering drops to between 0.5 and 1.5 seconds. Nice!
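
Here's a simplified sketch of that rolling neighbour-count idea rather than the exact code. It assumes 'alive' is a set of (x, z) cells and 'neighbours' is a defaultdict of live-neighbour counts seeded from the starting pattern, and it ignores the edge effects mentioned above:

    from collections import defaultdict

    # The eight neighbouring offsets of a cell.
    OFFSETS = [(dx, dz) for dx in (-1, 0, 1) for dz in (-1, 0, 1) if (dx, dz) != (0, 0)]

    def step(alive, neighbours):
        """alive: set of (x, z); neighbours: defaultdict(int) of live-neighbour counts."""
        births, deaths = [], []
        # Only cells that are alive or have at least one live neighbour can change.
        for cell in set(alive) | set(neighbours):
            n = neighbours.get(cell, 0)
            if cell in alive and (n < 2 or n > 3):
                deaths.append(cell)
            elif cell not in alive and n == 3:
                births.append(cell)

        # Apply the changes and keep the neighbour counts up to date as we go.
        for (x, z) in births:
            alive.add((x, z))
            for dx, dz in OFFSETS:
                neighbours[(x + dx, z + dz)] += 1
        for (x, z) in deaths:
            alive.discard((x, z))
            for dx, dz in OFFSETS:
                neighbours[(x + dx, z + dz)] -= 1
        return births, deaths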

One disadvantage of this approach is that the list of cells that have live neighbours (or are alive themselves) is not in order, and as the dying and the newly born cells are dealt with in separate lists, the simulation sometimes looked a little confusing. You couldn't really see the start or the end of a generation, and what was happening was no longer clear. To try to combat this I added some more information by colour coding the different cells before they change, to indicate why they are about to change. Cells that die of overcrowding will be marked red, cells that die of under-population will be marked magenta, and cells that are about to be born will be marked yellow. So now we have two phases – the change that is about to happen and the change itself. This means that more work will be done, but hopefully it will give some insight into what is going on. It may seem more responsive, it might also give away clues as to how Minecraft-pi updates the screen in sections, and it should certainly be more colourful.
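
In Minecraft terms the marking phase looks roughly like this – again a sketch rather than the exact code, with the function name and the y coordinate made up, wool being block id 35, and the standard wool data values assumed for the three colours:

    from mcpi import minecraft

    mc = minecraft.Minecraft.create()

    WOOL = 35
    RED, MAGENTA, YELLOW = 14, 2, 4   # wool data values for the three colours

    def mark_changes(overcrowded, underpopulated, births, y=10):
        """Colour-code cells that are about to change, before the change itself."""
        for (x, z) in overcrowded:      # about to die of overcrowding
            mc.setBlock(x, y, z, WOOL, RED)
        for (x, z) in underpopulated:   # about to die of under-population
            mc.setBlock(x, y, z, WOOL, MAGENTA)
        for (x, z) in births:           # about to be born
            mc.setBlock(x, y, z, WOOL, YELLOW)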

As you know, the results are presented above in a video that was surprisingly easy to put together (thanks for reading this far, by the way!). The entire process now takes just under five seconds – up from about 2 seconds for the non-technicolour version – but then the new visualisation involves double the “rendering” calls to Minecraft-pi.

Whilst it's not perfect, I'm quite pleased with the results. Thanks go to Jason Milldrum for the inspiration, and for the challenge, which I've thoroughly enjoyed.