Software by Steven

Sometimes a square peg is a round hole


Cluster Size, Cache Lines and Why Old Video Games Had 3-Letter High Scores

Like many other software developers, I’m constantly trying to deepen my understanding of things, to learn more and push myself further. As such, I recently took it upon myself to try and make a high score system for a 3D game I’d made last year. I wasn’t content with just a simple “cout << score”, however; I wanted to make this an academic exercise. I wanted the “perfect” high score system.

So I started researching. Given that this dealt with IO, I wanted to make sure I adhered to a few good practices, and to learn about others. For this, I had to consider the low-level implementation details. First consideration…

Cluster Size

Windows file systems (NTFS, FAT and exFAT) allocate space on the hard drive according to something called cluster size. File allocation on the hard drive must be a multiple of the cluster size. Files up to the cluster size (let’s say 4 KB) will take up that much space on the disk. If the file contents amount to less (let’s say a 2 KB text file), the rest of the allocated space is padding. This is done for speed reasons. Files larger than the cluster size (6 KB) will have extra clusters allocated so the entire file can be stored (8 KB on disk). The relationship can be generalized as:

disk space = cluster size × round up ( file size / cluster size )
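As a quick sanity check, here’s that relationship as a minimal C snippet (not part of the high score system itself, just the arithmetic):

#include <stdio.h>

/* Round a file size up to the space it actually occupies on disk. */
static unsigned long long disk_space(unsigned long long file_size,
                                     unsigned long long cluster_size)
{
    unsigned long long clusters = (file_size + cluster_size - 1) / cluster_size;
    return clusters * cluster_size;
}

int main(void)
{
    printf("%llu\n", disk_space(2048, 4096));  /* 2 KB file -> 4096 bytes on disk */
    printf("%llu\n", disk_space(6144, 4096));  /* 6 KB file -> 8192 bytes on disk */
    return 0;
}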

One downside of this is that every file whose size isn’t an exact multiple of the cluster size has wasted space. Another is that since the OS reads information in entire clusters, performance suffers when the file size is less than the cluster size (only some of the data in the cluster is relevant). While the age of terabyte hard drives makes the first downside negligible, the second should still be a concern. However, this being an academic exercise, I opted to make my high scores file exactly the cluster size.

So just how big might the cluster size be? The optimal cluster size varies by file system, operating system and hard drive size, but by and large a reliable figure is 4 KB. According to Microsoft, this is the NTFS default on any Windows NT 4.0 machine, or anything newer than Windows 2000 with a volume under 16 TB. Given that the game will require DirectX 9 and will likely run on a home computer, the Windows NT 4–6 family is almost certain to be the host OS, and NTFS is a near-guarantee for the file system. So until 16 TB hard drives are common, I should be safe with a 4 KB file size 🙂

Cache Line Size

Here’s where I had to start taking a lot of consideration and care. With the varying types of memory in computers (registers, L1 cache, L2 cache, etc.), I wanted to make my data structures as accommodating as possible. For this, I wanted to be sure they would fit into the L1 cache as cleanly as possible. This way, more data could be kept there for quick transfer to the CPU rather than having to retrieve values from slower memory like RAM. Turns out, both AMD and Intel recommend that for this, structure size should be an integral multiple of the cache line size, and that structures be aligned on that byte boundary. In other words, an L1 cache with a cache line size of 64 works best with a 64- (or 128-) byte structure stored in memory at an address divisible by 64.
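For illustration, here’s a minimal sketch of what that recommendation looks like in code, using C++11’s alignas (newer than the toolchain I was on at the time, but it makes the idea explicit):

#include <cassert>
#include <cstdint>

#define CACHE_LINE_SIZE 64

// Ask the compiler to place instances on a 64-byte boundary.
struct alignas(CACHE_LINE_SIZE) AlignedBlock {
    char data[CACHE_LINE_SIZE];    // exactly one cache line of payload
};

static_assert(sizeof(AlignedBlock) % CACHE_LINE_SIZE == 0,
              "size should be an integral multiple of the cache line size");

int main()
{
    AlignedBlock block;
    // The address of an aligned object is divisible by the cache line size.
    assert(reinterpret_cast<std::uintptr_t>(&block) % CACHE_LINE_SIZE == 0);
    return 0;
}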

Turns out, nearly all modern processors are 64-bit and work off a 64-byte L1 cache line. It’s not a guarantee that an n-bit processor will have an n-byte L1 cache line size, but based on my research into the Pentium family, it’s a fair assumption. So I started trying to juggle how to work with these 64 bytes. I didn’t see a score going above 4 billion, so 4 bytes seemed sufficient for that. I wanted to store a date as well; an 8-byte value seemed enough. With 12 bytes down and 52 to go, it seemed 64 was a bit much. Aiming for 32, I then gave an extra 4 bytes to a bit field. 16 bytes down, 16 to go. And I needed a name. 16 bytes makes for an array of 8 wchar_t variables, which meant a name of 7 letters could be supported (plus a terminating null wchar_t on the end). Why wchar_t? I wanted to go beyond the 1-byte “char” so as not to limit myself to ASCII characters.

#define CACHE_LINE_SIZE 32
#define NAME_SIZE ((CACHE_LINE_SIZE - sizeof(int)*2 - sizeof(long long)) / sizeof(wchar_t))
typedef struct score_t {               // Engineered to be CACHE_LINE_SIZE bytes in size
 int score;                            // Most commonly used member first, used in ordering (4 bytes)
 int flags;                            // Score flags (4 bytes), helps natural alignment of next item (the date)
 long long date;                       // Time stamp (8 bytes; plain long is only 4 bytes on Windows)
 wchar_t name[NAME_SIZE];              // Player's name, takes up the remainder. Placed last to ease natural alignment
} score_t;

With this structure, I was able to fit 2 scores in one cache line. Upon seeing if I could fit 4, I realized something. At a 16-byte struct size, I would’ve stripped out the bit field and dropped back to single-byte chars, which if you do the math (16 − 4 for the score − 8 for the date = 4 bytes) would allow for 3 letters to be specified as the earner of the high score. That sounded familiar…

Pac-Man Hi Score

Notice how there’s only three letters in the name field? A lot of old 8-bit games only allowed for 3 letters… Pac-Man (above), Galaga, Contra, and Mario among them. Given that they ran on 8-bit processors, and assuming an 8-byte cache line size was involved and that the character set was likely limited to ASCII, you get the following variation on the above:

#define CACHE_LINE_SIZE 8
#define NAME_SIZE (CACHE_LINE_SIZE - sizeof(int))
typedef struct score_t {               // Engineered to be CACHE_LINE_SIZE bytes in size
 int score;                            // Most commonly used member first, used in ordering (4 bytes)
 char name[NAME_SIZE];                 // Player's name, takes up the remainder. Placed last to ease natural alignment
} score_t;

Do the math and it works out that there is just enough room for 3 letters (plus a terminator). Sure, the size of an int wouldn’t have been 4 bytes on that processor, but I’m going to guess, based on the perfect Pac-Man high score being realized at 3,333,360, that 4 bytes were allocated for the score field all the same. Maybe first, middle and last initials weren’t the real reason for having a 3-letter high score entry.

At the end of it all, I used the first code snippet in my game: 4,096 bytes of high scores at 32 bytes/score makes for a hefty 128 high scores. However, the real joy I took out of this wasn’t the code but gaining new insight into some of my first games, and a sense of like-mindedness with their developers in trying to eke out every cycle possible. Code on.
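For completeness, here’s roughly how writing that cluster-sized file works out. This is a minimal sketch rather than the game’s actual code; the file name and zeroed scores are purely illustrative:

#include <stdio.h>
#include <wchar.h>

#define CLUSTER_SIZE    4096
#define CACHE_LINE_SIZE 32
#define NAME_SIZE ((CACHE_LINE_SIZE - sizeof(int)*2 - sizeof(long long)) / sizeof(wchar_t))
#define MAX_SCORES      (CLUSTER_SIZE / CACHE_LINE_SIZE)   /* 4096 / 32 = 128 scores */

typedef struct score_t {               /* same layout as engineered above */
    int score;
    int flags;
    long long date;
    wchar_t name[NAME_SIZE];
} score_t;

int main(void)
{
    static score_t scores[MAX_SCORES] = {0};   /* one cluster's worth of scores */

    /* Writing the whole array yields a file of exactly one cluster (4,096 bytes),
       assuming sizeof(score_t) is 32 as engineered above. */
    FILE* f = fopen("highscores.dat", "wb");
    if (f) {
        fwrite(scores, sizeof(score_t), MAX_SCORES, f);
        fclose(f);
    }
    return 0;
}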

Unravelling Win32 Threads

For the 3D game we’re working on, I’ve also taken it upon myself to help alleviate load times. We’re using Collada models, and it turns out they can get very, very large. Collada files are XML-formatted and, as such, take a while to read in as text. A common alternative is to convert these static XML files to binary files for quicker reading, but unfortunately we don’t have the time to implement this. So we decided to investigate multithreading and asynchrony. Turns out, there’s a lot in the Windows API to learn.

Windows has two large APIs for threading: managed and unmanaged. Managed ties into the Common Language Runtime (CLR) and is used in .NET or Managed C++ applications. This was not what I wanted; I wanted the raw, low-level control. Being familiar with the CLR implementation, though, I found a Managed -> Unmanaged API Mapping useful. Unfortunately, this only mapped out the basics.

From here, I had other decisions to contend with. A lot of the basic functionality exists in the old Windows NT 4.0 functions, but Vista shipped with a whole bunch of specialized functions, like a revamped ThreadPool family. Using an older version of DirectX (DirectX 9 mostly), I wanted to be sure things would be backwards-compatible with XP. So as neat as they were, they too were out. I did manage to find another great mapping, this one of the Original ThreadPool API -> New ThreadPool API.

For now though, I only wanted a single thread, so I stuck with CreateThread(). From here, I wanted some way to test the thread’s state (whether it was signaled as finished, aborted, etc.). There’s a pair of functions, WaitForSingleObjectEx and WaitForMultipleObjectsEx, which will wait up to a predetermined amount of time, returning a status code indicative of the thread(s) status. From there you can get the return code of a thread, if it has finished, by using the GetExitCodeThread() function.
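Put together, the basic pattern looks something like this. It’s a minimal sketch rather than the game’s actual loader, with LoadModels standing in for whatever work the thread really does:

#include <windows.h>
#include <stdio.h>

// Stand-in worker function; the real one would parse the Collada file.
DWORD WINAPI LoadModels(LPVOID param)
{
    UNREFERENCED_PARAMETER(param);
    Sleep(1000);                    // pretend to do a second's worth of loading
    return 0;                       // exit code retrievable via GetExitCodeThread
}

int main(void)
{
    HANDLE thread = CreateThread(NULL, 0, LoadModels, NULL, 0, NULL);
    if (!thread)
        return 1;

    // Wait up to 5 seconds for the thread to become signaled (i.e. finish).
    DWORD wait = WaitForSingleObjectEx(thread, 5000, FALSE);
    if (wait == WAIT_OBJECT_0) {
        DWORD exitCode = 0;
        GetExitCodeThread(thread, &exitCode);
        printf("Thread finished with exit code %lu\n", exitCode);
    }
    CloseHandle(thread);
    return 0;
}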

This was straightforward enough, but I wanted some way to execute a function once the remote thread was done. The reason for this was to notify the parent thread. I began looking into the QueueUserAPC function. No matter what I tried, though, it would always execute at the beginning of the thread, not the end. This is despite heeding documentation warnings about being sure the thread had started before queuing the callback. As I’d later find out from a forum post by someone else in my position, this was not the right way to do things.

Turns out there are different types of threading models, and Windows implements the “pull” model rather than the “push” model. This means that rather than child threads pushing a message to the parent thread to respond to right then, they must queue it up in a message queue, just like how it works with Windows GUI events. In other words, if Windows is a librarian and a child thread is an employee tasked with fetching a book for a customer, the employee must not interrupt the librarian when returning with the book. This would be bad, as the librarian may be conversing with and helping the customer. Rather, the employee must signal to the librarian that they have the book, and the librarian will take the book when appropriate. There are two ways to do this: PostMessage and PostThreadMessage. The former takes an HWND handle for the window to post to, while the latter takes a DWORD thread ID. I used PostMessage because I was posting back to the GUI thread, but if I wasn’t, I would’ve had to force the creation of a message queue in the receiving thread. This would be done by calling this in the parent thread:

PeekMessage(&msg, NULL, WM_USER, WM_USER, PM_NOREMOVE)

With a way to notify the parent thread, I was ready. Here was my plan (a rough sketch of it in code follows the list):

  • Start up a loading screen
  • Begin loading model information on a background thread
  • Store all information in a pointer in memory
  • Notify the GUI thread when done
  • Have the GUI thread use the pointer
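Here’s a minimal sketch of that plan, assuming a hypothetical user-defined message (WM_APP + 1) and stand-in names (LoadModels, LoadContext, useModelData); the real game code wires this into the existing window procedure:

#include <windows.h>

#define WM_MODELS_LOADED (WM_APP + 1)   // hypothetical user-defined message

struct LoadContext {
    HWND  hwnd;          // GUI window to notify when done
    void* modelData;     // filled in by the background thread
};

// Background thread: load everything, then notify the GUI thread.
DWORD WINAPI LoadModels(LPVOID param)
{
    LoadContext* ctx = (LoadContext*)param;
    ctx->modelData = /* parse the Collada file into memory */ NULL;
    // "Signal the librarian": post a message; the GUI thread pulls it
    // from its queue whenever it gets around to it.
    PostMessage(ctx->hwnd, WM_MODELS_LOADED, 0, (LPARAM)ctx->modelData);
    return 0;
}

// Kick it off from the GUI thread (ctx must outlive the worker):
//   CreateThread(NULL, 0, LoadModels, &ctx, 0, NULL);
//
// And in the GUI thread's window procedure:
//   case WM_MODELS_LOADED:
//       useModelData((void*)lParam);   // pointer handed back from the loader
//       break;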

At first, I tried to be creative. With the GetExitCodeThread method, I tried to return the pointer as the thread’s exit code for reference by the GUI thread. This was possible because both memory addresses and return codes are 32-bit words. I could just treat the address stored in the pointer as a numeric value (cast the pointer to a DWORD, or unsigned long), return that value from the child thread’s function, and then cast it back again on the parent thread. This worked in debug mode, but for some reason I received Access Violation errors when running in release mode. I’m not sure why; perhaps thread permissions are altered for speed as part of the release build compilation options. At any rate, I needed to think differently.

Enter a storage class. I named mine ThreadMarshaller. It’s basically just an array of void pointers, and holds two functions: set(void*, int) and get(int). The child thread sets a value at an index, which is then “getted” from the parent thread after being signaled. Simple, easy, and works in both debug and release mode. I had working asynchrony.
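The class itself is tiny. Here’s a minimal sketch of the idea (the slot count is arbitrary; the real class isn’t much bigger):

#include <cstddef>

// A fixed-size table of void pointers shared between threads.
class ThreadMarshaller {
public:
    ThreadMarshaller() {
        for (std::size_t i = 0; i < SLOTS; ++i)
            m_slots[i] = 0;
    }
    void  set(void* value, int index) { m_slots[index] = value; }   // called by the child thread
    void* get(int index) const        { return m_slots[index]; }    // called by the parent after being signaled
private:
    static const std::size_t SLOTS = 8;
    void* m_slots[SLOTS];
};

Because the parent only calls get() after the child has signaled completion, a plain array is enough for this one-shot hand-off; no locking is needed.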

I feel like there was a much better way to do this, though. The SetEvent() function, for example, signals an event object, the same kind of signal the above-mentioned WaitForSingleObjectEx function waits on. There’s also something called Thread Local Storage, which seems to be a native Windows take on my ThreadMarshaller. Also, using a ThreadPool rather than manual thread handling would likely provide a more scalable solution. Alas, I wasn’t able to play with them all. Hopefully soon 🙂
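For reference, here’s roughly what the SetEvent() route would look like; a minimal sketch I didn’t end up using in the game, with Worker standing in for the loader:

#include <windows.h>
#include <stdio.h>

static HANDLE g_done;    // auto-reset event shared with the worker

DWORD WINAPI Worker(LPVOID param)
{
    UNREFERENCED_PARAMETER(param);
    Sleep(500);          // pretend to load something
    SetEvent(g_done);    // signal completion
    return 0;
}

int main(void)
{
    g_done = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset, initially non-signaled
    HANDLE worker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

    // Wait on the event exactly as you would on a thread handle.
    WaitForSingleObjectEx(g_done, INFINITE, FALSE);
    printf("Worker signaled completion\n");

    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    CloseHandle(g_done);
    return 0;
}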

Assembling a Lower-Level Understanding

Time for something different from my usual bloggings about JavaScript and the BBB project. Alongside it, I’m on a team of five working to produce a 3D game using C/C++ and DirectX. Luckily, we’ve been granted a custom framework by our professor, but it’s still a lot of tough work. Whereas many developers take a “get it working and make it look nice” approach, it has become habit for me to incorporate one other maxim: “get it working blindingly fast”. And it’s being put to the test.

Enter real-time 3D systems, where you have 1/60 of a second to do every calculation required for your application to run. Sure, you can stretch that a little; most people won’t notice a difference between 30 fps and 60 fps, but with 60 fps being the norm… well, it’s tough. And in perf-grinding fashion, I decided I’d take a shot at collision detection, where there are iterations upon iterations of computations.

THE REHEARSAL

The first shot went alright. It worked in theory, but when I implemented it… 1 fpm. No, not 1 frame per second, 1 frame per minute. Turns out I’d underestimated the effect of a nested loop on a data set of size 10,000 (that’s 10,000² = 100,000,000 pair checks per frame). So I had to start thinking deeper. And lower. It crossed my mind that to make this work, I might have to cross that hallowed C/Assembler boundary and do it on my own. So I did some research. Basically, all I wanted was the absolute value of some arithmetic inside a nested loop. The first shot looked something like this (note that comments in assembly start with a semicolon):

register int diff = collidable[j]->getPartition();

__asm {
mov eax,[diff] ;Move diff into the accumulator (eax) rather than assuming it's already there
mov edx,[part1] ;Move part1 to register edx
cmp eax,edx ;Compare eax (diff) to edx (part1)
jge lblSubtract ;Jump to lblSubtract label if eax is greater than or equal to edx
xchg eax,edx ;Exchange register values so the larger value ends up in eax
lblSubtract: sub eax,edx ;Subtract edx from eax and store the result in eax
mov [diff], eax ;Move the absolute difference back into diff
}

A few things I learned about assembly (ASM) first: Intel processors are optimized for operations to occur on the eax (accumulator) register, with data for those operations coming from the edx (data) register. Other convention registers exist (ecx for count variables, for example), but I wanted to get my feet wet, not drown. For the interested, some have it down to an art.

Also, ASM is very different from C. C does all the memory movement for you, but in ASM you must indicate that you want memory moved into the registers you work with. You can issue instructions that work on operands in memory, but these tend to be much more expensive in terms of clock cycles and bytes. Exactly what the repercussions of this are, I’m still learning; I know it’s like saturated fats: keep ’em low. For each desired action, too, you must issue an instruction. I’ve commented a few above, but a complete list of Intel instructions (up until the 80486) can be found here. I found manuals on the rest of the Intel lineup too.

I kept digging.

THE TESTS

The first approach seems like it’s a pretty good way, huh? Turns out it isn’t. I did some more exploring and learned a few more things. I had three test cases to find the absolute value of a subtraction:

1. Ternary operations

register int diff = collidable[j]->getPartition() > part1 ? collidable[j]->getPartition() - part1 : part1 - collidable[j]->getPartition();

2. Negative multiplication

register int diff = collidable[j]->getPartition() - part1;
if (diff < 0)
    diff *= -1;

3. Inline Assembly (repeated from above)

register int diff = collidable[j]->getPartition();

__asm {
mov eax,[diff] ;Move diff into the accumulator (eax) rather than assuming it's already there
mov edx,[part1] ;Move part1 to register edx
cmp eax,edx ;Compare eax (diff) to edx (part1)
jge lblSubtract ;Jump to lblSubtract label if eax is greater than or equal to edx
xchg eax,edx ;Exchange register values so the larger value ends up in eax
lblSubtract: sub eax,edx ;Subtract edx from eax and store the result in eax
mov [diff], eax ;Move the absolute difference back into diff
}

I tested each both with and without compiler optimizations. The results surprised me.

Option 1, the ternary operator, was incredibly slow unoptimized. Checking the disassembly, that one line of C code compiled down to an astonishing 41 instructions. Optimized fared significantly better at 17 instructions. Looking good so far. Not really, actually. The inline assembly approach yielded 17 instructions unoptimized and a blitzing 11 optimized. So with inline assembly I could debug with the performance of an optimized build. Of course, this locks it into Intel and AMD processors and leaves PowerPC out, but who needs anything but Windows? We’re using DirectX, surely there’s no possibility of porting the framework to OpenGL? Actually, there may be. Oops.

But there was still approach 2. That couldn’t be any good though: it’s 3x the number of lines of C as approach 1, and doesn’t have the touch of custom assembly. The results really opened my eyes. Unoptimized, it yields 17 instructions, the same as unoptimized inline assembly. Optimized, it gets quashed down to an unbelievable 8 instructions. That’s 5 times fewer than the unoptimized ternary operator and 3 fewer than optimized inline assembly. How was this possible?

THE ANALYSIS

Remember the compiler, slaving away optimizing everything for us? Turns out it’s not such a bad guy after all, it just gets a little confused sometimes. The inline assembly meant to “improve” the calculation of an absolute value, sitting between an intermediate calculation and its use just below, interfered with the compiler’s optimization heuristics. Comparing disassemblies between the optimized approaches 2 (3-line C) and 3 (inline assembly) showed that the three extra operations came from moving the value of diff to and from eax. Without the inline assembly mucking things up, the compiler could see that eax was all it needed to work with and just kept the value there. It even kept it there after calculating the absolute value, for use in the computations just below. My two hours of careful work and research had actually just gummed up the works. And while 3 or even 30 operations don’t seem like much, they matter inside a block of code which needs to run 100,000,000 times in 1/60 of a second.

Speaking of those iterations, it turns out even those tweaks didn’t do much. It still ground along until I took a look at the bigger picture. Compared to the 2 hours spent optimizing the inner loop, the 30 minutes I spent changing the algorithm not only made the game playable, it has it humming along at 58 fps. So what did I learn?

  • Work with the compiler, not against it. It makes a great development tool.
  • If you have performance-critical code and can’t figure out which approach to take, take them all and investigate how they turn out.
  • Use the new kid on the block: C. The end result will probably be more elegant, quick and portable than if ASM gets involved.
  • Favour global optimizations over local optimizations EVERY TIME.

Four lessons in two and a half hours, and I feel like I’ve only just scratched the surface.