Wednesday, January 12, 2011

Dynamically allocate aligned memory (aligned_malloc)

The other day on irc i noticed some dolphin-emu developers having a problem where they needed to guarantee 16-byte alignment on some dynamically allocated memory.

You might be thinking: What does that mean? Why is this needed?
Well if you've done any programming and dealt with pointers, you should know that the pointer holds a memory address location which is where the data its pointing to is stored.
That memory address is just a number, and for it to be aligned to a 16-byte boundary means that the memory address needs to be a multiple of 16 (think of all of the computer's memory being divided into 16-byte chunks, what we want is the beginning of one of those chunks).

The reason we want this is because there are some cpu instructions that require memory to be aligned to certain byte boundaries. Many SSE instructions require 16-byte alignment when reading from memory locations. SSE deals with xmm registers which hold 16 bytes of data each, so it's instructions like to work with addresses that are multiples of 16 (there are some SSE instructions that don't need alignment, but they're usually slower). If an SSE instruction that required alignment reads data that is un-aligned, your program will most-likely crash.

So this was exactly the problem the dolphin-emu devs were having. They were using SSE instructions which required alignment, but reading data that was not guaranteed to be aligned. So the end result is it produced random crashes.

So how do we solve this problem?

Well there is an msvc function called _aligned_malloc which you can use to allocate memory aligned to whatever byte boundary you want (as long as its a power of 2).
However _aligned_malloc is non-standard and other compilers like GCC won't have it (afaik).

So what we want is a more portable solution that will work on all modern c/c++ compilers. How do we do this? We code our own!

// Alignment must be power of 2 (1,2,4,8,16...)
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;

void aligned_free(void* p) {
if (!p) return;

Now my code above might scare you. But I'll try to explain what its doing.

The first thing to note is how to use it. Basically you allocate an aligned chunk of memory with the aligned_malloc() function, then when you're done you free it with the aligned_free() function.
So to allocate a 160 byte chunk of memory thats aligned to a 16-byte boundary, you can do this:

void* p = aligned_malloc(160, 16);
... do something ...

Now we get to the part of "fucking aligned_malloc, how does it work!?"
The first thing to note is that when we allocate any chunk of memory and we want to align it to some byte boundary, there are N possible different positions that memory can be in terms of alignment (where N = alignment size)
For example, if we want 4-byte alignment, and we do "void* p = malloc(10);", then the last 2 binary-digits of p are either 00,01,10,or 11. And if the last 2 digits are 00, that means that the memory is already aligned, that means that 1/4 times the memory is aligned already for us. 3/4 times the memory is not-aligned to 4-byte boundary (01,10,11). From this relationship we see that out of N different possibilities, N-1 outcomes will not be aligned.

So to align data we just need to allocated "memory + N-1" amount of space, and offset the memory pointer to point to the first aligned chunk of memory.

There is a trick to doing this, and it looks like this:

// p = address of a random chunk of memory
// N = alignment we want; must be power of 2
aligned_offset = (p + (N-1)) & ~(N-1);

With the above code, aligned_offset will equal the nearest aligned boundary. Try and analyze the code yourself to see how it works (note that N has to be a power of 2).

Now that we know the above trick to get an aligned address from any arbitrary address, we now know the basis of aligned_malloc().

The problem is that we can't just offset the pointer and then return that aligned address; because that means when it comes time to free() or delete[] the data, the pointer pointing to the data will not be correct.
So to solve this problem we use another trick. We can allocate a bit more memory to store the starting address of real allocated memory. Then we place that address somewhere in the memory before the offset pointer we return.
So it looks like this:

| unused_data0 | start_address | aligned_data | unused_data1 |

The pointer we're returning points to the beginning of aligned_data. But before that we have the start_address which points to the beginning of unused_data0 (the start of the allocated buffer).

So with that info, you should be able to analyze aligned_malloc() line-by-line and understand what its doing. If you still don't understand leave a comment with what you're having trouble with.

The last thing to look at is aligned_free. All we're doing when it comes time to free the allocated buffer, is we're grabbing the real start_address of the buffer which was placed immediately before our aligned data, and then freeing the memory where start_address points to.

Again if there's any questions feel free to comment. Hopefully my explanation made sense.

In the above functions we're using malloc to allocate the memory. We could instead use c++'s 'new[]' operator to allocate the chunk of memory and 'delete[]' to free it. The reason we didn't is because 'new[]' is c++ only, by using malloc the code works with both c and c++. If you don't care about c-compatibility then you can easily convert the function to use new[]/delete[].

Another thing to note is that the above aligned_malloc is storing the full address of the buffer's starting address. If you limit the alignment max to be 2^15 or less, then you can instead store the difference between the returned offset and the start of the buffer which would only take 2-bytes of extra-storage (instead of sizeof(uintptr_t), which is 4-bytes on 32bit os, and 8-bytes on 64-bit os)

Here is what the code looks like for the space-saving version:

// Alignment must be power of 2 (1,2,4,8,16...2^15)
void* aligned_malloc(size_t size, size_t alignment) {
assert(alignment <= 0x8000);
uintptr_t r = (uintptr_t)malloc(size + --alignment + 2);
uintptr_t o = (r + 2 + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uint16_t*)o)[-1] = (uint16_t)(o-r);
return (void*)o;

void aligned_free(void* p) {
if (!p) return;

No comments:

Post a Comment