Saturday, January 29, 2011

Porting cottonNES

Lately I've been working on porting cottonNES to arm based cell phones, using the Airplay SDK which lets me use C++ and compile native code on a variety of arm-based platforms (iOS phones, android phones, etc...).

My original goal was to get cottonNES working on my iphone and I already accomplished that, but it turned out that a cycle accurate interpreted NES emulator on an iphone 3G runs very slow xD (about 5fps w/o frameskipping).

This slowness just won't do, so I said, "hey now i finally have a reason to code a NES dynarec!" So i went to work studying the ARM instruction set and learned the instruction encodings to code an arm code-emitter (which lets me write arm code to memory buffers so that i can later execute the generated code).

Well it turns out that the iOS on the iphone won't let me execute any code that was dynamically generated. I was using the Airplay SDK and mmap() to allocate a buffer with execution privileges, then emit arm code to it, then execute that buffer; but this ends up crashing on my jailbroken iphone 3G with firmware 3.1.2.

So at that point I was disappointed, and pissed off at fucking apple with their shit iOS. But not all hope was lost since I could still just port my emulator to Android phones which DO let me execute dynamically generated code (and therefor allow the possibility of dynarecs/JIT compilers).

The sad thing is i do not have an android phone to test my code on, but there are emulators i can use to test my code on the PC; and hopefully it will work on the real android phones.

I already had someone try my test-app on a real android phone, which tests the ability to execute dynamically generated code. And they told me the app worked on their droid, so this is good news.

If i don't get bored with the project, this should be the first nintendo emulator with a real dynarec ever AFAIK (there's one emulator i've seen claim to use dynamic recompilation, but it was really doing what i call "function caching", and not a real dynarec).

Anyways, here's a pic of the interpreter version running on one of airplay's simulators:

Wednesday, January 12, 2011

Dynamically allocate aligned memory (aligned_malloc)

The other day on irc i noticed some dolphin-emu developers having a problem where they needed to guarantee 16-byte alignment on some dynamically allocated memory.

You might be thinking: What does that mean? Why is this needed?
Well if you've done any programming and dealt with pointers, you should know that the pointer holds a memory address location which is where the data its pointing to is stored.
That memory address is just a number, and for it to be aligned to a 16-byte boundary means that the memory address needs to be a multiple of 16 (think of all of the computer's memory being divided into 16-byte chunks, what we want is the beginning of one of those chunks).

The reason we want this is because there are some cpu instructions that require memory to be aligned to certain byte boundaries. Many SSE instructions require 16-byte alignment when reading from memory locations. SSE deals with xmm registers which hold 16 bytes of data each, so it's instructions like to work with addresses that are multiples of 16 (there are some SSE instructions that don't need alignment, but they're usually slower). If an SSE instruction that required alignment reads data that is un-aligned, your program will most-likely crash.

So this was exactly the problem the dolphin-emu devs were having. They were using SSE instructions which required alignment, but reading data that was not guaranteed to be aligned. So the end result is it produced random crashes.

So how do we solve this problem?

Well there is an msvc function called _aligned_malloc which you can use to allocate memory aligned to whatever byte boundary you want (as long as its a power of 2).
However _aligned_malloc is non-standard and other compilers like GCC won't have it (afaik).

So what we want is a more portable solution that will work on all modern c/c++ compilers. How do we do this? We code our own!


// Alignment must be power of 2 (1,2,4,8,16...)
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;
}

void aligned_free(void* p) {
if (!p) return;
free((void*)(((uintptr_t*)p)[-1]));
}


Now my code above might scare you. But I'll try to explain what its doing.

The first thing to note is how to use it. Basically you allocate an aligned chunk of memory with the aligned_malloc() function, then when you're done you free it with the aligned_free() function.
So to allocate a 160 byte chunk of memory thats aligned to a 16-byte boundary, you can do this:

void* p = aligned_malloc(160, 16);
... do something ...
aligned_free(p);



Now we get to the part of "fucking aligned_malloc, how does it work!?"
The first thing to note is that when we allocate any chunk of memory and we want to align it to some byte boundary, there are N possible different positions that memory can be in terms of alignment (where N = alignment size)
For example, if we want 4-byte alignment, and we do "void* p = malloc(10);", then the last 2 binary-digits of p are either 00,01,10,or 11. And if the last 2 digits are 00, that means that the memory is already aligned, that means that 1/4 times the memory is aligned already for us. 3/4 times the memory is not-aligned to 4-byte boundary (01,10,11). From this relationship we see that out of N different possibilities, N-1 outcomes will not be aligned.

So to align data we just need to allocated "memory + N-1" amount of space, and offset the memory pointer to point to the first aligned chunk of memory.

There is a trick to doing this, and it looks like this:

// p = address of a random chunk of memory
// N = alignment we want; must be power of 2
aligned_offset = (p + (N-1)) & ~(N-1);


With the above code, aligned_offset will equal the nearest aligned boundary. Try and analyze the code yourself to see how it works (note that N has to be a power of 2).

Now that we know the above trick to get an aligned address from any arbitrary address, we now know the basis of aligned_malloc().

The problem is that we can't just offset the pointer and then return that aligned address; because that means when it comes time to free() or delete[] the data, the pointer pointing to the data will not be correct.
So to solve this problem we use another trick. We can allocate a bit more memory to store the starting address of real allocated memory. Then we place that address somewhere in the memory before the offset pointer we return.
So it looks like this:

| unused_data0 | start_address | aligned_data | unused_data1 |


The pointer we're returning points to the beginning of aligned_data. But before that we have the start_address which points to the beginning of unused_data0 (the start of the allocated buffer).

So with that info, you should be able to analyze aligned_malloc() line-by-line and understand what its doing. If you still don't understand leave a comment with what you're having trouble with.

The last thing to look at is aligned_free. All we're doing when it comes time to free the allocated buffer, is we're grabbing the real start_address of the buffer which was placed immediately before our aligned data, and then freeing the memory where start_address points to.

Again if there's any questions feel free to comment. Hopefully my explanation made sense.

Notes:
In the above functions we're using malloc to allocate the memory. We could instead use c++'s 'new[]' operator to allocate the chunk of memory and 'delete[]' to free it. The reason we didn't is because 'new[]' is c++ only, by using malloc the code works with both c and c++. If you don't care about c-compatibility then you can easily convert the function to use new[]/delete[].

Another thing to note is that the above aligned_malloc is storing the full address of the buffer's starting address. If you limit the alignment max to be 2^15 or less, then you can instead store the difference between the returned offset and the start of the buffer which would only take 2-bytes of extra-storage (instead of sizeof(uintptr_t), which is 4-bytes on 32bit os, and 8-bytes on 64-bit os)

Here is what the code looks like for the space-saving version:

// Alignment must be power of 2 (1,2,4,8,16...2^15)
void* aligned_malloc(size_t size, size_t alignment) {
assert(alignment <= 0x8000);
uintptr_t r = (uintptr_t)malloc(size + --alignment + 2);
uintptr_t o = (r + 2 + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uint16_t*)o)[-1] = (uint16_t)(o-r);
return (void*)o;
}

void aligned_free(void* p) {
if (!p) return;
free((void*)((uintptr_t)p-((uint16_t*)p)[-1]));
}

Thursday, January 6, 2011

Using and understanding the static keyword C++, as well as some interesting tricks

There are just so many things I can talk about here, so its difficult to condense this into a small blog post. I would like to rant about how I hate c++'s header/.cpp translation unit model mess, but I won't in order to keep this shorter.

The first thing to mention is that when you compile a c++ project with multiple .cpp files, what essentially happens is all the .h files included in the .cpp files are expanded and merged with the .cpp file to form a translation unit which the compiler uses to generate an object file. The different translation units don't know about the other .cpp files, so that's why you need function and class prototypes and extern-variables (such as 'extern int;') which tell the compiler/linker that these functions/classes/variables exist in some .cpp file. (This crappy design is the worst thing about c++, and it was inherited from the c programming language, most-likely so that the compiler can save memory when compiling your source code. Newer more advanced languages like c# don't need header files because their compilers are smart enough to detect the different functions/classes/variables in separate files without prototypes).

So because of the above design, we need to use crappy header files in our large c++ projects. There are some tricks I've learned over the years to eliminate the need for many duplicate prototype/declaration/definition mess, although it is not commonly used by other c++ coders and they probably won't like the tricks because its not the common way to program. That said I will probably talk more in detail about this in another article.

Now why is the static keyword useful? Well c++ being the evil bastard that it is, has different meanings for the static keyword depending on how its used. So just talking about 'static' in general is not specific enough.

One use of static is in the file-scope. This tells the compiler that the variable/function is unique to the current translation unit (.cpp file) not visible to other .cpp files.

So if you have:

// source1.cpp
static int x = 10;
int main() { cout << x << endl; return 0; }
// source2.cpp
extern int x;
int foo() { x++; }


The above code will give you a linker error after compiling:
source2.obj : error LNK2001: unresolved external symbol "int x" (?x@@3HA)

If you remove the static keyword, then it will compile. The reason is as mentioned before, static will make the variable unique to the translation unit essentially hiding it from other units.

The properties of a static variables declared at global file scope (not in a function) are that they're initiated at the start of the program, and then they're deconstructed at the end of the program.
So if you were to have a line of code that said "static someClass myClass(10);" and it was in global file scope, then the constructor myClass will be called sometime before the main() function in your program, and its destructor gets called when exiting your program.

One misleading thing c++ does is it changes this behavior when you have nested static variables in function scope.

int foo() {
static someClass myClass(10);
return myClass.someFunct();
}


Now you might expect myClass's constructor to be called at the beginning of the program, but this is not the case. Instead the constructor is called the first time the foo() function is ever called! This is a tricky part of the standard that may cause confusion at first.

The interesting thing about this behavior is that although its tricky and seemingly evil, it also is useful. Specifically it allows us to know the time of initialization of the static nested variables. When static variables are initialized in global file scope, it is unknown which order they get initialized in (relative to other static variables, and this is one reason people think they're evil; because they can be initialized before data they depend on is initialized). Using nested static variables eliminates this problem because we know that the first time we're calling function foo(), the constructor of the static variable is called.

Now to take advantage of this we can declare static variables like this:

// Instead of doing this:
// static someClass myClass(10);

// We do this:
someClass& myClass() {
static someClass _myClass(10);
return _myClass;
}


Now we can use 'myClass()' and we know that the first time its called will be when the constructor is called.


Now moving on, we can have static functions as well.
Static functions behave like static variables at file scope because it creates a unique function for the .cpp file where its defined, and its hidden from the other .cpp files.
You might think it doesn't matter, and it doesn't for the most part, but imagine the following code:

// header.h
static int& getInt() {
static int i = 77;
return i;
}

// source1.cpp
#include "header.h"
void foo();

int main() {
foo();
cout << getInt() << endl;
return 0;
}

// source2.cpp
#include "header.h"
void foo() {
getInt() = 10;
}


What do you think the output of the program will be?
You might think the output should be 10, but its not, its 77.

The reason for this is that we declared the function getInt() as static, that means for both source1.cpp and source2.cpp, it created separate local versions of the getInt() function. So when getInt() is called in source2.cpp's foo(), it is a separate function than getInt() the one source1.cpp sees, so when source1.cpp prints out getInt() it prints its own version which was initialized to 77.

What happens if we remove the 'static' keyword from 'static int& getInt()', well we get linker errors:
1>source1.obj : error LNK2005: "int & __cdecl getInt(void)" (?getInt@@YAAAHXZ) already defined in source2.obj

This is because the function is trying to be defined in both .cpp files so the linker doesn't know which version to use.
The usual way to prevent these problems is to define the function in one .cpp file, and then have the function prototype in the header like so:

// header.h
int& getInt();

// source1.cpp
#include "header.h"
void foo();

int& getInt() {
static int i = 77;
return i;
}

int main() {
foo();
cout << getInt() << endl;
return 0;
}

// source2.cpp
#include "header.h"
void foo() {
getInt() = 10;
}


The program now prints what you expect it to print, "10".


Now that we got that out of the way, we can talk about the static keyword in classes and structs.

When you have a struct/class with a static member variable, you're required to declare it in the struct, but then define it outside the struct.
So you do:

// header.h
struct someStruct {
static int i; // declare
// static int i = 10; // c++ doesn't allow this
};

// source1.cpp
#include "header.h"

// This line must go in only 1 translation unit
// so it should only be put in one .cpp file
int someStruct::i = 10; // define i


There will now be only 1 instance of someStruct::i, and you can use it throughout your code such as 'someStruct::i = 49'.

If you noticed though c++ makes you split the definition of the variable from the declaration; and the definition is put into only 1 .cpp file.
Well there is workaround for this which lets you define it in the same place you declare it.

Before I mention the solution, I'll mention another problem with c++.
What if you wanted to do this:

struct someStruct {
static const int i = 24; // This compiles
static const char* message = "Hello world"; // This doesn't
void foo() { cout << message << " " << i << endl; }
}


The only thing c++ allows us to initialize within a class are "static const int" variables, anything else it requires you to split the initialization part into a .cpp file.

Now whats the solution?
static member functions with static local variables!

Here's cool way to fix the problem:

struct someStruct {
static const int i = 24;
static const char* message() {
static const char* _message = "Hello world";
return _message;
}
void foo() { cout << message() << " " << i << endl; }
}


The evil part about c++ is that static member functions with static local variables behave differently than static file-scope functions with static local variables.

This example will illustrate that:

// header.h
static int& getInt() {
static int i = 77;
return i;
}

struct someStruct {
static int& getInt() {
static int i = 77;
return i;
}
};

// source1.cpp
#include "header.h"
void foo();

int main() {
foo();
cout << getInt() << endl;
cout << someStruct::getInt() << endl;
return 0;
}

// source2.cpp
#include "header.h"
void foo() {
getInt() = 10;
someStruct::getInt() = 10;
}


What do you think the output of the program will be?
Well the first cout with the file-scope function we know will print '77', because this is the same example i gave earlier in this article. But strangely the second line with 'someStruct::getInt()' returns 10!

The reason for this is that static local variables in class member functions will only have 1 instance of themselves, even throughout different translation units!

This is a powerful thing to know and can be used for random tricks.

This last example i'm going to give will show you a trick so that you can declare and define static variables in a header file that are used in different .cpp files, but the variable will be the same throughout all the different translation units (so it doesn't make unique copies like normal static at file-scope does).

// header.h
struct someStruct {
static int& getInt() {
static int i = 77;
return i;
}
};

static int& i = someStruct::getInt();

// source1.cpp
#include "header.h"
void foo();

int main() {
foo();
cout << i << endl;
return 0;
}

// source2.cpp
#include "header.h"
void foo() {
i = 10;
}



The above code prints out 10.
Notice how we didn't have to declare any variable in the individual .cpp files and could declare and initialize the variables all in the header file.
This trick can be useful, so use it wisely.

Hope this article was useful for at least someone :p