Sunday, August 22, 2010

32bit Floats - Integer Accuracy and Interesting Consequences

There are big misconceptions I've seen regarding the accuracy of floats.
One professor of mine even went as far as to say that the computer might not store "2" as "2.0", but rather as "2.000019" when using floats.
This is not true.

32-bit floats can actually represent a large range of integer values exactly.

The exact range of consecutive integers a 32-bit float can represent is -16777216 to 16777216 (that is, -2^24 to 2^24); every integer in that range is stored exactly.
This means that it is roughly equivalent to a 25-bit signed integer, which has the range -16777216 to 16777215.

If you only care about positive values, then the float is equivalent to a 24-bit unsigned integer, which has the range 0 to 16777215.

I commonly see these values messed up, with people saying a float can only represent a 24-bit signed integer. That would be -8388608 to 8388607, which is wrong.
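The limit comes from the 24-bit significand (23 stored bits plus the hidden leading 1). Here is a minimal sketch of a round-trip check, assuming `float` is an IEEE-754 single; the helper name `roundTripsThroughFloat` is made up for illustration:

```cpp
#include <limits>

// Assumes float is IEEE-754 single precision (true on practically every
// desktop compiler, but not guaranteed by the C++ standard).
static_assert(std::numeric_limits<float>::digits == 24,
              "expected a 24-bit significand");

// Hypothetical helper: true if the integer survives a trip through float.
// Only call it with |n| well below 2^31 so the cast back to int is defined.
bool roundTripsThroughFloat(int n) {
    volatile float f = (float)n;   // volatile forces a real 32-bit store
    return (int)f == n;
}
```

With this, 16777217 fails but 16777218 and 33554432 (2^25) pass: above 2^24 a float still hits some integers exactly, just not every one of them.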

Below I made a test case to prove the floating point range by exhaustive search.

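#include <stdio.h>

// Counts up (pos = true) or down (pos = false) from 0 in lockstep using a
// float, a double and an int, and stops at the first value where the float
// no longer matches the other two.
int testFloatRange(bool pos) {
    volatile int i = 0;
    volatile float f = 0.0f;
    volatile double d = 0.0;
    for (;;) {
        volatile double t = (double)(float)i;
        if ((double)f != d || (int)f != i || t != d) break;
        if (pos) { f++; d++; i++; }
        else     { f--; d--; i--; }
    }
    printf("%f != %d\n", f, i);
    return pos ? (i-1) : (i+1);
}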

// _tmain and _TCHAR are MSVC-specific (declared in <tchar.h>)
int _tmain(int argc, _TCHAR* argv[]) {
    int p = testFloatRange(1);
    int n = testFloatRange(0);
    printf("Positive Range = 0 to %d\n", p);
    printf("Negative Range = %d to 0\n", n);
    printf("Full Range = %d to %d\n", n, p);
    return 0;
}

Now it's pretty interesting what happens once a 32-bit float reaches its limit of 16777216.
If you try to increase the float by 1 when it has this value, the float will actually stay the same, because 16777217 is not representable and the result rounds back to 16777216. This means if you increment a float by 1 in a loop, you will never get past 16777216! A loop that waits for a larger value will just run forever.

Here is some proof of that:

int _tmain(int argc, _TCHAR* argv[]) {
    volatile float f = 0xffffff-100;
    for( ; f < 0xffffff+100; f++) {
        // (int&)f reinterprets the float's bits as an int to show them in hex
        printf("Value = %f, Binary Representation (0x%x)\n", f, (int&)f);
    }
    return 0;
}

The program's output ends like this:

Value = 16777214.000000, Binary Representation (0x4b7ffffe)
Value = 16777215.000000, Binary Representation (0x4b7fffff)
Value = 16777216.000000, Binary Representation (0x4b800000)
Value = 16777216.000000, Binary Representation (0x4b800000)
... Keeps repeating the last line infinitely...

Admittedly, I didn't know about this infinite-looping behavior until I made the test case. This is something you should definitely watch out for.
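The reason the increment gets stuck is that the gap between neighboring floats grows to 2 once the value reaches 16777216. Here is a small sketch that measures this, assuming IEEE-754 floats and the default round-to-nearest mode; `gapAbove` and `incrementChanges` are hypothetical helper names:

```cpp
#include <cmath>

// Hypothetical helper: distance from f to the next representable float above it.
float gapAbove(float f) {
    return std::nextafter(f, INFINITY) - f;
}

// Hypothetical helper: does adding 1 actually change the stored value?
bool incrementChanges(float f) {
    volatile float g = f;   // volatile forces the sum to be rounded to 32 bits
    g = g + 1.0f;
    return g != f;
}
```

A check like `incrementChanges(limit)` before using a float as a loop counter is one way to catch this; switching the counter to an int or a double is usually simpler.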

Oh and btw, you might be wondering why I was using "volatile float" in the above test cases instead of just "float". The "volatile" keyword is useful when we need to compare floats at their exact precision. I'll probably explain why in another article :)

Friday, August 20, 2010

Checking if a Float is a Power of 2

In the last article I showed bitwise ways to check if integers are powers of 2. For completeness, I'll show bitwise ways to check if floats are powers of 2.

For floating point numbers, remember that they're represented in the form |S*1|E*8|M*23|.
S = Sign Bit (1 bit)
E = Exponent (8 bits)
M = Mantissa (23 bits)

Check out the IEEE-754 standard for more info.

The exponent field selects a power of 2 (2^(E-127), since the stored exponent is biased by 127), which is then multiplied by the mantissa, which has an implicit hidden '1.' in front of it.
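To make the layout concrete, here is a rough sketch that splits a float into its three fields and rebuilds the value from them. It assumes a 32-bit IEEE-754 `float`; the names `FloatFields`, `splitFloat` and `joinNormal` are invented for this example:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

struct FloatFields {          // hypothetical name, for illustration only
    std::uint32_t sign;       // 1 bit
    std::uint32_t exponent;   // 8 bits, biased by 127
    std::uint32_t mantissa;   // 23 bits, the hidden leading 1 is not stored
};

FloatFields splitFloat(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // copy the raw bit pattern
    FloatFields r = { bits >> 31, (bits >> 23) & 0xff, bits & 0x7fffff };
    return r;
}

// Rebuilds the value of a *normal* float (exponent field not 0 and not 255).
float joinNormal(FloatFields p) {
    float significand = 1.0f + (float)p.mantissa / 8388608.0f; // 1.M, 8388608 = 2^23
    float value = std::ldexp(significand, (int)p.exponent - 127);
    return p.sign ? -value : value;
}
```

For example, -6.0f is -1.5 * 2^2, so it splits into sign 1, exponent 129 and mantissa 0x400000.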

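So when we're checking if a float is a power of two, all we have to do is check that the mantissa is 0 and that the exponent is neither zero nor 255. When the exponent is zero, the value is zero if the mantissa is also zero, or a denormal value if the mantissa is non-zero, so we don't want to consider these as powers of two. When the exponent is 255, the value is infinity (mantissa zero) or NaN (mantissa non-zero), which aren't powers of two either.

The code ends up looking like this:

typedef unsigned __int32 u32;

bool isPow2(float f) {
    u32& i = (u32&)f;        // view the float's raw bits as an integer
    u32 e = (i>>23) & 0xff;  // 8-bit biased exponent
    u32 m = i & 0x7fffff;    // 23-bit mantissa
    return !m && e && e != 0xff; // e == 0xff with m == 0 would be infinity
}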

This, however, will also count negative-exponent powers of 2 (such as 2^-1 = 0.5). If this is not desirable, we can modify the code to only count non-negative exponents (2^0 = 1, 2^1 = 2, ...).

The modified code looks like this:

bool isPow2(float f) {
    u32& i = (u32&)f;
    u32 e = (i>>23) & 0xff;
    u32 m = i & 0x7fffff;
    // 127 is the biased form of exponent 0; 0xff is reserved for infinity/NaN
    return !m && e >= 127 && e != 0xff;
}

One last thing: both of these versions will also consider negated powers of two (such as -4, -2, -1) to be powers of two.
If this is also undesirable, we just need to check the sign bit of the float and return false if it is set.

This last function returns true only if the float is a positive power of two with a non-negative exponent (1, 2, 4, 8, 16...).

bool isPow2(float f) {
    u32& i = (u32&)f;
    u32 s = i>>31;           // sign bit: 1 means negative
    u32 e = (i>>23) & 0xff;
    u32 m = i & 0x7fffff;
    // 0xff is reserved for infinity/NaN, so exclude it
    return !s && !m && e >= 127 && e != 0xff;
}
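The reference cast and `unsigned __int32` in the snippets above are tied to MSVC. As a possible portable alternative (a sketch, not the method used in this article), the bits can be copied out with `memcpy`. The name `isPow2Portable` is made up, and this version also rejects infinity, whose exponent field is 255 with a zero mantissa:

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes 32-bit IEEE-754 floats");

// Hypothetical portable variant: true for 1, 2, 4, 8, ... only.
bool isPow2Portable(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof i);            // well-defined way to read the bits
    std::uint32_t s = i >> 31;                // sign bit
    std::uint32_t e = (i >> 23) & 0xff;       // biased exponent
    std::uint32_t m = i & 0x7fffff;           // mantissa
    return !s && !m && e >= 127 && e != 255;  // 255 means infinity or NaN
}
```

Compilers typically turn the `memcpy` into a single register move, so this should not cost anything compared to the cast.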

Checking if an Integer is a Power of 2

There are a few ways to check if an integer is a power of 2; some better than others.

One crucial thing to notice about power-of-two integers is that their binary representation has exactly 1 bit set.

0001 = 1
0010 = 2
0100 = 4
1000 = 8
... and so on...

So one approach to checking if an integer is a power of two is to just loop through the bits and check that exactly 1 bit is set.

typedef unsigned int uint;

bool isPow2(uint n) {
    const uint len = sizeof(uint)*8; // 32 for 4-byte integers
    uint count = 0;
    for(uint i = 0; i < len; i++) {
        count += n & 1; // add the lowest bit, then shift the next one down
        n >>= 1;
    }
    return count == 1;
}

Now the above approach is pretty obvious and simple, but it's not that nice considering we have to loop over every bit (32 times for 4-byte integers).

There's a popular bitwise trick for determining if an integer is a power of 2, and it looks like this:

bool isPow2(uint n) {
    return (n & (n-1)) == 0;
}

Now this is a lot nicer than our loop version. Instead of looping 32 times, we do just a few bitwise calculations.
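To see why the trick works: subtracting 1 flips the lowest set bit and every bit below it, so ANDing with the original value clears exactly that lowest set bit. For 12 (1100), n-1 is 1011 and the AND gives 1000, which is non-zero, so 12 is not a power of 2. For 8 (1000), n-1 is 0111 and the AND gives 0. A tiny sketch of just that step, with a made-up helper name:

```cpp
// Hypothetical helper: returns n with its lowest set bit cleared.
// A power of two has only one set bit, so the result is 0 for it.
unsigned clearLowestSetBit(unsigned n) {
    return n & (n - 1);
}
```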

I was playing around with bitwise operations earlier today, and I discovered another bitwise method for checking powers of 2.

bool isPow2(uint n) {
    return (n & -n) == n;
}

I was really excited because I found this one out on my own, and I thought I had discovered it. However, I did a Google search on it and found out that this trick is already known :(

The nice thing about this second bitwise version, compared to the more popular version above it, is that it's a lot simpler to remember IMO. Speed-wise, they're both about the same. Which one is actually fastest probably depends on your target architecture, but for practical purposes it probably doesn't matter which one you use.
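The reason the second trick works is that n & -n isolates the lowest set bit of n: in two's complement, -n is ~n + 1, which keeps the lowest set bit and inverts everything above it. If that isolated bit equals n itself, n has only one bit set. A short sketch with a hypothetical helper name; it writes `0u - n` instead of `-n`, which gives the same value for unsigned integers but avoids the warning some compilers emit for negating an unsigned value:

```cpp
// Hypothetical helper: returns only the lowest set bit of n (0 if n is 0).
unsigned lowestSetBit(unsigned n) {
    return n & (0u - n);   // 0u - n wraps around, same bits as -n
}
```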

It's important to mention that both of the bitwise methods above will incorrectly treat zero as a power of two.
To fix this behavior you can modify the functions like so:

bool isPow2(uint n) {
    return n && ((n & (n-1)) == 0);
}

bool isPow2(uint n) {
    return n && ((n & -n) == n);
}

Sadly, this adds an extra conditional to our fast bitwise methods; however, it still ends up being a lot nicer (and faster) than our initial loop version.
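A cheap way to gain confidence in tricks like these is to compare them against the slow but obvious loop over a range of inputs. The sketch below does that; all the function names are invented for this example so the three variants can live side by side:

```cpp
// Reference version: counts the set bits one at a time.
bool isPow2Loop(unsigned n) {
    unsigned count = 0;
    for (; n; n >>= 1) count += n & 1;
    return count == 1;
}

// The two zero-safe bitwise versions under hypothetical names.
bool isPow2And(unsigned n) { return n && ((n & (n - 1)) == 0); }
bool isPow2Neg(unsigned n) { return n && ((n & (0u - n)) == n); }

// Returns how many inputs in [0, limit) the three versions disagree on.
unsigned countDisagreements(unsigned limit) {
    unsigned bad = 0;
    for (unsigned n = 0; n < limit; n++) {
        bool expected = isPow2Loop(n);
        if (isPow2And(n) != expected || isPow2Neg(n) != expected) bad++;
    }
    return bad;
}
```

Calling `countDisagreements(1u << 20)` covers the first million or so values, including zero, and should return 0 if the bitwise versions are right.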