Tuesday, March 13, 2012

Table Bitpacking to avoid lookup tables and switches / Reversing Nibbles and Bytes

Problem: Write a c++ function to reverse a nibble (a nibble is 4 bits); you may assume the nibble 'n' is being passed as an unsigned byte and is guaranteed to already be within the valid 0~15 range (so you don't have to have AND it with 15 or have assertions).

A straight-forward solution for this problem would something this:

u8 rev_nibble(u8 n) {
u8 ret = 0;
if (n&1) ret |= 8;
if (n&2) ret |= 4;
if (n&4) ret |= 2;
if (n&8) ret |= 1;
return ret;

A cooler solution is:

u8 rev_nibble(u8 n) {
return ((n & 1)<<3) | ((n & 2)<<1) | ((n & 4)>>1) | ((n & 8)>>3);

Both the above solutions however consist of multiple instructions to accomplish such a simple thing, better solutions might use either a switch or a table:

u8 rev_nibble(u8 n) {
static const u8 ret[16] = { 0x0,0x8,0x4,0xc,0x2,0xa,0x6,0xe,
0x1,0x9,0x5,0xd,0x3,0xb,0x7,0xf };
return ret[n];

Now this last solution uses a trick I came up with (although I'm sure others have to), which I call "Table Bitpacking".

Instead of using a table to store the return values, notice that all return values only take up 4 bits (a nibble), and there's 16 of them, so in total you only need 64bits of data for the table. So we can just use a 64bit immediate value for the table, and shift out the result instead!

u8 rev_nibble(u8 n) {
return (0xf7b3d591e6a2c480ull >> (n*4)) & 15;

Table bitpacking is cool because it can be used for many different permutations of tables; as long as the bits_per_choice * number_of_choices <= available_word_bits;
On 64bit platforms using 64bit tables should be fine, on 32bit platforms you'd want to stick to 32bit tables since 64bit shifts can be expensive.

Whether or not table bitpacking ends up being faster than the normal lookup table depends on the instruction set and the specific circumstances. But IMO its worth knowing the trick for possible future optimizations.

Now using our above function, we can do another common problem:
Problem: Reverse a byte!

Now that we know how to reverse a nibble, reversing a byte should be simple if we use the already existing rev_nibble function:

u8 rev_byte(u8 n) {
return (rev_nibble(n & 15) << 4) | rev_nibble(n>>4);

Note: There are plenty of other ways to reverse a byte, perhaps the fastest is generally a 256 entry lookup-table for each byte permutation (now that table would be 256*256 bits, so too large to use the Table Bitpacking trick on current machines).

Just thought I'd throw in one more cool way to reverse a nibble:

u8 rev_nibble(u8 n) {
return "aiemckgobjfndlhp"[n] - 'a';


  1. Yup, this is occasionally useful. I've used this trick on SPU - there you have 128-bit registers and fast bitshifts across the whole 128 bit value; making a fast lookup table requires 16 bytes per element because of load addressing restrictions, so it's usually a win both in time and space.

    1. hmm good point; like you mentioned some architectures such as the ps3's SPU processors require data to be aligned to some boundary, and so lookup tables can waste a lot of space, especially when you're only storing a couple bits per entry.
      I didn't know the SPU's had 128bit shifts, that's useful :D