For example, you can do:

// Normal Multiplication

void mul(int& x) {

x = x * 4;

}

// Optimized Multiplication

void mul(int& x) {

x = x << 2;

}

And for division:

// Normal Division

void div(unsigned int& x) {

x = x / 4;

}

// Optimized Division

void div(unsigned int& x) {

x = x >> 2;

}

Unlike the modulus optimization trick, the multiplication trick does work for both signed and unsigned variables, so the compiler is safe to optimize these cases for you.

In the case of the division trick there is a problem that may happen if you have a negative signed number. If you divide a negative integer and the result should have a remainder, then using shifts may give you the wrong result (for example, -1/2 should be 0, but -1>>1 is -1).

Compilers are able to optimize signed division by powers of two using shifts and interesting logic to account for the possible errors.

For the reason that compilers will make the necessary optimizations for you, it is generally better to leave the code as normal multiplications and divisions for readability and to prevent mistakes.

If you're dealing with low level assembly, you'd obviously want to be doing the shifts where possible instead of using mul and div.

Hi!

ReplyDeleteI've just discovered your blog and it's pretty interesting.

I just wanted to point something about this optimization trick.

The division being replaced by a shift is an optimization for sure.

However, the multiplication case "it depends".

The "mul" instruction doesn't take long. In fact there's little performance gain over shift.

However, the most important part is instruction pairing.

At least until the P4, when instruction pairing, many processors can't pair two shifts in a row. Unfortunately the instruction pairing section seems to have been removed from the Intel System Programming Manual; so I don't know if it remains true for recent processors.

In other words:

//These lines are executed concurrently. Is as fast as executing only one line

C = A * 2;

D = B * 2;

//These lines are executed concurrently. Is as fast as executing only one line

C = A * 2;

D = B << 1;

//These lines are serialized. Takes two cycles or more (slower)

C = A << 1;

D = B << 1;

The reason for this is that there is only one shifting unit found in the CPU pipelines.

If you're interested about this (and want to try if it still applies) check out the old Intel manual:

http://www.intel.com/design/pentiumii/manuals/243192.htm

Cheers

Dark Sylinc

Hey Dark Sylinc,

ReplyDeleteThanks for the interesting comment, although I'm pretty sure it no-longer applies on core 2 duo processors.

The reciprocal throughput of shifts is 0.5 cycles, while imul is 1 cycle.

The latency of the shift is 1 cycle, and the imul is 3.

So looks like its always better to use shifts for those processors.

Check out the Instruction tables manual here:

http://www.agner.org/optimize/

(Also every manual on that site is very nice so I recommend them all)