Saturday, July 31, 2010

Small Comparison Optimization = Big Gains

This is a small and pretty well-known trick, but it's worth mentioning since a lot of high-level coders don't seem to realize it, and it can result in some big speedups.

Let's take a look at a random code snippet:

#include <iostream>
using namespace std;

int main() {
    const double len = 10000;
    const double fract = 1.0/3.0;
    for (double i = 1; i < len; i++) {
        for (double j = 1; ; j++) {
            if (j/i < fract) {
                cout << j << " / " << i << endl;
            }
            else break;
        }
    }
}


What this code does is print out every fraction j/i that is less than 1/3, for all denominators i less than 10,000.

Notice anything that we can optimize?

Well, if you take a look at the if-statement, notice we are dividing j by i and then doing a comparison. But as we should know by now, division is a slow operation.

On Core 2 Duos, double floating-point division has a 6~32 clock cycle latency (exact latency depends on the operand values), whereas multiplication is only 5 cycles, and add/sub are 3 cycles.

So it would be nice if we could omit that division from the comparison... oh wait, we can!

Using simple mathematics we can do:

j/i < fract            (multiply both sides by i; since i is positive here, the inequality keeps its direction)
i * (j/i) < fract * i
j < fract * i


Now we have the same comparison, but we were able to remove the division completely!
So in the worst case, where our division was taking ~32 clock cycles, this new version is ~6 times faster.

Now let's see this code in action.
Here is the same problem as above, except we've now increased 'len' to 100,000, and instead of printing out the results, we just count the fractions (this way we're not limited by console-write speed):

#include <iostream>
using namespace std;

int main() {
    const double len = 100000;
    const double fract = 1.0/3.0;
    int ret = 0;
    for (double i = 1; i < len; i++) {
        for (double j = 1; ; j++) {
            if (j < fract * i) {
                ret++;
            }
            else break;
        }
    }
    cout << ret << endl;
}



When I ran the above code using "if (j/i < fract)", it took ~19 seconds to finish on my PC.
When I ran the optimized version using "if (j < fract*i)", it took ~2 seconds to finish executing! That's an even bigger speedup than predicted.

I hope this shows why optimization is important.

One last thing to note:
When using integers, you might not always want to do this trick.
j/i > x is not necessarily the same as j > x*i when using integers, since the first one truncates the quotient (obviously, since integers don't hold the fractional part).
For example:

#include <iostream>
using namespace std;

int main() {
    int i = 21;
    int j = 2;

    if (i/j > 10) cout << "First one" << endl;
    if (i > 10*j) cout << "Second one" << endl;
}


The above example only prints "Second one", since with integers i/j truncates to 10 (if i and j were doubles, i/j would be 10.5 and it would print both "First one" and "Second one").

This isn't really a negative: when using integers you probably want to use the second method anyway, so you don't have to deal with converting to floating point. In that case, you not only save time by not doing a division, but also by not doing an int-to-float conversion.

4 comments:

  1. I'm curious what compiler floating-point "model" optimization option you were using.
    In MSVC you have 3: Strict, Precise, and Fast.

    I'm betting using "fast" the compiler automatically converts the division into the multiplication.

    Under Precise and Strict the compiler won't optimize that, because A / B < C is not always the same as A < C * B due to how floating point works.
    But when using "Fast" you're telling the compiler you don't care about those small differences.

  2. You're right Dark Sylinc, I should have considered that.

    I was using the Precise model. When I changed it to "Fast", the Div-version's generated output seemed to still do a div somewhere, but its runtime was almost exactly the same as the Mul-version's.
    Both were around ~2.47 seconds.

    The interesting thing is that if I enable SSE2 optimizations (with the Fast model), the compiler optimizes the Div-version to ~2.053 seconds, while the Mul-version ends up taking ~2.291 seconds.

    From what I understand of the generated asm code, the div-version seems to be using a combination of div + mul instructions in a confusing loop, so that it can use the pipeline effectively.
    Whereas the mul-version just uses muls, and thus is limited by the SSE mul throughput.

    So I suppose the conclusion is that on the "Fast" FP model it doesn't matter which you use (though I wouldn't rely on the compiler to interleave divs/muls as effectively as it did in this specific SSE2 test).

    Most people will use the Precise FP model since that is the default, so I believe when you care about speed you should still explicitly use the Mul-version.

  3. You say "On Core 2 Duos, double floating-point division has a 6~38 clock cycle latency (exact latency depends on the operand values), whereas multiplication is only 5 cycles, and add/sub are 3 cycles."

    Can you point me to the place where you found these exact values?

  4. Hey Martin, sorry for the late reply; check out the Instruction Tables manual here:
    http://www.agner.org/optimize/

    It lists the throughput/latency of x86 instructions for a bunch of different CPUs. I also recommend downloading all the manuals on that page; Agner's manuals are very helpful for optimization.

    Also I noticed the manual says 6~32, not 6~38 as the latency for c2d Merom's divsd/divpd. So I guess I made a typo (or maybe an older manual I read had the wrong info). Anyways I'll go ahead and update my post.
