Comments on Trash Can of Code: "Small Comparison Optimization = Big Gains"

cottonvibes — 2012-06-18:
Hey Martin, sorry for the late reply; check out the Instruction Tables manual here:
http://www.agner.org/optimize/

It lists the throughput and latency of x86 instructions for a bunch of different CPUs. I also recommend downloading all the manuals on that page; Agner's manuals are very helpful for optimization.

Also, I noticed the manual says 6~32, not 6~38, as the latency for divsd/divpd on the Core 2 Duo (Merom). So I guess I made a typo (or maybe an older manual I read had the wrong info). Anyway, I'll go ahead and update my post.

Martin Kunev — 2012-06-04:
You say "On Core 2 Duos, double floating-point division has a 6~38 clock cycle latency (exact latency depends on the operand values), whereas multiplication is only 5 cycles, and add/sub are 3 cycles."

Can you point me to the place where you found these exact values?

cottonvibes — 2010-09-04:
You're right Dark Sylinc, I should have considered that.

I was using the Precise model; when I changed it to "Fast", the Div-version's generated output still seemed to do a div somewhere, but its runtime was almost exactly the same as the Mul-version's. Both were around ~2.47 seconds.

The interesting thing is that if I enable SSE2 optimizations (with the Fast model), the compiler optimizes the Div-version down to ~2.053 seconds, while the Mul-version ends up taking ~2.291 seconds.

From what I understand of the generated asm, the Div-version uses a combination of div + mul instructions in a confusing loop so that it can use the pipeline effectively, whereas the Mul-version just uses muls and is thus limited by SSE mul throughput.

So I suppose the conclusion is: with the "Fast" FP model, it doesn't matter which you use (though I wouldn't rely on the compiler to interleave divs/muls as effectively as it did in this specific SSE2 test).

Most people will use the Precise FP model since it is the default, so I believe that when you care about speed you should still explicitly use the Mul-version.

Unknown — 2010-09-03:
I'm curious which compiler floating-point "model" optimization option you were using. In MSVC you have three: strict, precise, and fast.

I'm betting that with "fast" the compiler automatically converts the division into a multiplication.

Under precise and strict the compiler won't perform that optimization, because A / B is not always equal to A * (1 / B) due to how floating point works. But with "fast" you're telling the compiler you don't care about those small differences.