Comments on Trash Can of Code: "Small Comparison Optimization = Big Gains"

cottonvibes — 2012-06-18:
Hey Martin, sorry for the late reply; check out the Instruction Tables manual here:
http://www.agner.org/optimize/

It lists the throughput and latency of x86 instructions for a bunch of different CPUs. I also recommend downloading all the manuals on that page; Agner's manuals are very helpful for optimization.

Also, I noticed the manual says 6~32, not 6~38, as the latency for divsd/divpd on the Core 2 Duo (Merom). So I guess I made a typo (or maybe an older manual I read had the wrong info). Anyway, I'll go ahead and update my post.

Martin Kunev — 2012-06-04:
You say "On Core 2 Duos, double floating-point division has a 6~38 clock cycle latency (exact latency depends on the operand values), whereas multiplication is only 5 cycles, and add/sub are 3 cycles."

Can you point me to the place where you found these exact values?

cottonvibes — 2010-09-04:
You're right Dark Sylinc, I should have considered that.

I was using the Precise model; when I changed it to "Fast", the Div-version's generated output still seemed to do a div somewhere, but its runtime was almost exactly the same as the Mul-version's. Both were around ~2.47 seconds.

The interesting thing is that if I enable SSE2 optimizations (with the Fast model), the compiler optimizes the Div-version down to ~2.053 seconds, while the Mul-version ends up taking ~2.291 seconds.

From what I understand of the generated asm, the Div-version uses a combination of div + mul instructions in a confusing loop so that it can use the pipeline effectively, whereas the Mul-version just uses muls and is thus limited by SSE mul throughput.

So I suppose the conclusion is: with the "Fast" FP model, it doesn't matter which you use (though I wouldn't rely on the compiler to interleave divs/muls as effectively as it did in this specific SSE2 test).

Most people will use the Precise FP model since it is the default, so I believe that when you care about speed you should still explicitly use the Mul-version.

Unknown — 2010-09-03:
I'm curious which compiler floating-point "model" optimization option you were using. In MSVC you have three: strict, precise, and fast.

I'm betting that with "fast" the compiler automatically converts the division into a multiplication.

Under precise and strict the compiler won't perform that optimization, because A / B is not always equal to A * (1 / B) due to how floating point works. But with "fast" you're telling the compiler you don't care about those small differences.