On Thu, Dec 15, 2022 at 02:38:28PM -0700, Nico Pache wrote:
> To expand a little more on the analysis:
> I computed the latency/throughput between <+24> and <+27> using
> Intel's manual (APPENDIX D):
>
> The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
> The branch solution shows a total latency of 4 and a throughput of 1.5.
>
> Given this is not a tight loop, and the next instruction requires
> the data computed, better (lower) latency is the more ideal situation.
>
> Just wanted to add that little piece :)

I appreciate how hard you're working on this, but it really is
straining at gnats ;-) For a modern CPU, the most important thing is
cache misses and avoiding dirtying cachelines. Cycle counting isn't
that important when an L3 cache miss takes 2000 (or more) cycles.