On Thu, Dec 15, 2022 at 2:47 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Thu, Dec 15, 2022 at 02:38:28PM -0700, Nico Pache wrote:
> > To expand a little more on the analysis:
> > I computed the latency/throughput between <+24> and <+27> using
> > Intel's manual (APPENDIX D):
> >
> > The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
> > The branch solution shows a total latency of 4 and a throughput of 1.5.
> >
> > Given this is not a tight loop, and the next instruction requires
> > the data computed, better (lower) latency is the more ideal situation.
> >
> > Just wanted to add that little piece :)
>
> I appreciate how hard you're working on this, but it really is straining
> at gnats ;-) For a modern cpu, the most important thing is cache misses
> and avoiding dirtying cachelines. Cycle counting isn't that important
> when an L3 cache miss takes 2000 (or more) cycles.

Haha yeah, I figured so once I saw the results, but I figured I'd share.
We have HPC systems with TiBs of memory, so sometimes gnats matter ;p

The 2-3 extra cycles may turn into 2 million extra cycles on a 2TiB
system full of THPs. I guess that's not a significant number of
cycles either in the grand scheme of things.

Cheers,
-- Nico
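
P.S. A rough sanity check of the "2 million cycles" figure, as a sketch only. It assumes 2 MiB THPs (the usual x86-64 size, not stated in this thread) and that each THP pays the extra 2-3 cycles exactly once. The helper name `extra_cycles` is made up for illustration, and the 2000-cycle L3 miss cost is simply the figure quoted above.

```python
# Back-of-the-envelope estimate; every input here is an assumption.
TIB = 1 << 40
MIB = 1 << 20

def extra_cycles(total_bytes, thp_bytes, cycles_per_thp):
    """Total extra cycles if every THP in memory pays the cost once."""
    return (total_bytes // thp_bytes) * cycles_per_thp

total_mem = 2 * TIB      # 2TiB system, as in the mail
thp_size = 2 * MIB       # assumed THP size (x86-64 PMD-sized page)

thp_count = total_mem // thp_size
low = extra_cycles(total_mem, thp_size, 2)    # 2 extra cycles per THP
high = extra_cycles(total_mem, thp_size, 3)   # 3 extra cycles per THP

# Express the low estimate in L3 misses, using the 2000-cycle figure above.
l3_miss_equiv = low // 2000

print(thp_count, low, high, l3_miss_equiv)
```

Under those assumptions the whole-system cost is on the order of a thousand L3 misses, which is consistent with calling it insignificant.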