On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>
> But I still fail to understand why my micro-benchmark is faster with a
> branch before the store compared to a plain store.

Very tight artificial loops like that tend to be horrible for performance analysis on modern cores, because you end up seeing mostly random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens is that the store buffer continually fills up faster than the stores get drained to cache. So then the stores start stalling.

The extra load - the one you expect to slow things down - likely ends up effectively just prefetching the hot L2 cacheline into L1, so that the store buffer then drains more cleanly. And both the load and the branch are effectively free, because the branch predicts perfectly, and the load just prefetches a cacheline that will have to be fetched for the subsequent store buffer drain anyway.

And as you say, there are no cacheline bouncing issues, and the working set presumably fits in the caches - even if it doesn't fit in the L1.

But that's all just a wild guess. It could equally easily be some very specific microarchitectural store buffer stall due to the simpler loop hitting just the right cycle count between stores.

There are all kinds of odd small corner cases that are generally very rare and hidden in the noise, but a loop with just the right strides can happen to hit them exactly. It used to be *trivial* to hit things like address generation stalls, and even though modern Intel CPUs tend to be quite robust performance-wise, it's not true that they always handle any code sequence "perfectly".
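For reference, a minimal userspace sketch of the two loop variants under discussion - the 64-byte line size and the function names are assumptions for illustration, not Kirill's actual benchmark:

#include <stddef.h>

#define CACHELINE	64

/* Plain store: unconditionally write one word per cacheline. */
static void store_plain(unsigned long *buf, size_t words)
{
	size_t step = CACHELINE / sizeof(*buf);
	size_t i;

	for (i = 0; i < words; i += step)
		buf[i] = 0;
}

/*
 * Branch before store: load first, store only if the value differs.
 * The load effectively prefetches the line the store needs anyway,
 * and the branch predicts perfectly when the data is uniform.
 */
static void store_branch(unsigned long *buf, size_t words)
{
	size_t step = CACHELINE / sizeof(*buf);
	size_t i;

	for (i = 0; i < words; i += step)
		if (buf[i] != 0)
			buf[i] = 0;
}

Timing the two over a buffer that fits in the outer caches but not in the L1 (a few MB, say) is where the store-buffer-drain effect described above would show up, if the guess is right.

Linus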