On Wed, Mar 11, 2020 at 04:32:47PM -0400, Arvind Sankar wrote:
> On Wed, Mar 11, 2020 at 02:32:41PM -0400, Arvind Sankar wrote:
> > On Wed, Mar 11, 2020 at 11:16:07AM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Mar 10, 2020 at 11:35:54PM -0400, Arvind Sankar wrote:
> > > >
> > > > The rationale for the MOVNTI instruction is supposed to be that it avoids
> > > > cache pollution. Aside from the bench that shows MOVNTI to be faster for
> > > > the move itself, shouldn't it have an additional benefit in not trashing
> > > > the CPU caches?
> > > >
> > > > As string instructions improve, why wouldn't the same improvements be
> > > > applied to MOVNTI?
> > >
> > > String instructions are inherently more flexible. The implementation can
> > > choose a caching strategy depending on the operation size (cx) and other
> > > factors. For example, if the operation is large enough and the cache is
> > > full of dirty cache lines that are expensive to free up, it can choose to
> > > bypass the cache. MOVNTI is stricter in its semantics and more opaque to
> > > the CPU.
> >
> > But with today's processors, wouldn't writing 1G via the string
> > operations empty out almost the whole cache? Or are there already
> > optimizations to prevent one thread from hogging the L3?
>
> Also, currently the stringop is only done 4k at a time, so it would
> likely not trigger any future cache-bypassing optimizations in any case.

What I tried to say is that we need to be careful with this kind of
optimization. We need to see a sizable improvement on something beyond
a microbenchmark, ideally across multiple CPU microarchitectures.

-- 
 Kirill A. Shutemov