Re: [PATCH] mm: clear 1G pages with streaming stores on x86

"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> · Thu, 12 Mar 2020 03:52:12 +0300

On Wed, Mar 11, 2020 at 04:32:47PM -0400, Arvind Sankar wrote:
> On Wed, Mar 11, 2020 at 02:32:41PM -0400, Arvind Sankar wrote:
> > On Wed, Mar 11, 2020 at 11:16:07AM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Mar 10, 2020 at 11:35:54PM -0400, Arvind Sankar wrote:
> > > > 
> > > > The rationale for MOVNTI instruction is supposed to be that it avoids
> > > > cache pollution. Aside from the bench that shows MOVNTI to be faster for
> > > > the move itself, shouldn't it have an additional benefit in not trashing
> > > > the CPU caches?
> > > > 
> > > > As string instructions improve, why wouldn't the same improvements be
> > > > applied to MOVNTI?
> > > 
> > > String instructions inherently more flexible. Implementation can choose
> > > caching strategy depending on the operation size (cx) and other factors.
> > > Like if operation is large enough and cache is full of dirty cache lines
> > > that expensive to free up, it can choose to bypass cache. MOVNTI is more
> > > strict on semantics and more opaque to CPU.
> > 
> > But with today's processors, wouldn't writing 1G via the string
> > operations empty out almost the whole cache? Or are there already
> > optimizations to prevent one thread from hogging the L3?
> 
> Also, currently the stringop is only done 4k at a time, so it would
> likely not trigger any future cache-bypassing optimizations in any case.

What I tried to say is that we need to be careful with this kind of
optimizations. We need to see a sizable improvement on something beyond
microbenchmark, ideally across multiple CPU microarchitectures.

-- 
 Kirill A. Shutemov