> -----Original Message-----
> From: linux-kernel-owner@xxxxxxxxxxxxxxx
> <linux-kernel-owner@xxxxxxxxxxxxxxx> On Behalf Of Arvind Sankar
> Sent: Wednesday, March 11, 2020 1:33 PM
> To: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
> Cc: Arvind Sankar <nivedita@xxxxxxxxxxxx>; Cannon Matthews
> <cannonmatthews@xxxxxxxxxx>; Matthew Wilcox <willy@xxxxxxxxxxxxx>;
> Andi Kleen <ak@xxxxxxxxxxxxxxx>; Michal Hocko <mhocko@xxxxxxxxxx>;
> Mike Kravetz <mike.kravetz@xxxxxxxxxx>; Andrew Morton
> <akpm@linux-foundation.org>; David Rientjes <rientjes@xxxxxxxxxx>;
> Greg Thelen <gthelen@xxxxxxxxxx>; Salman Qazi <sqazi@xxxxxxxxxx>;
> linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH] mm: clear 1G pages with streaming stores on x86
>
> On Wed, Mar 11, 2020 at 11:16:07AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Mar 10, 2020 at 11:35:54PM -0400, Arvind Sankar wrote:
> > >
> > > The rationale for MOVNTI instruction is supposed to be that it avoids
> > > cache pollution. Aside from the bench that shows MOVNTI to be faster for
> > > the move itself, shouldn't it have an additional benefit in not trashing
> > > the CPU caches?
> > >
> > > As string instructions improve, why wouldn't the same improvements be
> > > applied to MOVNTI?
> >
> > String instructions inherently more flexible. Implementation can choose
> > caching strategy depending on the operation size (cx) and other factors.
> > Like if operation is large enough and cache is full of dirty cache lines
> > that expensive to free up, it can choose to bypass cache. MOVNTI is more
> > strict on semantics and more opaque to CPU.
>
> But with today's processors, wouldn't writing 1G via the string
> operations empty out almost the whole cache? Or are there already
> optimizations to prevent one thread from hogging the L3?
>
> If we do want to just use the string operations, it seems like the
> clear_page routines should just call memset instead of duplicating
> it.
>

The last time I checked, glibc memcpy() chose non-temporal stores based
on transfer size, L3 cache size, and the number of cores. For example,
with glibc-2.26-16.fc27 (August 2017) on a Broadwell system (E5-2699,
36 cores, 45 MiB L3 cache), non-temporal stores only start to be used
for transfers above 36 MiB.
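
To make the size-based choice concrete, here is a rough userspace sketch.
It is not glibc's code and not the kernel's clear_page; the names
NT_THRESHOLD, use_nontemporal() and clear_region() are made up for
illustration, and the cutoff is simply the 36 MiB figure observed above
rather than anything derived from the cache topology. It uses the SSE2
streaming-store intrinsic where available and falls back to memset()
otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#endif

/* Hypothetical cutoff: the 36 MiB figure observed above, not a derived value. */
#define NT_THRESHOLD ((size_t)36 << 20)

/* Returns 1 if a clear of len bytes should bypass the cache. */
static int use_nontemporal(size_t len, size_t threshold)
{
    return len > threshold;
}

/*
 * Zero len bytes at dst. Assumes dst is 16-byte aligned and len is a
 * multiple of 16, which holds for page-sized regions.
 */
static void clear_region(void *dst, size_t len, size_t threshold)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if defined(__SSE2__)
    if (use_nontemporal(len, threshold)) {
        __m128i zero = _mm_setzero_si128();
        char *p = dst;

        for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();  /* streaming stores are weakly ordered */
        return;
    }
#endif
    /* Small region (or no SSE2): let memset pick its own strategy. */
    memset(dst, 0, len);
}
```

With a cutoff like this, a 2 MiB clear would go through memset() and stay
cache-resident, while a 1 GiB clear would take the streaming path.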