On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
> On Fri 14-08-20 06:34:50, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > > > Michal Hocko <mhocko@xxxxxxxx> writes:
> > > > > [...]
> > > > > > > Why should we limit the functionality of the allocator for something
> > > > > > > that is not a real problem?
> > > > > >
> > > > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > > > prepared to handle the fail.
> > > > >
> > > > > If we are to limit the functionality to this one particular user, then
> > > > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > > > easier to have a preallocated pool of pages and use those, completely
> > > > > avoiding the core allocator. That would certainly only shift the
> > > > > complexity to the caller, but if only that single user is expected,
> > > > > it would probably be better than opening a can of worms like an
> > > > > allocator usable from raw spin locks.
> > > >
> > > > Vlastimil raised the same question earlier; I answered, but let me
> > > > answer again:
> > > >
> > > > It is hard to achieve because the logic does not stick to a certain
> > > > static test case, i.e. it depends on how heavily kfree_rcu(single/double)
> > > > is used. Based on that "how heavily", a number of pages are formed
> > > > until the drain/reclaimer thread frees them.
> > >
> > > How many pages are we talking about - ball park? 100s, 1000s?
> >
> > Under normal operation, a couple of pages per CPU, which would make
> > preallocation entirely reasonable.
> > Except that if someone does something that floods RCU callbacks
> > (close(open()) in a tight userspace loop, for but one example), then
> > 2000 per CPU might not be enough, which on a 64-CPU system comes to
> > about 500MB. This is beyond excessive for preallocation on the
> > systems I am familiar with.
> >
> > And the flooding case is where you most want the reclamation to be
> > efficient, and thus where you want the pages.
>
> I am not sure the page allocator would help you with this scenario
> unless you are on very large machines. Pagesets scale with the
> available memory and the percpu_pagelist_fraction sysctl (have a look
> at pageset_set_high_and_batch). It is roughly a 1000th of the zone
> size for each zone. You can check that in /proc/vmstat (my 8G machine)

Small systems might have ~64G. The medium-sized systems might have
~250G. There are a few big ones that might have 1.5T. None of the
/proc/vmstat files from those machines contain anything resembling the
list below, though.

> Node 0, zone      DMA
>   Not interesting at all
> Node 0, zone    DMA32
>   pagesets
>     cpu: 0
>               count: 242
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 355
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 366
>               high:  378
>               batch: 63
> Node 0, zone   Normal
>   pagesets
>     cpu: 0
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 241
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 297
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 227
>               high:  378
>               batch: 63
>
> Besides that, do you need to be per-cpu? Having 1000 pages available
> and managed under your raw spinlock should be good enough already, no?

It needs to be almost entirely per-CPU for performance reasons. Plus a
user could do a tight close(open()) loop on each CPU.

> > This of course raises the question of how much memory the lockless
> > caches contain, but fortunately these RCU callback flooding scenarios
> > also involve process-context allocation of the memory that is being
> > passed to kfree_rcu().
> > That allocation should keep the lockless caches from going empty in
> > the common case, correct?
>
> Yes, those are refilled both on the allocation/free paths. But you
> cannot really rely on that to happen early enough.

So the really ugly scenarios with the tight loops normally allocate
something and immediately either call_rcu() or kfree_rcu() it. But you
are right, someone doing "rm -rf" on a large file tree with lots of
small files might not be doing that many allocations.

> Do you happen to have any numbers that would show the typical usage
> and how often the slow path has to be taken because pcp lists are
> depleted? In other words, even if we provide a functionality to give
> a completely lockless way to allocate memory, how useful is that?

Not yet, but let's see what we can do.

							Thanx, Paul
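[For reference, the figures traded above are easy to sanity-check. The
sketch below assumes 4 KiB pages and uses only numbers already quoted
in the thread: Paul's 2000 cached pages per CPU on a 64-CPU box, and
the per-CPU pageset "high" of 378 pages from Michal's 8G machine.]

```python
# Back-of-the-envelope check of the memory figures discussed above.
# Assumption: 4 KiB pages (typical x86-64 configuration).

PAGE_SIZE = 4096

# Paul's flooding estimate: 2000 cached pages per CPU, 64 CPUs.
flood_bytes = 2000 * 64 * PAGE_SIZE
print(flood_bytes / (1 << 20))  # 500.0 MiB -- "about 500MB"

# Michal's pageset listing: high = 378 pages per CPU per zone, so at
# most ~1.5 MiB can sit in each CPU's pcp list for a given zone.
pcp_high_bytes = 378 * PAGE_SIZE
print(pcp_high_bytes / (1 << 10))  # 1512.0 KiB per CPU per zone
```

So the per-CPU pagesets on a small machine hold three orders of
magnitude less memory than the flooding case can consume, which is the
crux of the disagreement about whether preallocation can cover it.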