Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Tue, 3 Sep 2024 19:53:41 -0400

On Mon, Sep 02, 2024 at 06:32:40PM GMT, Kent Overstreet wrote:
> On Mon, Sep 02, 2024 at 02:52:52PM GMT, Andrew Morton wrote:
> > On Mon, 2 Sep 2024 05:53:59 -0400 Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> > 
> > > On Mon, Sep 02, 2024 at 11:51:48AM GMT, Michal Hocko wrote:
> > > > The previous version has been posted in [1]. Based on the review feedback
> > > > I have sent v2 of patches in the same threat but it seems that the
> > > > review has mostly settled on these patches. There is still an open
> > > > discussion on whether having a NORECLAIM allocator semantic (compare to
> > > > atomic) is worthwhile or how to deal with broken GFP_NOFAIL users but
> > > > those are not really relevant to this particular patchset as it 1)
> > > > doesn't aim to implement either of the two and 2) it aims at spreading
> > > > PF_MEMALLOC_NORECLAIM use while it doesn't have a properly defined
> > > > semantic now that it is not widely used and much harder to fix.
> > > > 
> > > > I have collected Reviewed-bys and reposting here. These patches are
> > > > touching bcachefs, VFS and core MM so I am not sure which tree to merge
> > > > this through but I guess going through Andrew makes the most sense.
> > > > 
> > > > Changes since v1;
> > > > - compile fixes
> > > > - rather than dropping PF_MEMALLOC_NORECLAIM alone reverted eab0af905bfc
> > > >   ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") suggested
> > > >   by Matthew.
> > > 
> > > To reiterate:
> > > 
> > 
> > It would be helpful to summarize your concerns.
> > 
> > What runtime impact do you expect this change will have upon bcachefs?
> 
> For bcachefs: I try really hard to minimize tail latency and make
> performance robust in extreme scenarios - thrashing. A large part of
> that is that btree locks must be held for no longer than necessary.
> 
> We definitely don't want to recurse into other parts of the kernel,
> taking other locks (i.e. in memory reclaim) while holding btree locks;
> that's a great way to stack up (and potentially multiply) latencies.
> 
> But gfp flags don't work with vmalloc allocations (and that's unlikely
> to change), and we require vmalloc fallbacks for e.g. btree node
> allocation. That's the big reason we want MEMALLOC_PF_NORECLAIM.
> 
> Besides that, it's just cleaner, memalloc flags are the direction we
> want to be moving in, and it's going to be necessary if we ever want to
> do a malloc() that doesn't require a gfp flags parameter. That would be
> a win for safety and correctness in the kernel, and it's also likely
> required for proper Rust support.
> 
> And the "GFP_NOFAIL must not fail" argument makes no sense, because a
> failing a GFP_NOFAIL allocation is the only sane thing to do if the
> allocation is buggy (too big, i.e. resulting from an integer overflow
> bug, or wrong context). The alternatives are at best never returning
> (stuck unkillable process), or a scheduling while atomic bug, or Michal
> was even proposing killing the process (handling it like a BUG()!).
> 
> But we don't use BUG_ON() for things that we can't prove won't happen in
> the wild if we can write an error path.
> 
> That is, PF_MEMALLOC_NORECLAIM lets us turn bugs into runtime errors.

BTW, one of the reasons I've been giving this issue so much attention is
because of filesystem folks mentioning that they want GFP_NOFAIL
semantics more widely, and I actually _don't_ think that's a crazy idea,
provided we go about it the right way.

Not having error paths is right out; many allocations when you start to
look through more obscure code have sizes that are controlled by
userspace, so we'd be opening ourselves up to trivially expoitable
security bugs.

However, if we agreed that GFP_NOFAIL meant "only fail if it is not
possible to satisfy this allocation" (and I have been arguing that that
is the only sane meaning) - then that could lead to a lot of error paths
getting simpler.

Because there are a lot of places where there's essentially no good
reason to bubble up an -ENOMEM to userspace; if we're actually out of
memory the current allocation is just one out of many and not
particularly special, better to let the oom killer handle it...

So the error paths would be more along the lines of "there's a bug, or
userspace has requested something crazy, just shut down gracefully".

While we're at it, the definition of what allocation size is "too big"
is something we'd want to look at. Right now it's hardcoded to INT_MAX
for non GFP_NOFAIL and (I believe) 2 pages for GFP_NOFAL, we might want
to consider doing something based on total memory in the machine and
have the same limit apply to both...