Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM

Michal Hocko <mhocko@xxxxxxxx> · Wed, 4 Sep 2024 18:46:00 +0200

On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> > On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
> > [...]
> > > However, if we agreed that GFP_NOFAIL meant "only fail if it is not
> > > possible to satisfy this allocation" (and I have been arguing that that
> > > is the only sane meaning) - then that could lead to a lot of error paths
> > > getting simpler.
> > >
> > > Because there are a lot of places where there's essentially no good
> > > reason to bubble up an -ENOMEM to userspace; if we're actually out of
> > > memory the current allocation is just one out of many and not
> > > particularly special, better to let the oom killer handle it...
> > 
> > This is exactly GFP_KERNEL semantic for low order allocations or
> > kvmalloc for that matter. They simply never fail unless couple of corner
> > cases - e.g. the allocating task is an oom victim and all of the oom
> > memory reserves have been consumed. This is where we call "not possible
> > to allocate".
> 
> *nod*
> 
> Which does beg the question of why GFP_NOFAIL exists.

Exactly for the reason that even rare failure is not acceptable and
there is no way to handle it other than keep retrying. Typical code was 
	while (!(ptr = kmalloc()))
		;

Or the failure would be much more catastrophic than the retry loop
taking unbound amount of time.

> > > So the error paths would be more along the lines of "there's a bug, or
> > > userspace has requested something crazy, just shut down gracefully".
> > 
> > How do you expect that to be done? Who is going to go over all those
> > GFP_NOFAIL users? And what kind of guide lines should they follow? It is
> > clear that they believe they cannot handle the failure gracefully
> > therefore they have requested GFP_NOFAIL. Many of them do not have
> > return value to return.
> 
> They can't handle the allocatian failure and continue normal operation,
> but that's entirely different from not being able to handle the
> allocation failure at all - it's not hard to do an emergency shutdown,
> that's a normal thing for filesystems to do.
> 
> And if you scan for GFP_NOFAIL uses in the kernel, a decent number
> already do just that.

It's been quite some time since I've looked the last time. And I am not
saying all the existing ones really require something as strong as
GFP_NOFAIL semantic. If they could be dropped then great! The fewer we
have the better.

But the point is there are some which _do_ need this. We have discussed
that in other email thread where you have heard why XFS and EXT4 does
that and why they are not going to change that model. 

For those users we absolutely need a predictable and well defined
behavior because they know what they are doing.

[...]

> But as a matter of policy going forward, yes we should be saying that
> even GFP_NOFAIL allocations should be checking for -ENOMEM.

I argue that such NOFAIL semantic has no well defined semantic and legit
users are forced to do
	while (!(ptr = kmalloc(GFP_NOFAIL))) ;
or
	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));

So it has no real reason to exist.

We at the allocator level have 2 choices.  Either we tell users they
will not get GFP_NOFAIL and you just do the above or we provide NOFAIL
which really guarantees that there is no failure even if that means the
allocation gets unbounded amount of time. The latter have a slight
advantage because a) you can identify those callers more easily and b)
the allocator can do some heuristics to help those allocations.

We can still discuss how to handle unsupported cases (like GFP_ATOMIC |
__GFP_NOFAIL or kmalloc($UNCHECKED_USER_INPUT_THAT_IS_TOO_LARGE, __GFP_NOFAIL))
but the fact of the Linux kernel is that we have legit users and we need
to optimize for them.

> > Yes, we need to define some reasonable maximum supported sizes. For the
> > page allocator this has been order > 1 and we considering we have a
> > warning about those requests for years without a single report then we
> > can assume we do not have such abusers. for kvmalloc to story is
> > different. Current INT_MAX is just not any practical limit. Past
> > experience says that anything based on the amount of memory just doesn't
> > work (e.g. hash table sizes that used to that scaling and there are
> > other examples). So we should be practical here and look at existing
> > users and see what they really need and put a cap above that.
> 
> Not following what you're saying about hash tables? Hash tables scale
> roughly with the amount of system memory/workingset.

I do not have sha handy but I do remember dcache hashtable scaling with
the amount of memory in the past and that led to GBs of memory allocated
on TB systems. This is not the case anymore I just wanted to mention
that scaling with the amount of memory can get really wrong easily.

> But it seems to me that the limit should be lower if you're on e.g. a 2
> GB machine (not failing with a warning, just failing immediately rather
> than oom killing a bunch of stuff first) - and it's going to need to be
> raised above INT_MAX as large memory machines keep growing, I keep
> hitting it in bcachefs fsck code.

Do we actual usecase that would require more than couple of MB? The
amount of memory wouldn't play any actual role then.

-- 
Michal Hocko
SUSE Labs