Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc allocations

Dave Chinner <david@xxxxxxxxxxxxx> · Sun, 1 Sep 2024 13:35:29 +1000

On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote:
> On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > > according to call path and allocation size, you still need to be
> > > checking for failure - in the same way that you shouldn't be using
> > > BUG_ON() if you cannot prove that the condition won't occur in real world
> > > usage.
> >
> > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > now. This was the default Irix kernel allocator behaviour (it had a
> > forwards progress guarantee and would never fail allocation unless
> > told it could do so). We've been using the same "guaranteed not to
> > fail" semantics on Linux since the original port started 25 years
> > ago via open-coded loops.
> >
> > IOWs, __GFP_NOFAIL semantics have been production tested for a
> > couple of decades on Linux via XFS, and nobody here can argue that
> > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> > as it is used by XFS is reliable and lives up to the "will not fail"
> > guarantee that it is supposed to have.
> >
> > Fundamentally, __GFP_NOFAIL came about to replace the callers doing
> >
> >         do {
> >                 p = kmalloc(size);
> >         } while (!p);
> >
> > so that they blocked until memory allocation succeeded. The call
> > sites do not check for failure, because -failure never occurs-.
> >
> > The MM devs want to have visibility of these allocations - they may
> > not like them, but having __GFP_NOFAIL means it's trivial to audit
> > all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
> > was created with an explicit guarantee that it -will not fail- for
> > normal allocation contexts so it could replace all the open-coded
> > will-not-fail allocation loops.
> >
> > Given this guarantee, we recently removed these historic allocation
> > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> > allocation call sites. There's nearly a hundred memory allocation
> > locations in XFS that are tagged with __GFP_NOFAIL.
> >
> > If we're now going to have the "will not fail" guarantee taken away
> > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> > it be used anywhere else that a "will not fail" guarantee is
> > required.
> >
> > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > can fail due to external scoped memory allocation contexts.  This
> > will force us to revert all __GFP_NOFAIL allocations back to
> > open-coded will-not-fail loops.
> >
> > This is not a step forwards for anyone.
> 
> Hello Dave,
> 
> I've noticed that XFS has increasingly replaced kmem_alloc() with
> __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> occurrences. In kmem_alloc(), there's an explicit
> memalloc_retry_wait() to throttle the allocator under heavy memory
> pressure, which aligns with your filesystem design. However, using
> __GFP_NOFAIL removes this throttling mechanism, potentially causing
> issues when the system is under heavy memory load. I'm concerned that
> this shift might not be a beneficial trend.

AIUI, the memory allocation looping has back-offs already built in
to it when memory reserves are exhausted and/or reclaim is
congested.

e.g.:

get_page_from_freelist()
  (zone below watermark)
  node_reclaim()
    __node_reclaim()
      shrink_node()
        reclaim_throttle()

And the call to reclaim_throttle() will do the equivalent of
memalloc_retry_wait() (a 2ms sleep).
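
For anyone not familiar with the pattern under discussion, here is a
rough userspace sketch of an open-coded will-not-fail loop with a
back-off between attempts. This is not kernel code: try_alloc(),
retry_wait() and alloc_nofail() are hypothetical names made up for
illustration, the allocator failure is simulated with a counter, and
the 2ms delay simply mirrors the figure mentioned above.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical stand-in for an allocator under memory pressure:
 * the next fail_count calls fail, after that malloc() is used.
 */
static int fail_count;

static void *try_alloc(size_t size)
{
	if (fail_count > 0) {
		fail_count--;
		return NULL;
	}
	return malloc(size);
}

/* Back off briefly so the retry loop does not spin at full speed. */
static void retry_wait(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 2 * 1000 * 1000 };

	nanosleep(&ts, NULL);
}

/*
 * Loop until the allocation succeeds, so callers never see NULL.
 * *retries is set to the number of failed attempts before success.
 */
static void *alloc_nofail(size_t size, int *retries)
{
	void *p;

	*retries = 0;
	while (!(p = try_alloc(size))) {
		(*retries)++;
		retry_wait();
	}
	return p;
}
```

The caller of alloc_nofail() does no NULL check, which is the whole
point of the pattern: the waiting happens inside the loop rather than
being pushed out to every call site.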

> We have been using XFS for our big data servers for years, and it has
> consistently performed well with older kernels like 4.19.y. However,
> after upgrading all our servers from 4.19.y to 6.1.y over the past two
> years, we have frequently encountered livelock issues caused by memory
> exhaustion. To mitigate this, we've had to limit the RSS of
> applications, which isn't an ideal solution and represents a worrying
> trend.

If userspace uses all of memory all the time, then the best the
kernel can do is slowly limp along. Preventing userspace from
overcommitting memory to the point of OOM is the only way to avoid
these "userspace wants more memory than the machine physically
has" sorts of issues. i.e. this is not a problem that the kernel
code can solve short of randomly killing userspace applications...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx