Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 13 Feb 2024 09:07:03 +1100

On Mon, Feb 12, 2024 at 02:30:02PM -0500, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 03:35:33PM +1100, Dave Chinner wrote:
> > On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> > > That's because in general most code in the IO path knows how to make
> > > effective use of biosets and mempools (which may take some work! you
> > > have to ensure that you're always able to make forward progress when
> > > memory is limited, and in particular that you don't double allocate from
> > > the same mempool if you're blocking the first allocation from
> > > completing/freeing).
> > 
> > Yes, I understand this, and that's my point: NOIO context tends to
> > be able to use mempools and other mechanisms to prevent memory
> > allocation failure, not NOFAIL.
> > 
> > The IO layers are request based and that enables one-in, one out
> > allocation pools that can guarantee single IO progress. That's all
> > the IO layers need to guarantee to the filesystems so that forwards
> > progress can always be made under memory pressure.
> > 
> > However, filesystems cannot guarantee "one in, one out" allocation
> > behaviour. A transaction can require a largely unbounded number of
> > memory allocations to succeed to make progress through to
> > completion, and so things like mempools -cannot be used- to prevent
> > memory allocation failures whilst providing a forwards progress
> > guarantee.
> 
> I don't see that that's actually true. There's no requirement that
> arbitrarily large IOs must be done atomically, within a single
> transaction:

*cough*

Metadata IO needs to be set up, issued and completed before the
transaction can make progress, and then the transaction will hold
onto that metadata until it is committed and unlocked.

That means we hold every btree buffer we walk along a path in the
transaction, and if the cache is cold it means we might need to
allocate and read dozens of metadata buffers in a single
transaction.

> there's been at most talk of eventually doing atomic writes
> through the pagecache, but the people on that can't even finish atomic
> writes through the block layer, so who knows when that'll happen.

What's atomic data writes got to do with metadata transaction
contexts?

> I generally haven't been running into filesystem operations that require
> an unbounded number of memory allocations (reflink is a bit of an
> exception in the current bcachefs code, and even that is just a
> limitation I could solve if I really wanted to...)

Step outside of bcachefs for a minute, Kent. Not everything works
the same way or has the same constraints and/or freedoms as the
bcachefs implementation....

> > Hence a NOFAIL scope is useful at the filesystem layer for
> > filesystem objects to ensure forwards progress under memory
> > pressure, but it is completely unnecessary once we transition to the
> > IO layer where forwards progress guarantees ensure memory allocation
> > failures don't impede progress.
> > 
> > IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
> > layers. The entry points to the block layer should transition the
> > task to NOIO context and restore the previous context on exit. Then
> > it becomes relatively trivial to apply context based filtering of
> > allocation behaviour....
> > 
> > > > i.e. NOFAIL scopes are not relevant outside the subsystem that sets
> > > > it.  Hence we likely need helpers to clear and restore NOFAIL when
> > > > we cross allocation context boundaries. e.g. as we cross from
> > > > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > > > they should be doing something like:
> > > > 
> > > > 	nofail_flags = memalloc_nofail_clear();
> > > 
> > > NOFAIL is not a scoped thing at all, period; it is very much a
> > > _callsite_ specific thing, and it depends on whether that callsite has a
> > > fallback.
> > 
> > *cough*
> > 
> > As I've already stated, NOFAIL allocation has been scoped in XFS for
> > the past 20 years.
> > 
> > Every memory allocation inside a transaction *must* be NOFAIL unless
> > otherwise specified because memory allocation inside a dirty
> > transaction is a fatal error.
> 
> Say you start to incrementally mempoolify your allocations inside a
> transaction - those mempools aren't going to do anything if there's a
> scoped NOFAIL, and sorting that out is going to get messy fast.

How do you mempoolify something that can have thousands of
concurrent contexts with in-flight objects across multiple
filesystems that might get stashed in a LRU rather than freed when
finished with? Not to mention that each context has an unknown
demand on the mempool before it can complete and return objects to
the mempool?
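
To make the demand problem concrete, here is a small userspace
simulation. It is only a sketch under stated assumptions, not kernel
code: pool_sim_run() and its round-robin scheduling are hypothetical,
and it models a fixed reserve shared by contexts that each return
their objects only once they hold everything they need.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIM_MAX_CTX 16

/*
 * Hypothetical model of a fixed-size reserve. Each context takes one
 * object per turn and returns all of its objects only once it holds
 * need[i] of them. Returns how many contexts ran to completion before
 * the pool stalled (nctx if every context completed).
 */
static size_t pool_sim_run(size_t reserve, const size_t *need, size_t nctx)
{
	size_t held[POOL_SIM_MAX_CTX] = { 0 };
	int done[POOL_SIM_MAX_CTX] = { 0 };
	size_t free_objs = reserve;
	size_t completed = 0;

	if (nctx > POOL_SIM_MAX_CTX)
		return 0;

	for (;;) {
		int progressed = 0;

		for (size_t i = 0; i < nctx; i++) {
			if (done[i])
				continue;
			/* Take one more object if the reserve has any left. */
			if (held[i] < need[i] && free_objs > 0) {
				held[i]++;
				free_objs--;
				progressed = 1;
			}
			/* Only a finished context gives its objects back. */
			if (held[i] == need[i]) {
				free_objs += held[i];
				held[i] = 0;
				done[i] = 1;
				completed++;
				progressed = 1;
			}
		}
		/* Stop when everything finished or nobody could move. */
		if (completed == nctx || !progressed)
			return completed;
	}
}
```

With a reserve of 4, four contexts that each need one object all
complete, which is the "one in, one out" case. Two contexts that each
need three objects both stall holding two, even though either would
have completed if it had run alone.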

We talked about this a decade ago at LSFMM (2014, IIRC) with the MM
developers and nothing about mempools has changed since.

> > However, that scoping has never been
> > passed to the NOIO contexts below the filesystem - it's scoped
> > purely within the filesystem itself and doesn't pass on to other
> > subsystems the filesystem calls into.
> 
> How is that managed?

Our own internal memory allocation wrappers. Go look at what remains
in fs/xfs/kmem.c. See the loop there in kmem_alloc()? It's
guaranteeing NOFAIL behaviour unless KM_MAYFAIL is passed to the
allocation. Look at xlog_kvmalloc() - same thing, except it is
always run within transaction context (the "xlog" prefix is a
giveaway) and so will block until allocation succeeds.
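
For readers without the source to hand, the shape of such a wrapper
can be sketched in userspace C. This is an illustration of the
pattern only, not the XFS code: sk_alloc(), SK_MAYFAIL and the
failure-injection hook are hypothetical names, and malloc() stands in
for the underlying allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag modelled on the KM_MAYFAIL idea; not the XFS definition. */
#define SK_MAYFAIL 0x1u

/* Test hook: number of upcoming allocation attempts that should fail. */
static int sk_fail_next;
/* Counts how many times the wrapper had to retry. */
static int sk_retries;

/* Stand-in for the underlying allocator; fails while sk_fail_next > 0. */
static void *sk_try_alloc(size_t size)
{
	if (sk_fail_next > 0) {
		sk_fail_next--;
		return NULL;
	}
	return malloc(size);
}

/*
 * Never-fail by default: loop until the allocation succeeds.
 * Only callers that pass SK_MAYFAIL ever see NULL.
 */
static void *sk_alloc(size_t size, unsigned int flags)
{
	void *ptr;

	for (;;) {
		ptr = sk_try_alloc(size);
		if (ptr || (flags & SK_MAYFAIL))
			return ptr;
		sk_retries++;	/* a real wrapper would wait or back off here */
	}
}
```

The point of the pattern is that the default is the safe one: a
caller has to opt in to seeing a failure, so forgetting a flag cannot
introduce a NULL return inside a transaction.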

IOWs, we scoped everything by having our own internal allocation
wrappers that never fail. In removing these wrappers (which is where
my "scoped NOFAIL" comments in this thread originated from) the
proliferation of __GFP_NOFAIL annotations across the code meant we went from
pretty much zero usage of __GFP_NOFAIL to having almost a hundred
allocation sites annotated with __GFP_NOFAIL. And it adds a
maintenance landmine for us - we now have to ensure that all future
allocations within a transaction scope are marked __GFP_NOFAIL.

Hence I'm looking for ways to move this NOFAIL scoping into the
generic memory allocation code to replace the scoping we currently
have via subsystem-specific allocation wrappers.
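
As a rough userspace model of what that could look like (a sketch
only; no such generic API exists today as far as this thread is
concerned, and every scope_* name below is hypothetical): the scope
is a piece of per-task state that the allocator consults, the
filesystem sets it around a transaction, and a lower layer clears it
on entry and restores it on exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Models a per-task flag; a single static is enough for one task. */
static bool scope_nofail;

/* Test hook: number of upcoming allocation attempts that should fail. */
static int scope_fail_next;

/* Enter a NOFAIL scope; returns the previous state for restore. */
static bool scope_nofail_save(void)
{
	bool old = scope_nofail;

	scope_nofail = true;
	return old;
}

/* Leave NOFAIL behaviour at a layer boundary; returns the previous state. */
static bool scope_nofail_clear(void)
{
	bool old = scope_nofail;

	scope_nofail = false;
	return old;
}

static void scope_nofail_restore(bool old)
{
	scope_nofail = old;
}

/* Stand-in for the underlying allocator; fails while scope_fail_next > 0. */
static void *scope_try_alloc(size_t size)
{
	if (scope_fail_next > 0) {
		scope_fail_next--;
		return NULL;
	}
	return malloc(size);
}

/* Generic allocator: retries only while the caller is in a NOFAIL scope. */
static void *scope_alloc(size_t size)
{
	void *ptr;

	do {
		ptr = scope_try_alloc(size);
	} while (!ptr && scope_nofail);
	return ptr;
}

/* Lower layer entry point: the caller's NOFAIL scope does not leak in. */
static void *scope_lower_layer_alloc(size_t size)
{
	bool old = scope_nofail_clear();
	void *ptr = scope_alloc(size);

	scope_nofail_restore(old);
	return ptr;
}
```

In this model no allocation site inside the scope carries an
annotation, so a new allocation added to transaction code inherits
the behaviour automatically, while allocations made by the lower
layer can still fail and fall back to whatever mechanism that layer
uses.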

> > > > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > > > BUG_ON(!page).
> > > > 
> > > > Agreed. NOWAIT is removing allocation failure constraints and I
> > > > don't think that can be made to work reliably. Error injection
> > > > cannot prove the absence of errors and so we can never be certain
> > > > the code will always operate correctly and not crash when an
> > > > unexpected allocation failure occurs.
> > > 
> > > You saying we don't know how to test code?
> > 
> > Yes, that's exactly what I'm saying.
> > 
> > I'm also saying that designing algorithms that aren't fail safe is
> > poor design. If you get it wrong and nothing bad can happen as a
> > result, then the design is fine.
> > 
> > But if the result of missing something accidentally is that the
> > system is guaranteed to crash when that is hit, then failure is
> > guaranteed and no amount of testing will prevent that failure from
> > occurring.
> > 
> > And we suck at testing, so we absolutely need to design fail
> > safe algorithms and APIs...
> 
> GFP_NOFAIL doesn't magically make your algorithm fail safe, though.

I never said it did - this part of the conversation was about the
failure prone design of proposed -NOWAIT- scoping, not about
trying to codify a generic mechanism for scoped behaviour we've been
using successfully for the past 20 years...

> Suren and I are trying to get memory allocation profiling into 6.9, and
> I'll be posting the improved fault injection immediately afterwards -
> this is what I used to use to make sure every allocation failure path in
> the bcachefs predecessor was tested. Hopefully that'll make things
> easier...

That all sounds good, but after a recent spate of "CI and post
integration testing didn't uncover fs bugs that fstests reproduced
until after tested kernels were released to test systems", I have
little confidence in the ability of larger QA organisations, let
alone individuals, to test filesystem code adequately when they are
constrained either by time or resources.

The fact of the matter is that we are all constrained by time
and resources. Hence adding new testing methods that require more
time and resources to validate new code and backports of fixes only
adds to the test matrix overhead; it does nothing to improve that
situation.

We need to start designing our code in a way that doesn't require
extensive testing to validate it as correct. If the only way to
validate that new code is correct is stochastic coverage via error
injection, then that is a clear sign we've made poor design choices
along the way.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx