Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Mon, 12 Feb 2024 14:30:02 -0500

On Mon, Feb 12, 2024 at 03:35:33PM +1100, Dave Chinner wrote:
> On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> > That's because in general most code in the IO path knows how to make
> > effective use of biosets and mempools (which may take some work! you
> > have to ensure that you're always able to make forward progress when
> > memory is limited, and in particular that you don't double allocate from
> > the same mempool if you're blocking the first allocation from
> > completing/freeing).
> 
> Yes, I understand this, and that's my point: NOIO context tends to
> be able to use mempools and other mechanisms to prevent memory
> allocation failure, not NOFAIL.
> 
> The IO layers are request based and that enables one-in, one out
> allocation pools that can guarantee single IO progress. That's all
> the IO layers need to guarantee to the filesystems so that forwards
> progress can always be made under memory pressure.
> 
> However, filesystems cannot guarantee "one in, one out" allocation
> behaviour. A transaction can require a largely unbounded number of
> memory allocations to succeed to make progress through to
> completion, and so things like mempools -cannot be used- to prevent
> memory allocation failures whilst providing a forwards progress
> guarantee.

I don't see that that's actually true. There's no requirement that
arbitrarily large IOs must be done atomically, within a single
transaction: there's been at most talk of eventually doing atomic writes
through the pagecache, but the people working on that can't even finish atomic
writes through the block layer, so who knows when that'll happen.

I generally haven't been running into filesystem operations that require
an unbounded number of memory allocations (reflink is a bit of an
exception in the current bcachefs code, and even that is just a
limitation I could solve if I really wanted to...)

> Hence a NOFAIL scope is useful at the filesystem layer for
> filesystem objects to ensure forwards progress under memory
> pressure, but it is completely unnecessary once we transition to the
> IO layer where forwards progress guarantees ensure memory allocation
> failures don't impede progress.
> 
> IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
> layers. The entry points to the block layer should transition the
> task to NOIO context and restore the previous context on exit. Then
> it becomes relatively trivial to apply context based filtering of
> allocation behaviour....
> 
> > > i.e. NOFAIL scopes are not relevant outside the subsystem that sets
> > > it. Hence we likely need helpers to clear and restore NOFAIL when
> > > we cross an allocation context boundary, e.g. as we cross from
> > > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > > they should be doing something like:
> > > 
> > > 	nofail_flags = memalloc_nofail_clear();
> > 
> > NOFAIL is not a scoped thing at all, period; it is very much a
> > _callsite_ specific thing, and it depends on whether that callsite has a
> > fallback.
> 
> *cough*
> 
> As I've already stated, NOFAIL allocation has been scoped in XFS for
> the past 20 years.
> 
> Every memory allocation inside a transaction *must* be NOFAIL unless
> otherwise specified because memory allocation inside a dirty
> transaction is a fatal error.

Say you start to incrementally mempoolify your allocations inside a
transaction - those mempools aren't going to do anything if there's a
scoped NOFAIL, and sorting that out is going to get messy fast.
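
To make that concrete, here is a toy userspace model, not kernel code. Every
name in it (toy_ctx, toy_pool, toy_alloc, memory_exhausted, nofail_stalls) is
made up for illustration, and it deliberately ignores how the real
mempool_alloc() adjusts gfp flags. It only assumes that a pool tries the
backing allocator first and dips into its reserve when that fails, and that a
scoped NOFAIL means the backing allocator never reports failure:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a task-wide allocation scope; not a kernel type. */
struct toy_ctx {
	bool nofail_scope;	/* models a scoped NOFAIL flag on the task */
};

#define TOY_RESERVE 2

/* Hypothetical stand-in for a mempool: a small preallocated reserve. */
struct toy_pool {
	void *reserve[TOY_RESERVE];
	int nr_reserved;
	int reserve_hits;	/* how often the reserve was actually used */
};

static bool memory_exhausted;	/* knob: make the backing allocator fail */
static int nofail_stalls;	/* times a NOFAIL allocation would have blocked */

/* Backing allocator: fails under exhaustion unless the scope says NOFAIL. */
static void *toy_alloc(const struct toy_ctx *ctx, size_t size)
{
	if (memory_exhausted) {
		if (!ctx->nofail_scope)
			return NULL;
		/* a real NOFAIL allocation would sit in reclaim here */
		nofail_stalls++;
	}
	return malloc(size);
}

/* Fill the reserve up front, outside of any memory pressure. */
static int toy_pool_init(struct toy_pool *pool, size_t size)
{
	pool->nr_reserved = 0;
	pool->reserve_hits = 0;
	while (pool->nr_reserved < TOY_RESERVE) {
		void *p = malloc(size);

		if (!p)
			return -1;
		pool->reserve[pool->nr_reserved++] = p;
	}
	return 0;
}

/* Try the backing allocator first; fall back to the reserve on failure. */
static void *toy_pool_alloc(const struct toy_ctx *ctx, struct toy_pool *pool,
			    size_t size)
{
	void *p = toy_alloc(ctx, size);

	if (p)
		return p;
	if (pool->nr_reserved > 0) {
		pool->reserve_hits++;
		return pool->reserve[--pool->nr_reserved];
	}
	return NULL;
}
```

With memory_exhausted set, a context without the scope gets its element from
the reserve (reserve_hits goes up), while a context with nofail_scope set
never sees the backing allocation fail, so it stalls in the allocator and the
reserve sits unused.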

> However, that scoping has never been
> passed to the NOIO contexts below the filesystem - it's scoped
> purely within the filesystem itself and doesn't pass on to other
> subsystems the filesystem calls into.

How is that managed?

> 
> > The most obvious example being, as mentioned previously, mempools.
> 
> Yes, they require one-in, one-out guarantees to avoid starvation and
> ENOMEM situations. And, as we've known since mempools were
> invented, these guarantees cannot be provided by most filesystems.
> 
> > > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > > BUG_ON(!page).
> > > 
> > > Agreed. NOWAIT is removing allocation failure constraints and I
> > > don't think that can be made to work reliably. Error injection
> > > cannot prove the absence of errors and so we can never be certain
> > > the code will always operate correctly and not crash when an
> > > unexpected allocation failure occurs.
> > 
> > You saying we don't know how to test code?
> 
> Yes, that's exactly what I'm saying.
> 
> I'm also saying that designing algorithms that aren't fail safe is
> poor design. If you get it wrong and nothing bad can happen as a
> result, then the design is fine.
> 
> But if the result of missing something accidentally is that the
> system is guaranteed to crash when that is hit, then failure is
> guaranteed and no amount of testing will prevent that failure from
> occurring.
> 
> And we suck at testing, so we absolutely need to design fail
> safe algorithms and APIs...

GFP_NOFAIL doesn't magically make your algorithm fail safe, though.

Suren and I are trying to get memory allocation profiling into 6.9, and
I'll be posting the improved fault injection immediately afterwards -
this is what I used to use to make sure every allocation failure path in
the bcachefs predecessor was tested. Hopefully that'll make things
easier...
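
For anyone unfamiliar with the general technique, a minimal userspace sketch
of exhaustive allocation fault injection follows. It is not the kernel's
fault-injection framework and not the code referred to above; fi_malloc,
fi_free, do_operation and exercise_failure_paths are hypothetical names. The
idea is to fail the 1st allocation, then the 2nd, and so on until a run
completes without an injected failure, checking after each run that the error
was reported and nothing leaked:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace harness; not the kernel fault-injection API. */
static int fail_countdown;	/* fail the Nth allocation from now; 0 = never */
static int injected;		/* set when a failure was actually injected */
static int live_allocs;		/* outstanding allocations, for leak checks */

static void *fi_malloc(size_t size)
{
	void *p;

	if (fail_countdown > 0 && --fail_countdown == 0) {
		injected = 1;
		return NULL;
	}
	p = malloc(size);
	if (p)
		live_allocs++;
	return p;
}

static void fi_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Code under test: three allocations, each with its own unwind path. */
static int do_operation(void)
{
	void *a, *b, *c;
	int ret = -1;

	a = fi_malloc(16);
	if (!a)
		goto out;
	b = fi_malloc(32);
	if (!b)
		goto free_a;
	c = fi_malloc(64);
	if (!c)
		goto free_b;

	ret = 0;
	fi_free(c);
free_b:
	fi_free(b);
free_a:
	fi_free(a);
out:
	return ret;
}

/*
 * Fail allocation 1, then 2, ... until a run completes with no injection.
 * Returns the number of failure paths exercised, or -1 if a run leaked
 * memory or swallowed an injected failure.
 */
static int exercise_failure_paths(void)
{
	int n;

	for (n = 1; ; n++) {
		int ret;

		fail_countdown = n;
		injected = 0;
		live_allocs = 0;
		ret = do_operation();
		fail_countdown = 0;

		if (live_allocs != 0)
			return -1;	/* an unwind path leaked */
		if (!injected)
			return ret == 0 ? n - 1 : -1;
		if (ret == 0)
			return -1;	/* failure was not propagated */
	}
}
```

This only covers failure points that are reached deterministically in a
single-threaded run; it says nothing about paths that depend on timing or on
state the harness does not set up.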