Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 12 Feb 2024 15:35:33 +1100

On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 12:20:32PM +1100, Dave Chinner wrote:
> > On Thu, Feb 08, 2024 at 08:55:05PM +0100, Vlastimil Babka (SUSE) wrote:
> > > On 2/8/24 18:33, Michal Hocko wrote:
> > > > On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
> > > >> On 1/9/24 05:47, Dave Chinner wrote:
> > > >> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> > > >> 
> > > >> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggest to me this
> > > >> is no longer an FS-only topic as this isn't just about converting to the scoped
> > > >> apis, but also how they should be improved.
> > > > 
> > > > Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
> > > > scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
> > > > mode. It is still tricky to deal with GFP_NOWAIT requests inside the
> > > > NOFAIL scope because that makes it a non failing busy wait for an
> > > > allocation if we need to insist on scope NOFAIL semantic. 
> > > > 
> > > > On the other hand we can define the behavior similar to what you
> > > > propose with RETRY_MAYFAIL resp. NORETRY. Existing NOWAIT users should
> > > > better handle allocation failures regardless of the external allocation
> > > > scope.
> > > > 
> > > > Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
> > > > resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
> > > > not see an immediate problem with that.
> > > > 
> > > > Having more NOFAIL allocations is not great but if you need to
> > > > emulate those by implementing the nofail semantic outside of the
> > > > allocator then it is better to have those retries inside the allocator
> > > > IMO.
> > > 
> > > I see potential issues in scoping both the NOWAIT and NOFAIL
> > > 
> > > - NOFAIL - I'm assuming Dave is adding __GFP_NOFAIL to xfs allocations or
> > > adjacent layers where he knows they must not fail for his transaction. But
> > > could the scope affect also something else underneath that could fail
> > > without the failure propagating in a way that it affects xfs?
> > 
> > Memory allocation failures below the filesystem (i.e. in the IO
> > path) will fail the IO, and if that happens for a read IO within
> > a transaction then it will have the same effect as XFS failing a
> > memory allocation. i.e. it will shut down the filesystem.
> > 
> > The key point here is the moment we go below the filesystem we enter
> > into a new scoped allocation context with a guaranteed method of
> > returning errors: NOIO and bio errors.
> 
> Hang on, you're conflating NOIO to mean something completely different -
> NOIO means "don't recurse in reclaim", it does _not_ mean anything about
> what happens when the allocation fails,

Yes, I know that's what NOIO means. I'm not conflating it with
anything else.

> and in particular it definitely
> does _not_ mean that failing the allocation is going to result in an IO
> error.

Exactly. FS level NOFAIL contexts simply do not apply to NOIO
context functionality. NOIO contexts require different mechanisms to
guarantee forwards progress under memory pressure. They work
pretty well, and we don't want or need to perturb them by having
them inherit filesystem level NOFAIL semantics.

i.e. architecturally speaking, NOIO is a completely separate
allocation domain to NOFS.

> That's because in general most code in the IO path knows how to make
> effective use of biosets and mempools (which may take some work! you
> have to ensure that you're always able to make forward progress when
> memory is limited, and in particular that you don't double allocate from
> the same mempool if you're blocking the first allocation from
> completing/freeing).

Yes, I understand this, and that's my point: NOIO context tends to
be able to use mempools and other mechanisms to prevent memory
allocation failure, not NOFAIL.

The IO layers are request based and that enables one-in, one-out
allocation pools that can guarantee single IO progress. That's all
the IO layers need to guarantee to the filesystems so that forwards
progress can always be made under memory pressure.

However, filesystems cannot guarantee "one in, one out" allocation
behaviour. A transaction can require a largely unbounded number of
memory allocations to succeed to make progress through to
completion, and so things like mempools -cannot be used- to prevent
memory allocation failures whilst providing a forwards progress
guarantee.
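
To make the difference concrete, here is a minimal userspace sketch. It
is not kernel code: struct pool, pool_get(), pool_put(),
run_requests() and run_transaction() are all hypothetical names, and
the model assumes memory pressure is bad enough that only the pool's
reserve is available. A request-style user returns its element before
taking another, so a reserve of one is enough. A transaction-style user
holds everything until commit, so it ends up waiting on itself as soon
as it needs more than the reserve.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_RESERVE 1	/* one preallocated element, enough for a single IO */

/* Hypothetical model of a reserve pool; not the kernel mempool API. */
struct pool {
	int free;	/* reserved elements currently available */
};

/* Take a reserved element; false means the caller would have to wait. */
static bool pool_get(struct pool *p)
{
	if (p->free == 0)
		return false;
	p->free--;
	return true;
}

/* Return an element to the reserve. */
static void pool_put(struct pool *p)
{
	assert(p->free < POOL_RESERVE);
	p->free++;
}

/* Request-based user: take one, complete, give it back. */
static bool run_requests(struct pool *p, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (!pool_get(p))
			return false;
		pool_put(p);	/* completion returns the element */
	}
	return true;
}

/* Transaction-like user: holds every object until commit. */
static bool run_transaction(struct pool *p, int nr_objects)
{
	int held = 0;
	bool ok = true;

	while (held < nr_objects) {
		if (!pool_get(p)) {
			ok = false;	/* would wait on elements it holds itself */
			break;
		}
		held++;
	}
	while (held-- > 0)
		pool_put(p);	/* commit or abort releases everything */
	return ok;
}
```

With a reserve of one, run_requests() succeeds for any number of
sequential requests, while run_transaction() succeeds for one object
and fails for two.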

Hence a NOFAIL scope is useful at the filesystem layer for
filesystem objects to ensure forwards progress under memory
pressure, but it is completely unnecessary once we transition to the
IO layer where forwards progress guarantees ensure memory allocation
failures don't impede progress.

IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
layers. The entry points to the block layer should transition the
task to NOIO context and restore the previous context on exit. Then
it becomes relatively trivial to apply context based filtering of
allocation behaviour....
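
As a rough illustration of that boundary crossing, here is a userspace
model. Everything in it is hypothetical: the CTX_* flags, task_ctx,
blk_ctx_enter(), blk_ctx_exit() and alloc_may_fail() are made-up names
for this sketch, not existing kernel interfaces. The save/restore shape
is borrowed from the existing memalloc_noio_save() and
memalloc_noio_restore() pattern.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scope flags for this sketch; not kernel definitions. */
enum {
	CTX_NOFS   = 1u << 0,	/* filesystem scope: no recursion into the FS */
	CTX_NOIO   = 1u << 1,	/* block scope: no recursion into IO */
	CTX_NOFAIL = 1u << 2,	/* filesystem scope: allocations must not fail */
};

static unsigned int task_ctx;	/* stands in for per-task scope state */

/* Enter the block layer: set NOIO, drop NOFAIL, hand back the old state. */
static unsigned int blk_ctx_enter(void)
{
	unsigned int old = task_ctx;

	task_ctx = (task_ctx | CTX_NOIO) & ~CTX_NOFAIL;
	return old;
}

/* Leave the block layer: restore whatever the caller had. */
static void blk_ctx_exit(unsigned int old)
{
	assert(task_ctx & CTX_NOIO);	/* exit must pair with an enter */
	task_ctx = old;
}

/* Context based filtering: may an allocation fail in the current scope? */
static bool alloc_may_fail(void)
{
	return !(task_ctx & CTX_NOFAIL);
}
```

A filesystem running with CTX_NOFS | CTX_NOFAIL would see
alloc_may_fail() return false, then true between blk_ctx_enter() and
blk_ctx_exit(), then false again once the saved state is restored.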

> > i.e. NOFAIL scopes are not relevant outside the subsystem that sets
> > it. Hence we likely need helpers to clear and restore NOFAIL when
> > we cross allocation context boundaries, e.g. as we cross from
> > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > they should be doing something like:
> > 
> > 	nofail_flags = memalloc_nofail_clear();
> 
> NOFAIL is not a scoped thing at all, period; it is very much a
> _callsite_ specific thing, and it depends on whether that callsite has a
> fallback.

*cough*

As I've already stated, NOFAIL allocation has been scoped in XFS for
the past 20 years.

Every memory allocation inside a transaction *must* be NOFAIL unless
otherwise specified because memory allocation failure inside a dirty
transaction is a fatal error. However, that scoping has never been
passed to the NOIO contexts below the filesystem - it's scoped
purely within the filesystem itself and doesn't pass on to other
subsystems the filesystem calls into.
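
A minimal userspace sketch of "NOFAIL unless otherwise specified"
follows. The names are hypothetical (in_transaction, ALLOC_MAYFAIL,
raw_alloc(), trans_alloc(), and the fail_budget injection hook) and
this is not the XFS implementation, only the shape of the rule: inside
the scope the wrapper retries until it succeeds, and a call site with a
fallback opts out explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALLOC_MAYFAIL 0x1u	/* hypothetical per-call opt-out */

static bool in_transaction;	/* hypothetical scope marker */
static int fail_budget;		/* injection hook: fail this many attempts */

/* Stand-in for the underlying allocator, with injectable failures. */
static void *raw_alloc(size_t size)
{
	if (fail_budget > 0) {
		fail_budget--;
		return NULL;
	}
	return malloc(size);
}

/* Retry inside the scope unless the call site says it can cope. */
static void *trans_alloc(size_t size, unsigned int flags)
{
	void *p;

	assert(size > 0);
	do {
		p = raw_alloc(size);
	} while (!p && in_transaction && !(flags & ALLOC_MAYFAIL));
	return p;
}
```

With in_transaction set and three injected failures, trans_alloc(16, 0)
retries through all of them and returns memory, while
trans_alloc(16, ALLOC_MAYFAIL) returns NULL after the first failed
attempt.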

> The most obvious example being, as mentioned previously, mempools.

Yes, they require one-in, one-out guarantees to avoid starvation and
ENOMEM situations. And, as we've known since mempools were invented,
most filesystems cannot provide these guarantees.

> > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > BUG_ON(!page).
> > 
> > Agreed. NOWAIT is removing allocation failure constraints and I
> > don't think that can be made to work reliably. Error injection
> > cannot prove the absence of errors and so we can never be certain
> > the code will always operate correctly and not crash when an
> > unexpected allocation failure occurs.
> 
> You saying we don't know how to test code?

Yes, that's exactly what I'm saying.

I'm also saying that designing algorithms that aren't fail safe is
poor design. If you get it wrong and nothing bad can happen as a
result, then the design is fine.

But if the result of missing something accidentally is that the
system is guaranteed to crash when that is hit, then failure is
guaranteed and no amount of testing will prevent that failure from
occurring.

And we suck at testing, so we absolutely need to design fail
safe algorithms and APIs...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx