On Wed, Nov 24, 2021 at 02:16:56PM +1100, NeilBrown wrote:
> On Wed, 24 Nov 2021, Andrew Morton wrote:
> >
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops. I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing. In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever. Sigh.
>
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.
>
> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer". That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??
>
> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want. We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

mempools are not suitable for anything that uses demand paging to
hold an unbounded data set in memory before it can free anything.
This is basically the definition of memory demand in an XFS
transaction, and most transaction based filesystems have similar
behaviour.

> preallocating can also be useful but can be clumsy to implement. Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space. That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

We talked about this years ago at LSFMM (2013 or 2014, IIRC).
The problem is "how much do you preallocate when the worst case
requirement to guarantee forwards progress is at least tens of
megabytes". Considering that there might be thousands of these
contexts running concurrently at any given time and we might be
running through several million preallocation contexts a second,
suddenly preallocation is a great big CPU and memory pit.

Hence preallocation simply doesn't work when the scope to guarantee
forwards progress is in the order of megabytes (even tens of
megabytes) per "no fail" context scope. In situations like this we
need -memory reservations-, not preallocation.

Have the mm guarantee that there is a certain amount of memory
available (even if it requires reclaim to make available) before we
start the operation that cannot tolerate memory allocation failure,
track the memory usage as it goes, warn if it overruns, and release
the unused part of the reservation when context completes.

[ This is what we already do in XFS for guaranteeing forwards
progress for writing modifications into the strictly bound journal
space. We reserve space up front and use tracking tickets to account
for space used across each transaction context. This avoids
overcommit and all the deadlock and corruption problems that come
from running out of physical log space to write all the changes
we've made into the log. We could simply add memory reservations and
tracking structures to the transaction context and we've pretty much
got everything we need on the XFS side covered... ]

> i.e. we need a plan for how to rework all those no-fail call-sites.

Even if we do make all the filesystems handle ENOMEM errors
gracefully and pass that back up to userspace, how are applications
going to react to random ENOMEM errors when doing data writes or
file creation or any other operation that accesses filesystems?

Given the way applications handle transient errors (i.e. they don't!)
propagating ENOMEM back up to userspace will result in applications
randomly failing under memory pressure. That's a much worse
situation than having to manage the _extremely rare_ issues that
occur because of __GFP_NOFAIL usage in the kernel.

Let's keep that in mind here - __GFP_NOFAIL usage is not causing
system failures in the real world, whilst propagating ENOMEM back
out to userspace is potentially very harmful to users....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
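
For illustration, the reserve/track/warn/release cycle described
above for memory reservations can be sketched in userspace C. This
is only a minimal sketch under stated assumptions, not an existing mm
or XFS interface: struct mem_pool, struct mem_resv and the
mem_resv_*() functions are hypothetical names invented here, and a
real implementation would presumably wait on reclaim rather than
refuse the reservation outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for memory the mm is able to guarantee. */
struct mem_pool {
	size_t available;	/* bytes not promised to any reservation */
};

/*
 * Hypothetical per-context ticket, loosely analogous to the log space
 * tracking tickets mentioned above.
 */
struct mem_resv {
	struct mem_pool *pool;
	size_t reserved;	/* bytes guaranteed up front */
	size_t used;		/* bytes charged so far */
	bool overrun;		/* set once used exceeds reserved */
};

/*
 * Take the reservation before the no-fail section starts. This is the
 * only step that can fail in this sketch, so the error path is early.
 */
static bool mem_resv_take(struct mem_pool *pool, struct mem_resv *r,
			  size_t bytes)
{
	if (pool->available < bytes)
		return false;
	pool->available -= bytes;
	r->pool = pool;
	r->reserved = bytes;
	r->used = 0;
	r->overrun = false;
	return true;
}

/* Track usage as the context allocates; warn once on overrun. */
static void mem_resv_charge(struct mem_resv *r, size_t bytes)
{
	r->used += bytes;
	if (r->used > r->reserved && !r->overrun) {
		r->overrun = true;
		fprintf(stderr, "mem_resv: overrun, used %zu > reserved %zu\n",
			r->used, r->reserved);
	}
}

/*
 * Hand the unused part back when the context completes. Charged bytes
 * stay out of the pool; they are assumed to be freed by normal means.
 * Returns the number of bytes released.
 */
static size_t mem_resv_release(struct mem_resv *r)
{
	size_t unused;

	assert(r->pool != NULL);
	unused = r->reserved > r->used ? r->reserved - r->used : 0;
	r->pool->available += unused;
	r->reserved = 0;
	return unused;
}
```

A caller would take the reservation with mem_resv_take() before
entering the section that cannot tolerate allocation failure, call
mem_resv_charge() for each allocation made inside it, and call
mem_resv_release() on completion. Only the worst-case accounting is
held up front; no memory is actually preallocated per context.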