On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote:
> > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > > > according to call path and allocation size, you still need to be
> > > > checking for failure - in the same way that you shouldn't be using
> > > > BUG_ON() if you cannot prove that the condition won't occur in
> > > > real world usage.
> > >
> > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > > now. This was the default Irix kernel allocator behaviour (it had a
> > > forwards progress guarantee and would never fail an allocation
> > > unless told it could do so). We've been using the same "guaranteed
> > > not to fail" semantics on Linux since the original port started 25
> > > years ago via open-coded loops.
> > >
> > > IOWs, __GFP_NOFAIL semantics have been production tested for a
> > > couple of decades on Linux via XFS, and nobody here can argue that
> > > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> > > as it is used by XFS is reliable and lives up to the "will not
> > > fail" guarantee that it is supposed to have.
> > >
> > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing
> > >
> > >	do {
> > >		p = kmalloc(size, GFP_KERNEL);
> > >	} while (!p);
> > >
> > > so that they blocked until memory allocation succeeded. The call
> > > sites do not check for failure, because -failure never occurs-.
> > >
> > > The MM devs want to have visibility of these allocations - they may
> > > not like them, but having __GFP_NOFAIL means it's trivial to audit
> > > all the allocations that use these semantics. IOWs, __GFP_NOFAIL
> > > was created with an explicit guarantee that it -will not fail- for
> > > normal allocation contexts so it could replace all the open-coded
> > > will-not-fail allocation loops.
> > >
> > > Given this guarantee, we recently removed these historic allocation
> > > wrapper loops from XFS and replaced them with __GFP_NOFAIL at the
> > > allocation call sites. There are nearly a hundred memory allocation
> > > locations in XFS that are tagged with __GFP_NOFAIL.
> > >
> > > If we're now going to have the "will not fail" guarantee taken away
> > > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> > > it be used anywhere else that a "will not fail" guarantee is
> > > required.
> > >
> > > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > > can fail due to external scoped memory allocation contexts. This
> > > will force us to revert all __GFP_NOFAIL allocations back to
> > > open-coded will-not-fail loops.
> > >
> > > This is not a step forwards for anyone.
> >
> > Hello Dave,
> >
> > I've noticed that XFS has increasingly replaced kmem_alloc() with
> > __GFP_NOFAIL. For example, in kernel 4.19.y there are 0 instances of
> > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y there are 41
> > occurrences. In kmem_alloc(), there is an explicit
> > memalloc_retry_wait() to throttle the allocator under heavy memory
> > pressure, which aligns with your filesystem design. However, using
> > __GFP_NOFAIL removes this throttling mechanism, potentially causing
> > issues when the system is under heavy memory load. I'm concerned
> > that this shift might not be a beneficial trend.
>
> AIUI, the memory allocation looping has back-offs already built in
> to it when memory reserves are exhausted and/or reclaim is
> congested.
>
> e.g.:
>
>   get_page_from_freelist()
>     (zone below watermark)
>     node_reclaim()
>       __node_reclaim()
>         shrink_node()
>           reclaim_throttle()

That throttling applies to all kinds of allocations alike, though,
not to __GFP_NOFAIL in particular.

> And the call to reclaim_throttle() will do the equivalent of
> memalloc_retry_wait() (a 2ms sleep).

I'm wondering whether we should take special action for __GFP_NOFAIL,
as currently it just loops endlessly with no intervention.
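
For reference, the removed kmem_alloc() wrapper throttled its retry
loop explicitly. A simplified sketch of that pattern (the function
name and details here are illustrative, not the exact XFS code):

	#include <linux/sched/mm.h>
	#include <linux/slab.h>

	/*
	 * Simplified sketch of the old kmem_alloc()-style wrapper:
	 * retry the allocation forever, but sleep between attempts
	 * via memalloc_retry_wait() so the caller backs off under
	 * memory pressure instead of hammering the allocator.
	 */
	static void *
	kmem_alloc_nofail(size_t size, gfp_t lflags)
	{
		void	*ptr;

		do {
			ptr = kmalloc(size, lflags);
			if (ptr)
				return ptr;
			memalloc_retry_wait(lflags);
		} while (1);
	}

With __GFP_NOFAIL the equivalent looping happens inside the page
allocator instead, where the caller has no hook like this.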

> > We have been using XFS for our big data servers for years, and it
> > has consistently performed well with older kernels like 4.19.y.
> > However, after upgrading all our servers from 4.19.y to 6.1.y over
> > the past two years, we have frequently encountered livelock issues
> > caused by memory exhaustion. To mitigate this, we've had to limit
> > the RSS of applications, which isn't an ideal solution and
> > represents a worrying trend.
>
> If userspace uses all of memory all the time, then the best the
> kernel can do is slowly limp along. Preventing userspace from
> overcommitting memory to the point of OOM is the only way to avoid
> these "userspace wants more memory than the machine physically has"
> sorts of issues. i.e. this is not a problem that the kernel code can
> solve short of randomly killing userspace applications...

We expect an OOM kill to occur in that situation, but it never does,
which is the problem.

--
Regards
Yafang