On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> On Thu, 29 Feb 2024, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Recent discussions [1] suggest that greater mutual understanding between
> > > > memory reclaim on the one hand and RCU on the other might be in order.
> > > >
> > > > One possibility would be an open discussion.  If it would help, I would
> > > > be happy to describe how RCU reacts and responds to heavy load, along with
> > > > some ways that RCU's reactions and responses could be enhanced if needed.
> > > >
> > >
> > > Adding fsdevel as this should probably be a cross track session.
> >
> > Perhaps broaden this slightly.  On the THP Cabal call we just had a
> > conversation about the requirements on filesystems in the writeback
> > path.  We currently tell filesystem authors that the entire writeback
> > path must avoid allocating memory in order to prevent deadlock (or use
> > GFP_MEMALLOC).  Is this appropriate?  It's a lot of work to assure that
> > writing pagecache back will not allocate memory in, eg, the network stack,
> > the device driver, and any other layers the write must traverse.
> >
> > With the removal of ->writepage from vmscan, perhaps we can make
> > filesystem authors' lives easier by relaxing this requirement as pagecache
> > should be cleaned long before we get to reclaiming it.
> >
> > I don't think there's anything to be done about swapping anon memory.
> > We probably don't want to proactively write anon memory to swap, so by
> > the time we're in ->swap_rw we really are low on memory.
> >
> >
>
> While we are considering revising mm rules, I would really like to
> revise the rule that GFP_KERNEL allocations are allowed to fail.
> I'm not at all sure that they ever do (except for large allocations - so
> maybe we could leave that exception in - or warn if large allocations
> are tried without a MAY_FAIL flag).
>
> Given that GFP_KERNEL can wait, and that the mm can kill off processes
> and clear cache to free memory, there should be no case where failure is
> needed or when simply waiting will eventually result in success.  And if
> there is, the machine is a goner anyway.
>
> Once upon a time user-space pages could not be ripped out of a process
> by the oom killer until the process actually exited, and that meant that
> GFP_KERNEL allocations of a process being oom killed should not block
> indefinitely in the allocator.  I *think* that isn't the case any more.
>
> Insisting that GFP_KERNEL allocations never returned NULL would allow us
> to remove a lot of untested error handling code....

If memcg ever gets enabled for all kernel-side allocations we might
start seeing failures of GFP_KERNEL allocations.

I've got better fault injection code coming; I'll be posting it right
after memory allocation profiling gets merged - that'll help with the
testing situation.

The big blocker on enabling memcg for all kernel allocations is
performance overhead, but I hear that's getting worked on as well.  We'd
probably want to add a gfp flag to annotate which allocations we want to
fail because of memcg, though...
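
To make the last two points concrete, here is a minimal userspace sketch
of allocation fault injection gated on a per-call annotation.  It is not
the kernel's fault injection framework and not the code referred to
above.  Every name in it (SKETCH_MAY_FAIL, struct fault_inject,
sketch_alloc, build_buffers) is hypothetical, and malloc() stands in for
the real allocator.  Only allocations that carry the annotation are
eligible for an injected failure, so their error paths get exercised
while unannotated callers never see NULL from injection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag, NOT a real gfp flag: marks an allocation whose
 * caller has a failure path and opts in to injected failures. */
#define SKETCH_MAY_FAIL 0x1u

/* Injection state: fail every 'interval'-th eligible allocation. */
struct fault_inject {
	unsigned int interval;	/* 0 disables injection */
	unsigned int eligible;	/* annotated allocations seen so far */
	unsigned int injected;	/* failures injected so far */
};

struct fault_inject fi;

void fault_inject_setup(unsigned int interval)
{
	fi.interval = interval;
	fi.eligible = 0;
	fi.injected = 0;
}

/* Returns true when this allocation should be forced to fail. */
bool should_fail(unsigned int flags)
{
	if (!(flags & SKETCH_MAY_FAIL) || fi.interval == 0)
		return false;
	if (++fi.eligible % fi.interval != 0)
		return false;
	fi.injected++;
	return true;
}

/* Userspace stand-in for an allocator entry point. */
void *sketch_alloc(size_t size, unsigned int flags)
{
	assert(size > 0);
	if (should_fail(flags))
		return NULL;
	return malloc(size);
}

/* Example caller: allocates n buffers and unwinds cleanly on failure.
 * Returns 0 on success, -1 if any allocation failed. */
int build_buffers(size_t n, unsigned int flags)
{
	void **bufs = calloc(n, sizeof(*bufs));
	size_t i;
	int ret = 0;

	if (!bufs)
		return -1;
	for (i = 0; i < n; i++) {
		bufs[i] = sketch_alloc(64, flags);
		if (!bufs[i]) {
			ret = -1;
			break;
		}
	}
	/* Free whatever was allocated; this demo keeps nothing. */
	while (i-- > 0)
		free(bufs[i]);
	free(bufs);
	return ret;
}
```

Calling fault_inject_setup(3) and then build_buffers(5, SKETCH_MAY_FAIL)
forces the third allocation to fail and drives the unwind path; the same
call with flags of 0 is never subjected to injection.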