On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> On Thu, 29 Feb 2024, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Recent discussions [1] suggest that greater mutual understanding
> > > > between memory reclaim on the one hand and RCU on the other might
> > > > be in order.
> > > >
> > > > One possibility would be an open discussion.  If it would help, I
> > > > would be happy to describe how RCU reacts and responds to heavy
> > > > load, along with some ways that RCU's reactions and responses
> > > > could be enhanced if needed.
> > > >
> > >
> > > Adding fsdevel as this should probably be a cross track session.
> >
> > Perhaps broaden this slightly.  On the THP Cabal call we just had a
> > conversation about the requirements on filesystems in the writeback
> > path.  We currently tell filesystem authors that the entire writeback
> > path must avoid allocating memory in order to prevent deadlock (or use
> > GFP_MEMALLOC).  Is this appropriate?  It's a lot of work to assure that
> > writing pagecache back will not allocate memory in, eg, the network
> > stack, the device driver, and any other layers the write must traverse.
> >
> > With the removal of ->writepage from vmscan, perhaps we can make
> > filesystem authors' lives easier by relaxing this requirement, as
> > pagecache should be cleaned long before we get to reclaiming it.
> >
> > I don't think there's anything to be done about swapping anon memory.
> > We probably don't want to proactively write anon memory to swap, so by
> > the time we're in ->swap_rw we really are low on memory.
> >
>
> While we are considering revising mm rules, I would really like to
> revise the rule that GFP_KERNEL allocations are allowed to fail.
> I'm not at all sure that they ever do (except for large allocations - so
> maybe we could leave that exception in - or warn if large allocations
> are tried without a MAY_FAIL flag).
>
> Given that GFP_KERNEL can wait, and that the mm can kill off processes
> and clear cache to free memory, there should be no case where failure is
> needed or when simply waiting will eventually result in success.  And if
> there is, the machine is a gonner anyway.

Yes, please!

XFS was designed and implemented on an OS that gave this exact
guarantee for kernel allocations back in the early 1990s. Memory
allocation simply blocked until it succeeded unless the caller
indicated it could handle failure. That's what __GFP_NOFAIL does,
and XFS is still heavily dependent on this behaviour.

And before people scream "but that was 30 years ago, Unix OS code
was much simpler", consider that Irix supported machines with
hundreds of NUMA nodes, thousands of CPUs, terabytes of memory and
petabytes of storage. It had variable size high order pages in the
page cache (something we've only just got with folios!), page
migration, page compaction, memory and process locality control,
filesystem block sizes larger than page size (which we don't have
yet!), memory shrinkers for subsystem cache reclaim, page cache
dirty throttling to sustained writeback IO rates, etc.

Lots of the mm technology from that OS has been re-implemented in
Linux over the past two decades, but in several important ways Linux
still falls shy of the bar that Irix set a couple of decades ago.
One of those is the kernel memory allocation guarantee.
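
To make that concrete, here's a minimal sketch of what the guarantee
means for callers. It's illustrative only - not real XFS code, the
structure and function names are made up:

#include <linux/list.h>
#include <linux/slab.h>

struct log_item {
	struct list_head	li_list;
	unsigned int		li_flags;
};

/*
 * Without a no-fail guarantee, every caller grows an unwind path
 * that is almost never exercised in normal testing.
 */
static struct log_item *log_item_alloc_mayfail(void)
{
	struct log_item *lip;

	lip = kzalloc(sizeof(*lip), GFP_KERNEL);
	if (!lip)
		return NULL;	/* rarely travelled error path */
	INIT_LIST_HEAD(&lip->li_list);
	return lip;
}

/*
 * With __GFP_NOFAIL the allocator keeps retrying (reclaiming,
 * waiting) until the allocation succeeds, so the caller never sees
 * NULL and the failure handling disappears entirely.
 */
static struct log_item *log_item_alloc_nofail(void)
{
	struct log_item *lip;

	lip = kzalloc(sizeof(*lip), GFP_KERNEL | __GFP_NOFAIL);
	INIT_LIST_HEAD(&lip->li_list);
	return lip;
}

The second form is effectively what XFS has relied on since the Irix
days; the argument here is simply to make that the default semantics
for small GFP_KERNEL allocations rather than an opt-in flag.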
> Once upon a time user-space pages could not be ripped out of a process
> by the oom killer until the process actually exited, and that meant that
> GFP_KERNEL allocations of a process being oom killed should not block
> indefinitely in the allocator.  I *think* that isn't the case any more.
>
> Insisting that GFP_KERNEL allocations never returned NULL would allow us
> to remove a lot of untested error handling code....

This is the sort of thing I was thinking of in the "remove GFP_NOFS"
discussion thread when I said this to Kent:

"We need to start designing our code in a way that doesn't require
extensive testing to validate it as correct. If the only way to
validate new code is correct is via stochastic coverage via error
injection, then that is a clear sign we've made poor design choices
along the way."

https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@xxxxxxxxxxxxxxxxxxx/

If memory allocation doesn't fail by default, then we can remove the
vast majority of allocation error handling from the kernel. Make the
common case just work - remove the need for all that code to handle
failures that are hard to exercise reliably and so are rarely tested.

A simple change to make long-standing behaviour an actual policy we
can rely on means we can remove both code and test matrix overhead -
it's a win-win IMO.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx