On Fri, Jul 29, 2022 at 04:11:45PM +0200, Christoph Hellwig wrote:
> On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote:
> > There is some context missing because it's not clear what the full impact is
> > but it is definitely the case that writepage is ignored in some contexts for
> > common filesystems so let's assume that writepage from reclaim context always
> > failed as a worst case scenario. Certainly this type of change is something
> > linux-mm needs to be aware of because we've been blind-sided before.
>
> Between willy and Johannes pushing for it I was under the strong assumption
> that linux-mm knows of it..

Yes, the context was an FS session at LSFMM. FS folks complained about
the MM relying on single-page writeouts. On the MM side we've invested
a lot into eliminating this dependency over the last decade or so, and
I would argue it's gone today. But the calls are still in the code,
and so FS folks continue to operate under the old assumption. You
can't blame them. I suggested we remove the callbacks to clarify
things and eliminate that murky corner from the FS/MM interface.

Compaction/migration is easy. It simply never calls writepage when
there is a migratepage callback - which major filesystems have.

Reclaim may still call it, but the invocation rules are so restrictive
nowadays that it's unlikely to actually help when it matters (and we
know it makes things worse in many cases).

For example, cgroup reclaim isn't ever allowed to call writepage. This
covers the small system scenario.

Whether writepage helps under OOM is not clear. OOMing systems tend to
have thundering herds of direct reclaimers, any one of which can
declare OOM if they fail, yet none of them can write pages. They
already rely on another thread to make progress. That thread can be
the flushers writing in offset order, or kswapd writing in LRU order.
You could argue that LRU order, while less efficient IO, launders
pages closer to the scanner. From an OOM perspective that doesn't
matter, though, because scanners will work through the entire LRU list
several times before giving up. They won't miss flusher progress.

That leaves reclaim efficiency - having to scan fewer pages before
potentially finding clean ones. But it's an IO bound scenario, so
arguably efficient IO would seem more important than efficient CPU.

The risk of this change, IMO, is exposing reclaim to flat out bugs in
the flusher code, or bugs in the code that matches reclaim to flushing
speed. However, a) cgroup has been relying on those for a decade. And
b) we've been treating writepage calls like bugs due to the latency
they inject into workloads, and tuned the MM to rely more on flushers
(e.g. c55e8d035b28 ("mm: vmscan: move dirty pages out of the way until
they're flushed")). So we know this stuff works at scale and with real
workloads. I think the risk of dragons there is quite low.

XFS hasn't had a ->writepage call for a while. After LSF I internally
tested dropping btrfs' callback, and the results looked good: no OOM
kills with dirty/writeback pages remaining, performance parity. Then I
went on vacation and Christoph beat me to the patch :)

I think it's a really good cleanup that makes things simpler and more
predictable in both the fs and the mm.

> > I don't think it would be incredibly damaging although there *might* be
> > issues with small systems or cgroups.
>
> Johannes specifically mentioned that cgroup writeback will never call
> into ->writepage anyway.

Yes, cgroup has relied on the flushers since

	commit ee72886d8ed5d9de3fa0ed3b99a7ca7702576a96
	Author: Mel Gorman <mel@xxxxxxxxx>
	Date:   Mon Oct 31 17:07:38 2011 -0700

	    mm: vmscan: do not writeback filesystem pages in direct reclaim

since cgroup reclaim == direct reclaim.
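
To spell out the rule being relied on here, a toy model may help. This
is not kernel code: the enum and both function names are made up for
illustration, and it only encodes what is said above (direct reclaimers
never write filesystem pages, and cgroup reclaim counts as direct
reclaim). The real checks in vmscan are more restrictive still.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code: who is performing reclaim. */
enum reclaimer {
	RECLAIM_KSWAPD,	/* background reclaim thread */
	RECLAIM_DIRECT,	/* allocating task reclaiming on its own behalf */
	RECLAIM_CGROUP,	/* cgroup limit reclaim, which runs as direct reclaim */
};

/* True if the reclaimer runs in direct reclaim context. */
static bool is_direct_reclaim(enum reclaimer who)
{
	return who == RECLAIM_DIRECT || who == RECLAIM_CGROUP;
}

/*
 * Simplified rule from the discussion: direct reclaimers never write
 * back dirty filesystem pages themselves. Only kswapd is still a
 * candidate for calling ->writepage; everyone else waits on the
 * flushers (or on kswapd) to make progress.
 */
static bool may_write_fs_page(enum reclaimer who)
{
	return !is_direct_reclaim(who);
}
```

In this model may_write_fs_page() is false for both RECLAIM_DIRECT and
RECLAIM_CGROUP, so dropping ->writepage changes nothing for them; only
the kswapd case is affected.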