Re: remove iomap_writepage v2

Johannes Weiner <hannes@xxxxxxxxxxx> · Mon, 1 Aug 2022 11:31:50 -0400

On Fri, Jul 29, 2022 at 04:11:45PM +0200, Christoph Hellwig wrote:
> On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote:
> > There is some context missing because it's not clear what the full impact is
> > but it is definitely the case that writepage is ignored in some contexts for
> > common filesystems so let's assume that writepage from reclaim context always
> > failed as a worst case scenario. Certainly this type of change is something
> > linux-mm needs to be aware of because we've been blind-sided before.
> 
> Between willy and Johannes pushing for it I was under the strong assumption
> that linux-mm knows of it..

Yes, the context was an FS session at LSFMM. FS folks complained about
the MM relying on single-page writeouts. On the MM side we've invested
a lot into eliminating this dependency over the last decade or so, and
I would argue it's gone today. But the calls are still in the code,
and so FS folks continue to operate under the old assumption. You
can't blame them. I suggested we remove the callbacks to clarify
things and eliminate that murky corner from the FS/MM interface.

Compaction/migration is easy. It simply never calls writepage when
there is a migratepage callback - which major filesystems have.
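
As a rough userspace model of that dispatch (a sketch only, not the code
in mm/migrate.c; every name below is made up for illustration, and the
fallback is simplified: it ignores the migration mode, which in the real
fallback path also decides whether a dirty page is written out at all):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and address_space_operations. */
struct page_model {
	bool dirty;
};

struct aops_model {
	int (*migratepage)(struct page_model *page);
	int (*writepage)(struct page_model *page);
};

enum migrate_path {
	PATH_FS_MIGRATEPAGE,    /* filesystem callback handles it, no writeout */
	PATH_FALLBACK_WRITEOUT, /* no callback, dirty page: writepage is the only option */
	PATH_FALLBACK_COPY,     /* no callback, nothing to write: plain copy */
};

/* Placeholder callback so the table below can be exercised. */
static int stub_op(struct page_model *page)
{
	(void)page;
	return 0;
}

/* Pick the path a page would take in this simplified model. */
static enum migrate_path pick_migrate_path(const struct aops_model *ops,
					   const struct page_model *page)
{
	if (ops->migratepage)
		return PATH_FS_MIGRATEPAGE;
	if (page->dirty && ops->writepage)
		return PATH_FALLBACK_WRITEOUT;
	return PATH_FALLBACK_COPY;
}
```

With a migratepage callback set, the writeout branch is unreachable in
this model, which is the point being made above.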

Reclaim may still call it, but the invocation rules are so restrictive
nowadays that it's unlikely to actually help when it matters (and we
know it makes things worse in many cases).

For example, cgroup reclaim isn't ever allowed to call writepage.
This covers the small system scenario.
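
To make "restrictive" concrete, here is a small userspace model of the
gate (a sketch, not the code in mm/vmscan.c; the struct and function
names are hypothetical, and the three conditions are my reading of the
dirty file page check in shrink_page_list() at the time of writing):

```c
#include <stdbool.h>

/* Hypothetical snapshot of reclaim state; not a kernel structure. */
struct reclaim_ctx {
	bool is_kswapd;        /* false for direct reclaim, and so for cgroup reclaim */
	bool page_seen_before; /* page was already tagged for reclaim on an earlier pass */
	bool node_many_dirty;  /* node is flagged as having many dirty pages at the LRU tail */
};

/* Illustrative predicate: may reclaim write this dirty file page itself? */
static bool may_call_writepage(const struct reclaim_ctx *ctx)
{
	if (!ctx->is_kswapd)
		return false; /* direct and cgroup reclaim never write file pages */
	if (!ctx->page_seen_before)
		return false; /* first encounter: leave the page to the flushers */
	if (!ctx->node_many_dirty)
		return false; /* clean pages are likely still available elsewhere */
	return true;
}
```

All three conditions have to hold at once, so in this model the call is
only reachable from kswapd on a second pass over a heavily dirty node.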

Whether writepage helps under OOM is not clear. OOMing systems tend to
have thundering herds of direct reclaimers, any one of which can
declare OOM if they fail, yet none of them can write pages. They
already rely on another thread to make progress. That thread can be
the flushers writing in offset order, or kswapd writing in LRU
order. You could argue that LRU order, while less efficient IO,
launders pages closer to the scanner. From an OOM perspective that
doesn't matter, though, because scanners will work through the entire
LRU list several times before giving up. They won't miss flusher
progress. That leaves reclaim efficiency - having to scan fewer pages
before potentially finding clean ones. But it's an IO bound scenario,
so arguably efficient IO would seem more important than efficient CPU.

The risk of this change, IMO, is exposing reclaim to flat out bugs in
the flusher code, or bugs in the code that matches reclaim to flushing
speed. However, a) cgroup has been relying on those for a decade, and
b) we've been treating writepage calls like bugs due to the latency
they inject into workloads, and tuned the MM to rely more on flushers
(e.g. c55e8d035b28 ("mm: vmscan: move dirty pages out of the way until
they're flushed")). So we know this stuff works at scale and with real
workloads. I think the risk of dragons there is quite low.

XFS hasn't had a ->writepage call for a while. After LSF I internally
tested dropping btrfs' callback, and the results looked good: no OOM
kills with dirty/writeback pages remaining, performance parity. Then I
went on vacation and Christoph beat me to the patch :)

I think it's a really good cleanup that makes things simpler and more
predictable in both the fs and the mm.

> > I don't think it would be incredibly damaging although there *might* be
> > issues with small systems or cgroups. 
> 
> Johannes specifically mentioned that cgroup writeback will never call
> into ->writepage anyway.

Yes, cgroup has relied on the flushers since

commit ee72886d8ed5d9de3fa0ed3b99a7ca7702576a96
Author: Mel Gorman <mel@xxxxxxxxx>
Date:   Mon Oct 31 17:07:38 2011 -0700

    mm: vmscan: do not writeback filesystem pages in direct reclaim

because cgroup reclaim == direct reclaim.