[PATCH RFC 0/2] xfs: optimize truncate cache flushing

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 28 Oct 2022 09:04:09 -0400

Hi all,

We have a customer that has complained about a significant performance
impact caused by commit 869ae85dae64 ("xfs: flush new eof page on
truncate to avoid post-eof corruption"). The workload is a fairly simple
repetitive, small, fixed-size buffered write followed by a truncate. The
truncate now unconditionally involves multiple flushes where previously,
once the eof block was converted, only one flush was involved.

My first reaction when revisiting the aforementioned commit was the same
as for when that commit was first introduced.. that we really shouldn't
need to flush at all here for zeroing to work correctly. Instead, the
iomap zeroing mechanism should one way or another (either in iomap or
via the fs callbacks) detect scenarios like a dirty page over an
unwritten extent and do the right thing (tm). From there, I tried a few
different experiments to get something working along those lines, but so
far haven't been able to come up with anything that is sufficiently
elegant or otherwise not too ugly or racy.

I still think something along those lines is the proper long term
solution because it addresses other scenarios where zeroing might not
technically do the right thing, but given the significance of the
performance/customer impact and implementation complexity I've fallen
back to something more incremental to try and disconnect user impact
from longer term solutions. With that in mind, the most straightforward
approach to me seemed to be to just check for the cases we explicitly
care about flushing and to skip the flush otherwise. This is
implemented by this RFC and restores original behavior for the simple
overwrite+truncate scenario.

Of course there are various other options to achieve a similar result.
Some examples:

1. Simply skip the xfs_truncate_page() branch in the (newsize ==
oldsize) scenario. This is technically a subtle behavior change in that
something like a post-eof mapped write is no longer zeroed by a
truncate. That might be reasonable as the page should still be partially
zeroed at writeback time, but it may require a bit of an audit to ensure
there are no other problems. I was initially uneasy about this approach
when I first explored it, but IIRC it did survive an fstests run, and
the priority of the regression and hopefully temporary nature has me
considering it as more acceptable.

2. Do something like unconditionally zero the eof page regardless of the
underlying extent state. This has an obvious tradeoff of unconditional
extent conversion if an unwritten block had not previously been written
by the user, but also it's not totally clear to me that such a simple
behavior change will necessarily have a simple implementation.

3. A slightly different variant of option #2 that comes to mind is to
take advantage of the fact that pagecache truncate already does partial
page zeroing (but not dirtying) and manually handle the race that the
flush is trying to prevent. For example, first look up the EOF page and
track whether it 1. exists and 2. is dirty. If so, hold a reference to
the page, call the iomap zeroing bits, call truncate_setsize() (which
falls into pagecache truncation), and then redirty the page if
(did_zeroing == false) and the page was originally dirty before we
started.

IIRC in most cases I think pagecache truncate is really what prevents
observable problems here because it does zero the eof page and for the
case of a dirty page over an unwritten extent, writeback is more likely
to complete before the iomap zeroing or after the pagecache zeroing vs.
within that race window. So the idea here would just be to ensure the
pagecache truncate zeroing makes it to disk if iomap doesn't zero. FWIW,
I hadn't yet experimented with this idea as I wrote this, and even
though it's clearly still a temporary hack, it sounded potentially
simple enough that I put a quick prototype together. I'll post that in
reply to this cover letter.

4. Given that the second filemap_write_and_wait_range() doesn't actually
wait for submitted I/O to complete for this particular workload [1],
consider whether we want to conditionally elide the second flush to
restore historically "single flush" behavior. I've quickly tested this
and it seems to show less consistent of an improvement for whatever
reason, but regardless I suspect this is too devious to consider even as
a temporary fix.

So all in all none of these options are proper long term solutions, but
the long term solutions all seem complex enough to me that I'd prefer to
take the incremental step of a mitigating fix before getting too deep
into the weeds on fixing zeroing properly.

Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-mm/20221028125428.976549-1-bfoster@xxxxxxxxxx/

Brian Foster (2):
  xfs: lift truncate iomap zeroing into a new helper
  xfs: optimize eof page flush for iomap zeroing on truncate

 fs/xfs/xfs_iops.c | 84 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 18 deletions(-)

-- 
2.37.3