On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test, and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit. What are the limits to buffered IO and how
> do we test that? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is good to know the limits of buffered IO too, because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like a
> useful and sensible thing to do. The good news is we have not found
> regressions with LBS, but all the testing seems to beg the question:
> what are the limits of buffered IO anyway, and how does it scale? Do
> we know, do we care? Do we keep track of it? How does it compare to
> direct IO for some workloads? How big is the delta? How do we best
> test that? How do we automate all that? Do we want to automatically
> test this to avoid regressions?
>
> The obvious issue with buffered IO for some workloads is the possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported issues a while ago with workloads doing random
> reads over a data set 10x the size of RAM and proposed RWF_UNCACHED as
> a way to help [0]. As Chinner put it, this seemed more like direct IO
> with kernel pages and a memcpy(), and it requires implementing for
> writes the same serialization we already do for direct IO. There at
> least seems to be agreement that if we're going to provide an
> enhancement or alternative, we should strive not to make the same
> mistakes we made with direct IO. The rationale for some workloads to
> use buffered IO is that it helps reduce some tail latencies, so that's
> something to live up to.
>
> On that same thread, Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?
>
> Chris Mason also listed a few other desirables if we do:
>
>  - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs;
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent
amount of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non-extending, non-appending writes, see if we can
pin the entire range of the pagecache we're writing to, and fall back
to taking the inode lock if we can't.
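Rough sketch of the idea, with illustrative names and simplified error
handling (not the actual patch): walk the range being written, trylock
every folio already present in the pagecache, and bail out to the
ordinary inode-locked path if anything is missing or contended:

#include <linux/pagemap.h>
#include <linux/pagevec.h>

/*
 * Try to pin (get + trylock) every folio over [pos, pos + len).
 * Returns true with all folios locked in *fbatch on success; on any
 * miss or contention, unwinds and returns false so the caller can
 * fall back to the inode-locked buffered write path.
 */
static bool try_pin_pagecache_range(struct address_space *mapping,
                                    loff_t pos, size_t len,
                                    struct folio_batch *fbatch)
{
        pgoff_t index = pos >> PAGE_SHIFT;
        pgoff_t end   = (pos + len - 1) >> PAGE_SHIFT;

        folio_batch_init(fbatch);

        while (index <= end) {
                struct folio *folio = filemap_get_folio(mapping, index);

                if (IS_ERR(folio))              /* not in cache */
                        goto fail;
                if (!folio_trylock(folio)) {
                        folio_put(folio);
                        goto fail;
                }
                if (!folio_batch_add(fbatch, folio)) {
                        folio_unlock(folio);
                        folio_put(folio);
                        goto fail;      /* batch full; keep the sketch simple */
                }
                index = folio_next_index(folio);
        }
        return true;
fail:
        /* drop whatever we pinned; caller takes the inode lock instead */
        for (unsigned int i = 0; i < folio_batch_count(fbatch); i++)
                folio_unlock(fbatch->folios[i]);
        folio_batch_release(fbatch);
        return false;
}

In this sketch the caller would only attempt the lockless path for
writes entirely below i_size and not O_APPEND, and it would still need
something like the pagecache add lock (see below) so hole punch and
truncate can't pull folios out from under us.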
If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely
prevent torn writes and atomicity breakage; we could at least try a
trylock on the inode lock, but I didn't do that here.

For lifting this to the VFS, this needs:

 - My darray code, which I'll be moving to include/linux/ in the 6.9
   merge window

 - My pagecache add lock - we need this for synchronization with hole
   punching and truncate when we don't have the inode lock.

 - My vectorized buffered write path lifted to filemap.c, which means
   we need some sort of vectorized replacement for .write_begin and
   .write_end (rough sketch of one possible shape below)
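For that last item, purely as a hypothetical illustration of the shape
(nothing like this exists in address_space_operations today, and the
names are made up), a batched interface might look something like:

#include <linux/fs.h>
#include <linux/pagevec.h>

/*
 * Hypothetical sketch: a batched replacement for .write_begin/.write_end
 * that prepares and then commits all the folios covering [pos, pos + len)
 * in one call, so the generic buffered write path can copy into the whole
 * pinned range at once instead of folio-at-a-time.
 */
struct batched_write_aops {
        int (*write_begin_batch)(struct file *file,
                                 struct address_space *mapping,
                                 loff_t pos, size_t len,
                                 struct folio_batch *fbatch, void **fsdata);
        int (*write_end_batch)(struct file *file,
                               struct address_space *mapping,
                               loff_t pos, size_t len, size_t copied,
                               struct folio_batch *fbatch, void *fsdata);
};

Whether that ends up looking anything like this is an open question; the
point is just that the begin/end hooks would need to hand the write path
a whole batch of folios rather than one at a time.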