On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test, and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit. What are the limits to buffered IO and how
> do we test that? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is good to know the limits of buffered IO too, because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like a
> useful and sensible thing to do. The good news is we have not found
> regressions with LBS, but all the testing seems to beg the question:
> what are the limits of buffered IO anyway, and how does it scale? Do
> we know, do we care? Do we keep track of it? How does it compare to
> direct IO for some workloads? How big is the delta? How do we best
> test that? How do we automate all that? Do we want to automatically
> test this to avoid regressions?
>
> The obvious issue with buffered IO for some workloads is the possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported issues a while ago with workloads doing random
> reads over a data set 10x the size of RAM and proposed RWF_UNCACHED as
> a way to help [0]. As Chinner put it, this seemed more like direct IO
> with kernel pages and a memcpy(), and it requires implementing for
> writes the same serialization we already do for direct IO. There at
> least seems to be agreement that if we're going to provide an
> enhancement or alternative, we should strive not to make the same
> mistakes we made with direct IO. The rationale for some workloads to
> use buffered IO is that it helps reduce some tail latencies, so that's
> something to live up to.
>
> On that same thread, Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?
>
> Chris Mason also listed a few other desirables if we do:
>
>  - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs;
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent
amount of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non-extending, non-appending writes, see if we can
pin the entire range of the pagecache we're writing to, and fall back
to taking the inode lock if we can't.
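Rough sketch of the idea, with illustrative names and simplified error
handling (not the actual patch): walk the range being written, trylock
every folio already present in the pagecache, and bail out to the
ordinary inode-locked path if anything is missing or contended:

#include <linux/pagemap.h>
#include <linux/pagevec.h>

/*
 * Try to pin (get + trylock) every folio over [pos, pos + len).
 * Returns true with all folios locked in *fbatch on success; on any
 * miss or contention, unwinds and returns false so the caller can
 * fall back to the inode-locked buffered write path.
 */
static bool try_pin_pagecache_range(struct address_space *mapping,
                                    loff_t pos, size_t len,
                                    struct folio_batch *fbatch)
{
        pgoff_t index = pos >> PAGE_SHIFT;
        pgoff_t end   = (pos + len - 1) >> PAGE_SHIFT;

        folio_batch_init(fbatch);

        while (index <= end) {
                struct folio *folio = filemap_get_folio(mapping, index);

                if (IS_ERR(folio))              /* not in cache */
                        goto fail;
                if (!folio_trylock(folio)) {
                        folio_put(folio);
                        goto fail;
                }
                if (!folio_batch_add(fbatch, folio)) {
                        folio_unlock(folio);
                        folio_put(folio);
                        goto fail;      /* batch full; keep the sketch simple */
                }
                index = folio_next_index(folio);
        }
        return true;
fail:
        /* drop whatever we pinned; caller takes the inode lock instead */
        for (unsigned int i = 0; i < folio_batch_count(fbatch); i++)
                folio_unlock(fbatch->folios[i]);
        folio_batch_release(fbatch);
        return false;
}

In this sketch the caller would only attempt the lockless path for
writes entirely below i_size and not O_APPEND, and it would still need
something like the pagecache add lock (see below) so hole punch and
truncate can't pull folios out from under us.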
If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely
prevent torn writes and atomicity breakage; we could at least try a
trylock on the inode lock, but I didn't do that here.

For lifting this to the VFS, this needs:

 - My darray code, which I'll be moving to include/linux/ in the 6.9
   merge window

 - My pagecache add lock - we need this for synchronization with hole
   punching and truncate when we don't have the inode lock.

 - My vectorized buffered write path lifted to filemap.c, which means
   we need some sort of vectorized replacement for .write_begin and
   .write_end (rough sketch of one possible shape below)
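For that last item, purely as a hypothetical illustration of the shape
(nothing like this exists in address_space_operations today, and the
names are made up), a batched interface might look something like:

#include <linux/fs.h>
#include <linux/pagevec.h>

/*
 * Hypothetical sketch: a batched replacement for .write_begin/.write_end
 * that prepares and then commits all the folios covering [pos, pos + len)
 * in one call, so the generic buffered write path can copy into the whole
 * pinned range at once instead of folio-at-a-time.
 */
struct batched_write_aops {
        int (*write_begin_batch)(struct file *file,
                                 struct address_space *mapping,
                                 loff_t pos, size_t len,
                                 struct folio_batch *fbatch, void **fsdata);
        int (*write_end_batch)(struct file *file,
                               struct address_space *mapping,
                               loff_t pos, size_t len, size_t copied,
                               struct folio_batch *fbatch, void *fsdata);
};

Whether that ends up looking anything like this is an open question; the
point is just that the begin/end hooks would need to hand the write path
a whole batch of folios rather than one at a time.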