On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit. What are the limits to buffered IO and how
> do we test that? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits it
> makes sense to stick to buffered IO, otherwise how else do we test it?
>
> It is also good to know the limits of buffered IO because some
> workloads cannot use direct IO. For instance PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like a
> useful and sensible thing to do. The good news is we have not found
> regressions with LBS, but all the testing begs the question: what are
> the limits of buffered IO anyway, and how does it scale? Do we know,
> do we care? Do we keep track of it? How does it compare to direct IO
> for some workloads? How big is the delta? How do we best test that?
> How do we automate all that? Do we want to automatically test this to
> avoid regressions?
>
> The obvious issue with some workloads for buffered IO is a possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported a while ago issues with workloads doing random
> reads over a data set 10x the size of RAM and also proposed
> RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> like direct IO with kernel pages and a memcpy(), and for writes it
> requires implementing the same serialization we already do for direct
> IO. There at least seems to be agreement that if we're going to
> provide an enhancement or alternative, we should strive not to repeat
> the mistakes we made with direct IO. The rationale for some workloads
> to use buffered IO is that it helps reduce some tail latencies, so
> that's something to live up to.
>
> On that same thread Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?

The thing to consider here would be an improved O_SYNC. There's a fair
amount of tree walking and thread-to-thread cacheline bouncing that
would be avoided by just calling .write_folios() and kicking bios off
from .write_iter().

OTOH, the way it's done now is probably the best possible way of
splitting up the work between multiple threads, so I'd expect this
approach to get less throughput than current O_SYNC.

Luis, are you profiling these workloads? I haven't looked at
high-throughput profiles of the buffered IO path in years, and that's a
good place to start.
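
For reference, here's a minimal sketch (not from the thread) of the kind
of harness that could exercise the workload Jens described: random
buffered reads over a data set much larger than RAM. The file path,
block size, and run time are made-up parameters. No RWF_UNCACHED value
is hardcoded since that flag was only proposed in [0]; if you build
against headers from that patch set you could OR it into 'flags' to
compare the two paths.

/*
 * Hedged sketch: random buffered-read harness.  Run it against a file
 * much larger than RAM to stress the page cache; the path, block size
 * and duration below are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/test/bigfile";
	size_t bs = 4096;	/* read size per IO */
	int secs = 30;		/* run time in seconds */
	int flags = 0;		/* plain buffered; OR in RWF_UNCACHED if
				 * built against the proposed patches */

	int fd = open(path, O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0 || st.st_size < (off_t)bs) {
		fprintf(stderr, "need a file much larger than RAM\n");
		return 1;
	}

	char *buf = malloc(bs);
	struct iovec iov = { .iov_base = buf, .iov_len = bs };
	off_t nblocks = st.st_size / bs;
	unsigned long long ios = 0;

	srandom(getpid());
	time_t end = time(NULL) + secs;
	while (time(NULL) < end) {
		off_t off = (random() % nblocks) * bs;	/* random aligned offset */
		if (preadv2(fd, &iov, 1, off, flags) < 0) {
			perror("preadv2");
			return 1;
		}
		ios++;
	}

	printf("%llu reads, ~%.1f MB/s\n", ios,
	       ios * (double)bs / (1024 * 1024) / secs);
	free(buf);
	close(fd);
	return 0;
}

Running the same thing with O_DIRECT (and an aligned buffer) would give
a rough first number for the buffered vs direct delta the questions
above are asking about, though a proper comparison would want fio and
perf profiles rather than a toy loop like this.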