On Thu, Jun 22, 2023 at 03:55:52AM +0100, Matthew Wilcox wrote:
> On Thu, Jun 22, 2023 at 11:55:23AM +1000, Dave Chinner wrote:
> > Ok, so having spent a bit more thought on this away from the office
> > this morning, I think there is a generic way we can avoid deferring
> > completions for pure overwrites.
>
> OK, this is how we can, but should we?  The same amount of work
> needs to be done, no matter whether we do it in interrupt context or
> workqueue context.  Doing it in interrupt context has lower latency,
> but maybe allows us to batch up the work and so get better bandwidth.
> And we can't handle other interrupts while we're handling this one,
> so from a whole-system perspective, I think we'd rather do the work
> in the workqueue.

Yup, I agree with you there, but I can also be easily convinced that
optimising the pure in-place DIO overwrite path is worth the effort.

> Latency is important for reads, but why is it important for writes?
> There's such a thing as a dependent read, but writes are usually
> buffered and we can wait as long as we like for a write to complete.

The OP cares about async direct IO performance, not buffered writes.

And for DIO writes, there is most definitely such a thing as
"dependent writes". Think about journalled data - you can't overwrite
data in place until the data write to the journal has first completed
all the way down to stable storage. i.e. there's an inherent IO
completion-to-submission write ordering constraint in the algorithm,
and so we have dependent writes.

And that's the whole point of the DIO write FUA optimisations in
iomap: they avoid the dependent "write" that provides data integrity.
i.e. the journal flush and/or device cache flush that
generic_write_sync() issues in IO completion is a dependent write,
because it cannot start until all the data being written has reached
the device entirely.

Using completion-to-submission ordering of the integrity operations
means we don't need to block other IOs to the same file, other
journal operations in the filesystem, or other data IO to provide the
data integrity requirement for the specific O_DSYNC DIO write. If we
can use an FUA write for this instead of a separate cache flush, then
we end up providing O_DSYNC writes with about 40% lower completion
latency than a "write + cache flush" sequential IO pair.

This means that things like high performance databases improve
throughput by 25-50%, and operational latency goes down by ~30-40%,
if we can make extensive use of FUA writes to provide the desired
data integrity guarantees.

From that perspective, an application doing pure overwrites with
ordering dependencies might actually be very dependent on minimising
individual DIO write latency for overall performance...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
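
To make the FUA decision described above concrete, here is a minimal
standalone sketch of the check. It is not the actual
fs/iomap/direct-io.c code; the struct, field, and function names are
illustrative stand-ins for the real iomap flags:

/*
 * Minimal model of the decision above: an O_DSYNC DIO write can carry
 * its own integrity guarantee via a REQ_FUA write, instead of a
 * separate post-write cache flush, only when it is a pure overwrite
 * of allocated, written blocks on a device that supports FUA.
 * All names here are illustrative, not the kernel's.
 */
#include <stdbool.h>
#include <stdio.h>

struct dio_write {
	bool dsync;            /* caller asked for O_DSYNC semantics */
	bool pure_overwrite;   /* extent is allocated and marked written */
	bool needs_zeroout;    /* sub-block write needs zeroing around it */
	bool dev_supports_fua; /* device honours REQ_FUA natively */
};

/* Can this write provide its own integrity via FUA? */
static bool can_use_fua(const struct dio_write *w)
{
	return w->dsync && w->pure_overwrite &&
	       !w->needs_zeroout && w->dev_supports_fua;
}

int main(void)
{
	struct dio_write w = {
		.dsync = true,
		.pure_overwrite = true,
		.needs_zeroout = false,
		.dev_supports_fua = true,
	};

	if (can_use_fua(&w))
		printf("submit with REQ_FUA, skip the completion flush\n");
	else
		printf("plain write, then flush in IO completion\n");
	return 0;
}

The point of the pure_overwrite and needs_zeroout checks is that FUA
only guarantees the data itself reaches stable media; a write that
still requires metadata changes (block allocation, unwritten extent
conversion, sub-block zeroing) still needs the journal flush and/or
cache flush in IO completion, so it cannot take this path.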