On Thu, Jun 22, 2023 at 03:55:52AM +0100, Matthew Wilcox wrote:
> On Thu, Jun 22, 2023 at 11:55:23AM +1000, Dave Chinner wrote:
> > Ok, so having spent a bit more thought on this away from the office
> > this morning, I think there is a generic way we can avoid deferring
> > completions for pure overwrites.
>
> OK, this is how we can, but should we?  The same amount of work
> needs to be done, no matter whether we do it in interrupt context or
> workqueue context.  Doing it in interrupt context has lower latency,
> but maybe allows us to batch up the work and so get better bandwidth.
> And we can't handle other interrupts while we're handling this one,
> so from a whole-system perspective, I think we'd rather do the work
> in the workqueue.

Yup, I agree with you there, but I can also be easily convinced that
optimising the pure in-place DIO overwrite path is worth the effort.

> Latency is important for reads, but why is it important for writes?
> There's such a thing as a dependent read, but writes are usually
> buffered and we can wait as long as we like for a write to complete.

The OP cares about async direct IO performance, not buffered writes.

And for DIO writes, there is most definitely such a thing as
"dependent writes". Think about journalled data - you can't overwrite
data in place until the data write to the journal has first completed
all the way down to stable storage. i.e. there's an inherent IO
completion-to-submission write ordering constraint in the algorithm,
and so we have dependent writes.

And that's the whole point of the DIO write FUA optimisations in
iomap: they avoid the dependent "write" that provides data integrity.
i.e. the journal flush and/or device cache flush that
generic_write_sync() issues in IO completion is a dependent write,
because it cannot start until all the data being written has reached
the device entirely.

Using completion-to-submission ordering of the integrity operations
means we don't need to block other IOs to the same file, other
journal operations in the filesystem, or other data IO to provide the
data integrity requirement for the specific O_DSYNC DIO write. If we
can use an FUA write for this instead of a separate cache flush, then
we end up providing O_DSYNC writes with about 40% lower completion
latency than a "write + cache flush" sequential IO pair.

This means that things like high performance databases improve
throughput by 25-50%, and operational latency goes down by ~30-40%,
if we can make extensive use of FUA writes to provide the desired
data integrity guarantees.

From that perspective, an application doing pure overwrites with
ordering dependencies might actually be very dependent on minimising
individual DIO write latency for overall performance...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
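
To make the FUA decision described above concrete, here is a minimal
standalone sketch of the check. It is not the actual
fs/iomap/direct-io.c code; the struct, field, and function names are
illustrative stand-ins for the real iomap flags:

/*
 * Minimal model of the decision above: an O_DSYNC DIO write can carry
 * its own integrity guarantee via a REQ_FUA write, instead of a
 * separate post-write cache flush, only when it is a pure overwrite
 * of allocated, written blocks on a device that supports FUA.
 * All names here are illustrative, not the kernel's.
 */
#include <stdbool.h>
#include <stdio.h>

struct dio_write {
	bool dsync;            /* caller asked for O_DSYNC semantics */
	bool pure_overwrite;   /* extent is allocated and marked written */
	bool needs_zeroout;    /* sub-block write needs zeroing around it */
	bool dev_supports_fua; /* device honours REQ_FUA natively */
};

/* Can this write provide its own integrity via FUA? */
static bool can_use_fua(const struct dio_write *w)
{
	return w->dsync && w->pure_overwrite &&
	       !w->needs_zeroout && w->dev_supports_fua;
}

int main(void)
{
	struct dio_write w = {
		.dsync = true,
		.pure_overwrite = true,
		.needs_zeroout = false,
		.dev_supports_fua = true,
	};

	if (can_use_fua(&w))
		printf("submit with REQ_FUA, skip the completion flush\n");
	else
		printf("plain write, then flush in IO completion\n");
	return 0;
}

The point of the pure_overwrite and needs_zeroout checks is that FUA
only guarantees the data itself reaches stable media; a write that
still requires metadata changes (block allocation, unwritten extent
conversion, sub-block zeroing) still needs the journal flush and/or
cache flush in IO completion, so it cannot take this path.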