在 2024/3/1 8:40, Dave Chinner 写道:
Hi Dave, thanks for your detailed and nice suggestions, I have a few
questions below.
On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
It will be more efficient to execute quick endio process(eg. non-sync
overwriting case) under irq process rather than starting a worker to
do it.
Add a flag to control DIO to be finished inline(under irq context), which
can be used for non-sync overwriting case.
Besides, skip invalidating pages if DIO is finished inline, which will
keep the same logic with dio_bio_end_aio in non-sync overwriting case.
Signed-off-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>
A nice idea, but I don't think an ext4 specific API flag is the
right way to go about enabling this. The iomap dio code knows if
the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
pure overwrite FUA optimisations.
That is:
/*
* Use a FUA write if we need datasync semantics, this is a pure
* data IO that doesn't require any metadata updates (including
* after IO completion such as unwritten extent conversion) and
* the underlying device either supports FUA or doesn't have
* a volatile write cache. This allows us to avoid cache flushes
* on IO completion. If we can't use writethrough and need to
* sync, disable in-task completions as dio completion will
* need to call generic_write_sync() which will do a blocking
* fsync / cache flush call.
*/
if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
(dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
(bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
use_fua = true;
Hence if we want to optimise pure overwrites that have no data sync
requirements, we already have the detection and triggers in place to
do this. We just need to change the way we set up the IO flags to
allow write-through (i.e. non-blocking IO completions) to use inline
completions.
In __iomap_dio_rw():
+ /* Always try to complete inline. */
+ dio->flags |= IOMAP_DIO_INLINE_COMP;
if (iov_iter_rw(iter) == READ) {
- /* reads can always complete inline */
- dio->flags |= IOMAP_DIO_INLINE_COMP;
....
} else {
+ /* Always try write-through semantics. If we can't
+ * use writethough, it will be disabled along with
+ * IOMAP_DIO_INLINE_COMP before dio completion is run
+ * so it can be deferred to a task completion context
+ * appropriately.
+ */
+ dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;
There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH
unconditionally, non-datasync IO will be set with REQ_FUA, which means
that device will flush writecache for each IO, will it affect the
performance in non-sync dio case?
iomi.flags |= IOMAP_WRITE;
- dio->flags |= IOMAP_DIO_WRITE;
.....
/* for data sync or sync, we need sync completion processing */
if (iocb_is_dsync(iocb)) {
dio->flags |= IOMAP_DIO_NEED_SYNC;
- /*
- * For datasync only writes, we optimistically try using
- * WRITE_THROUGH for this IO. This flag requires either
- * FUA writes through the device's write cache, or a
- * normal write to a device without a volatile write
- * cache. For the former, Any non-FUA write that occurs
- * will clear this flag, hence we know before completion
- * whether a cache flush is necessary.
- */
- if (!(iocb->ki_flags & IOCB_SYNC))
- dio->flags |= IOMAP_DIO_WRITE_THROUGH;
+ * For sync writes we know we are going to need
+ * blocking completion processing, so turn off
+ * writethrough now.
+ */
if (iocb->ki_flags & IOCB_SYNC) {
dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
IOMAP_DIO_INLINE_COMP);
}
}
[...]
However, this does mean that any spinlock taken in the ->end_io()
callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
the spinlock protection around inode size updates will need to use
an irq safe locking, as will the locking in the DIO submission path
that it serialises against in xfs_file_write_checks(). That probably
is best implemented as a separate spinlock.
There will also be other filesystems that need to set IOMAP_F_DIRTY
unconditionally (e.g. zonefs) because they always take blocking
locks in their ->end_io callbacks and so must always run in task
context...
Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the
endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that
the metadata needs to be written, if we add a new semantics(endio must
be done in defered work) for this flag, the code will looks a little
complicated.