On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the cache control settings into the queue_limits so that they
> can be set atomically and all I/O is frozen when changing the
> flags.

"...so that they can be set atomically with the device queue frozen
when changing the flags." may be better.
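(For reference, the update sequence in question, as used in the
queue_wc_store() change further down; a sketch only, with "q" being the
device request queue:

	struct queue_limits lim;
	int err;

	lim = queue_limits_start_update(q);
	lim.features |= BLK_FEAT_WRITE_CACHE;
	err = queue_limits_commit_update(q, &lim);

with I/O on q frozen while the new limits are committed.)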
> 
> Add new features and flags field for the driver set flags, and internal
> (usually sysfs-controlled) flags in the block layer. Note that we'll
> eventually remove enough field from queue_limits to bring it back to the
> previous size.
> 
> The disable flag is inverted compared to the previous meaning, which
> means it now survives a rescan, similar to the max_sectors and
> max_discard_sectors user limits.
> 
> The FLUSH and FUA flags are now inherited by blk_stack_limits, which
> simplified the code in dm a lot, but also causes a slight behavior
> change in that dm-switch and dm-unstripe now advertise a write cache
> despite setting num_flush_bios to 0. The I/O path will handle this
> gracefully, but as far as I can tell the lack of num_flush_bios
> and thus flush support is a pre-existing data integrity bug in those
> targets that really needs fixing, after which a non-zero num_flush_bios
> should be required in dm for targets that map to underlying devices.
> 
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  .../block/writeback_cache_control.rst | 67 +++++++++++--------
>  arch/um/drivers/ubd_kern.c            |  2 +-
>  block/blk-core.c                      |  2 +-
>  block/blk-flush.c                     |  9 ++-
>  block/blk-mq-debugfs.c                |  2 -
>  block/blk-settings.c                  | 29 ++------
>  block/blk-sysfs.c                     | 29 +++++---
>  block/blk-wbt.c                       |  4 +-
>  drivers/block/drbd/drbd_main.c        |  2 +-
>  drivers/block/loop.c                  |  9 +--
>  drivers/block/nbd.c                   | 14 ++--
>  drivers/block/null_blk/main.c         | 12 ++--
>  drivers/block/ps3disk.c               |  7 +-
>  drivers/block/rnbd/rnbd-clt.c         | 10 +--
>  drivers/block/ublk_drv.c              |  8 ++-
>  drivers/block/virtio_blk.c            | 20 ++++--
>  drivers/block/xen-blkfront.c          |  9 ++-
>  drivers/md/bcache/super.c             |  7 +-
>  drivers/md/dm-table.c                 | 39 +++--------
>  drivers/md/md.c                       |  8 ++-
>  drivers/mmc/core/block.c              | 42 ++++++------
>  drivers/mmc/core/queue.c              | 12 ++--
>  drivers/mmc/core/queue.h              |  3 +-
>  drivers/mtd/mtd_blkdevs.c             |  5 +-
>  drivers/nvdimm/pmem.c                 |  4 +-
>  drivers/nvme/host/core.c              |  7 +-
>  drivers/nvme/host/multipath.c         |  6 --
>  drivers/scsi/sd.c                     | 28 +++++---
>  include/linux/blkdev.h                | 38 +++++++++-
>  29 files changed, 227 insertions(+), 207 deletions(-)
> 
> diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst
> index b208488d0aae85..9cfe27f90253c7 100644
> --- a/Documentation/block/writeback_cache_control.rst
> +++ b/Documentation/block/writeback_cache_control.rst
> @@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache flushing and how
>  the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags
>  may both be set on a single bio.
>  
> +Feature settings for block drivers
> +----------------------------------
>  
> -Implementation details for bio based block drivers
> --------------------------------------------------------------
> +For devices that do not support volatile write caches there is no driver
> +support required, the block layer completes empty REQ_PREFLUSH requests before
> +entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
> +requests that have a payload.
>  
> -These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
> -directly below the submit_bio interface. For remapping drivers the REQ_FUA
> -bits need to be propagated to underlying devices, and a global flush needs
> -to be implemented for bios with the REQ_PREFLUSH bit set. For real device
> -drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
> -on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
> -data can be completed successfully without doing any work. Drivers for
> -devices with volatile caches need to implement the support for these
> -flags themselves without any help from the block layer.
> +For devices with volatile write caches the driver needs to tell the block layer
> +that it supports flushing caches by setting the
>  
> +   BLK_FEAT_WRITE_CACHE
>  
> -Implementation details for request_fn based block drivers
> ----------------------------------------------------------
> +flag in the queue_limits feature field. For devices that also support the FUA
> +bit the block layer needs to be told to pass on the REQ_FUA bit by also setting
> +the
>  
> -For devices that do not support volatile write caches there is no driver
> -support required, the block layer completes empty REQ_PREFLUSH requests before
> -entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
> -requests that have a payload. For devices with volatile write caches the
> -driver needs to tell the block layer that it supports flushing caches by
> -doing::
> +   BLK_FEAT_FUA
> +
> +flag in the features field of the queue_limits structure.
> +
> +Implementation details for bio based block drivers
> +--------------------------------------------------
> +
> +For bio based drivers the REQ_PREFLUSH and REQ_FUA bit are simplify passed on
> +to the driver if the drivers sets the BLK_FEAT_WRITE_CACHE flag and the drivers
> +needs to handle them.
> +
> +*NOTE*: The REQ_FUA bit also gets passed on when the BLK_FEAT_FUA flags is
> +_not_ set. Any bio based driver that sets BLK_FEAT_WRITE_CACHE also needs to
> +handle REQ_FUA.
>  
> -	blk_queue_write_cache(sdkp->disk->queue, true, false);
> +For remapping drivers the REQ_FUA bits need to be propagated to underlying
> +devices, and a global flush needs to be implemented for bios with the
> +REQ_PREFLUSH bit set.
>  
> -and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that
> -REQ_PREFLUSH requests with a payload are automatically turned into a sequence
> -of an empty REQ_OP_FLUSH request followed by the actual write by the block
> -layer. For devices that also support the FUA bit the block layer needs
> -to be told to pass through the REQ_FUA bit using::
> +Implementation details for blk-mq drivers
> +-----------------------------------------
>  
> -	blk_queue_write_cache(sdkp->disk->queue, true, true);
> +When the BLK_FEAT_WRITE_CACHE flag is set, REQ_OP_WRITE | REQ_PREFLUSH requests
> +with a payload are automatically turned into a sequence of a REQ_OP_FLUSH
> +request followed by the actual write by the block layer.
>  
> -and the driver must handle write requests that have the REQ_FUA bit set
> -in prep_fn/request_fn. If the FUA bit is not natively supported the block
> -layer turns it into an empty REQ_OP_FLUSH request after the actual write.
> +When the BLK_FEA_FUA flags is set, the REQ_FUA bit simplify passed on for the

s/BLK_FEA_FUA/BLK_FEAT_FUA

> +REQ_OP_WRITE request, else a REQ_OP_FLUSH request is sent by the block layer
> +after the completion of the write request for bio submissions with the REQ_FUA
> +bit set.
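Maybe also worth adding a short example to the new "Feature settings"
section? Something like this sketch ("fua" standing in for whatever the
driver knows about its device, with the limits then passed to
blk_mq_alloc_disk()):

	struct queue_limits lim = {
		.features	= BLK_FEAT_WRITE_CACHE,
	};

	if (fua)
		lim.features |= BLK_FEAT_FUA;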
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 5c787965b7d09e..4f524c1d5e08bd 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -423,32 +423,41 @@ static ssize_t queue_io_timeout_store(struct request_queue *q, const char *page,
>  
>  static ssize_t queue_wc_show(struct request_queue *q, char *page)
>  {
> -	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
> -		return sprintf(page, "write back\n");
> -
> -	return sprintf(page, "write through\n");
> +	if (q->limits.features & BLK_FLAGS_WRITE_CACHE_DISABLED)
> +		return sprintf(page, "write through\n");
> +	return sprintf(page, "write back\n");
>  }
>  
>  static ssize_t queue_wc_store(struct request_queue *q, const char *page,
>  		size_t count)
>  {
> +	struct queue_limits lim;
> +	bool disable;
> +	int err;
> +
>  	if (!strncmp(page, "write back", 10)) {
> -		if (!test_bit(QUEUE_FLAG_HW_WC, &q->queue_flags))
> -			return -EINVAL;
> -		blk_queue_flag_set(QUEUE_FLAG_WC, q);
> +		disable = false;
>  	} else if (!strncmp(page, "write through", 13) ||
> -		!strncmp(page, "none", 4)) {
> -		blk_queue_flag_clear(QUEUE_FLAG_WC, q);
> +		!strncmp(page, "none", 4)) {
> +		disable = true;
>  	} else {
>  		return -EINVAL;
>  	}

I think you can drop the curly brackets for this chain of if-else-if-else.
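That is:

	if (!strncmp(page, "write back", 10))
		disable = false;
	else if (!strncmp(page, "write through", 13) ||
		 !strncmp(page, "none", 4))
		disable = true;
	else
		return -EINVAL;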
> 
> +	lim = queue_limits_start_update(q);
> +	if (disable)
> +		lim.flags |= BLK_FLAGS_WRITE_CACHE_DISABLED;
> +	else
> +		lim.flags &= ~BLK_FLAGS_WRITE_CACHE_DISABLED;
> +	err = queue_limits_commit_update(q, &lim);
> +	if (err)
> +		return err;
>  	return count;
>  }
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index fd789eeb62d943..fbe125d55e25b4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1686,34 +1686,16 @@ int dm_calculate_queue_limits(struct dm_table *t,
>  	return validate_hardware_logical_block_alignment(t, limits);
>  }
>  
> -static int device_flush_capable(struct dm_target *ti, struct dm_dev *dev,
> -				sector_t start, sector_t len, void *data)
> -{
> -	unsigned long flush = (unsigned long) data;
> -	struct request_queue *q = bdev_get_queue(dev->bdev);
> -
> -	return (q->queue_flags & flush);
> -}
> -
> -static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush)
> +/*
> + * Check if an target requires flush support even if none of the underlying

s/an/a

> + * devices need it (e.g. to persist target-specific metadata).
> + */
> +static bool dm_table_supports_flush(struct dm_table *t)
>  {
> -	/*
> -	 * Require at least one underlying device to support flushes.
> -	 * t->devices includes internal dm devices such as mirror logs
> -	 * so we need to use iterate_devices here, which targets
> -	 * supporting flushes must provide.
> -	 */
>  	for (unsigned int i = 0; i < t->num_targets; i++) {
>  		struct dm_target *ti = dm_table_get_target(t, i);
>  
> -		if (!ti->num_flush_bios)
> -			continue;
> -
> -		if (ti->flush_supported)
> -			return true;
> -
> -		if (ti->type->iterate_devices &&
> -		    ti->type->iterate_devices(ti, device_flush_capable, (void *) flush))
> +		if (ti->num_flush_bios && ti->flush_supported)
>  			return true;
>  	}
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c792d4d81e5fcc..4e8931a2c76b07 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -282,6 +282,28 @@ static inline bool blk_op_is_passthrough(blk_opf_t op)
>  	return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
>  }
>  
> +/* flags set by the driver in queue_limits.features */
> +enum {
> +	/* supports a a volatile write cache */

Repeated "a".

> +	BLK_FEAT_WRITE_CACHE	= (1u << 0),
> +
> +	/* supports passing on the FUA bit */
> +	BLK_FEAT_FUA		= (1u << 1),
> +};
> +static inline bool blk_queue_write_cache(struct request_queue *q)
> +{
> +	return (q->limits.features & BLK_FEAT_WRITE_CACHE) &&
> +		(q->limits.flags & BLK_FLAGS_WRITE_CACHE_DISABLED);

Hmm, shouldn't this be:

	!(q->limits.flags & BLK_FLAGS_WRITE_CACHE_DISABLED)

?

> +}
> +
>  static inline bool bdev_write_cache(struct block_device *bdev)
>  {
> -	return test_bit(QUEUE_FLAG_WC, &bdev_get_queue(bdev)->queue_flags);
> +	return blk_queue_write_cache(bdev_get_queue(bdev));
>  }
>  
>  static inline bool bdev_fua(struct block_device *bdev)
>  {
> -	return test_bit(QUEUE_FLAG_FUA, &bdev_get_queue(bdev)->queue_flags);
> +	return bdev_get_queue(bdev)->limits.features & BLK_FEAT_FUA;
>  }
>  
>  static inline bool bdev_nowait(struct block_device *bdev)

-- 
Damien Le Moal
Western Digital Research