On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote: > Zone write plugging implements a per-zone "plug" for write operations to > tightly control the submission and execution order of writes to > sequential write required zones of a zoned block device. Per-zone > plugging of writes guarantees that at any time at most one write request > per zone is in flight. This mechanism is intended to replace zone write > locking which is controlled at the scheduler level and implemented only > by mq-deadline. > > Unlike zone write locking which operates on requests, zone write > plugging operates on BIOs. A zone write plug is simply a BIO list that > is atomically manipulated using a spinlock and a kblockd submission > work. A write BIO to a zone is "plugged" to delay its execution if a > write BIO for the same zone was already issued, that is, if a write > request for the same zone is being executed. The next plugged BIO is > unplugged and issued once the write request completes. > > This mechanism has several advantages: > - It untangles zone write ordering from block IO schedulers. This allows > removing the restriction on using only mq-deadline for zoned block > devices. Any block IO scheduler, including "none", can be used. > - Zone write plugging operates on BIOs instead of requests. Plugged > BIOs waiting for execution thus do not hold scheduling tags and so > do not prevent other BIOs from proceeding (reads or writes to other > zones). Depending on the workload, this can significantly improve > the device use and performance. > - Both blk-mq (request) based zoned devices and BIO-based devices (e.g. > device mapper) can use zone write plugging. It is mandatory for the > former but optional for the latter: BIO-based drivers can use zone > write plugging to implement write ordering guarantees, or they can > implement their own if needed. > - The code is less invasive in the block layer and is mostly limited to > blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and > bio.c. > > Zone write plugging is implemented using struct blk_zone_wplug. This > structure includes a spinlock, a BIO list and a work structure to > handle the submission of plugged BIOs. > > Plugging of zone write BIOs is done using the function > blk_zone_write_plug_bio() which returns false if a BIO execution does > not need to be delayed and true otherwise. This function is called > from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs > spanning multiple zones which would cause mishandling of zone write > plugging. This enables zone write plugging by default for any blk-mq > request-based block device. BIO-based device drivers can also use zone > write plugging by explicitly calling blk_zone_write_plug_bio() in their > ->submit_bio method. For such devices, the driver must ensure that a > BIO passed to blk_zone_write_plug_bio() is already split and not > straddling zone boundaries. > > Only write and write zeroes BIOs are plugged. Zone write plugging does > not introduce any significant overhead for other operations. A BIO that > is being handled through zone write plugging is flagged using the new > BIO flag BIO_ZONE_WRITE_PLUGGING. A request executing such a BIO is > flagged with the new RQF_ZONE_WRITE_PLUGGING flag. > The completion of BIOs and requests carrying these flags triggers > calls to the functions blk_zone_write_plug_bio_endio() and > blk_zone_write_plug_complete_request(), respectively.
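A question on the BIO-based driver usage described above: is the expected pattern in a driver's ->submit_bio method something like the following? This is just my understanding, as an untested sketch; the foo_* names are made up and I am guessing that 0 is the right nr_segs to pass, since there is no hardware-limit split at this level:

	static void foo_submit_bio(struct bio *bio)
	{
		/*
		 * The driver must already have split @bio so that it does
		 * not straddle a zone boundary.
		 */
		if (blk_zone_write_plug_bio(bio, 0))
			return;	/* Plugged: the plug work resubmits it later. */

		/* Not plugged, or unplugged and resubmitted: issue it. */
		foo_issue_bio(bio);
	}

If that is the intent, it may be worth spelling it out in the kernel-doc of blk_zone_write_plug_bio().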
The latter function is used to > trigger submission of the next plugged BIO using the zone plug work. > blk_zone_write_plug_bio_endio() does the same for BIO-based devices. > This ensures that at any time, at most one request (blk-mq devices) or > one BIO (BIO-based devices) is being executed for any zone. The > handling of zone write plugs using a per-zone spinlock maximizes > parallelism and device usage by allowing multiple zones to be written > simultaneously without lock contention. > > Zone write plugging ignores flush BIOs without data. However, any flush > BIO that has data is always plugged so that the write part of the flush > sequence is serialized with other regular writes. > > Given that any BIO handled through zone write plugging will be the only > BIO in flight for the target zone when it is executed, an unplugged and > submitted BIO has no chance of successfully merging with > plugged requests or requests in the scheduler. To overcome this > potential performance loss, blk_mq_submit_bio() calls the function > blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs > with the one just unplugged. Successful merging is signaled using > blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). > Furthermore, to avoid recalculating the number of segments of plugged > BIOs to attempt merging, the number of segments of a plugged BIO is > saved using the new struct bio field __bi_nr_segments. To avoid growing > the size of struct bio, this field is added as a union with the > bi_cookie field. This is safe to do as polling is always disabled for > plugged BIOs. > > When BIOs are plugged in a zone write plug, the device request queue > usage counter is always incremented. This reference is kept and reused > when the plugged BIO is unplugged and submitted again using > submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already > flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds > directly to allocating a new request for the BIO, re-using the usage > reference count taken when the BIO was plugged. This extra reference > count is dropped in blk_zone_write_plug_attempt_merge() for any plugged > BIO that is successfully merged. Given that BIO-based devices will not > take this path, the extra reference is dropped when a plugged BIO is > unplugged and submitted. > > To match the new data structures used for zoned disks, the function > disk_free_zone_bitmaps() is renamed to the more generic > disk_free_zone_resources(). > > This commit contains contributions from Christoph Hellwig <hch@xxxxxx>. > > Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx> > --- > block/bio.c | 7 + > block/blk-merge.c | 11 + > block/blk-mq.c | 28 +++ > block/blk-zoned.c | 408 +++++++++++++++++++++++++++++++++++++- > block/blk.h | 32 ++- > block/genhd.c | 2 +- > include/linux/blk-mq.h | 2 + > include/linux/blk_types.h | 8 +- > include/linux/blkdev.h | 8 + > 9 files changed, 496 insertions(+), 10 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index b9642a41f286..c8b0f7e8c713 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio) > if (!bio_integrity_endio(bio)) > return; > > + /* > + * For BIOs handled through a zone write plug, signal the end of the > + * BIO to the zone write plug to submit the next plugged BIO.
> + */ > + if (bio_zone_write_plugging(bio)) > + blk_zone_write_plug_bio_endio(bio); > + > rq_qos_done_bio(bio); > > if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) { > diff --git a/block/blk-merge.c b/block/blk-merge.c > index a1ef61b03e31..2b5489cd9c65 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio, > blkcg_bio_issue_init(split); > bio_chain(split, bio); > trace_block_split(split, bio->bi_iter.bi_sector); > + WARN_ON_ONCE(bio_zone_write_plugging(bio)); > submit_bio_noacct(bio); > return split; > } > @@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req, > > blk_update_mixed_merge(req, bio, false); > > + if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) > + blk_zone_write_plug_bio_merged(bio); > + > req->biotail->bi_next = bio; > req->biotail = bio; > req->__data_len += bio->bi_iter.bi_size; > @@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req, > { > const blk_opf_t ff = bio_failfast(bio); > > + /* > + * A front merge for zone writes can happen only if the user submitted > + * writes out of order. Do not attempt such a merge and let the write fail. > + */ > + if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) > + return BIO_MERGE_FAILED; > + > if (!ll_front_merge_fn(req, bio, nr_segs)) > return BIO_MERGE_FAILED; > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index f02e486a02ae..aa49bebf1199 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req) > bio = next; > } while (bio); > > + if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) > + blk_zone_write_plug_complete_request(req); > + > /* > * Reset counters so that the request stacking driver > * can find how many bytes remain in the request > @@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error, > * completely done > */ > if (!req->bio) { > + if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) > + blk_zone_write_plug_complete_request(req); > + > /* > * Reset counters so that the request stacking driver > * can find how many bytes remain in the request > @@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio) > struct request *rq; > blk_status_t ret; > > + /* > + * A BIO that was released from a zone write plug has already been > + * through the preparation in this function, already holds a reference > + * on the queue usage counter, and is the only write BIO in flight for > + * the target zone. Go straight to allocating a request for it.
> + */ > + if (bio_zone_write_plugging(bio)) { > + nr_segs = bio->__bi_nr_segments; > + goto new_request; > + } > + > bio = blk_queue_bounce(bio, q); > bio_set_ioprio(bio); > > @@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio) > if (blk_mq_attempt_bio_merge(q, bio, nr_segs)) > goto queue_exit; > > + if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs)) > + goto queue_exit; > + > if (!rq) { > +new_request: > rq = blk_mq_get_new_requests(q, plug, bio, nr_segs); > if (unlikely(!rq)) > goto queue_exit; > @@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio) > > ret = blk_crypto_rq_get_keyslot(rq); > if (ret != BLK_STS_OK) { > + bool zwplugging = bio_zone_write_plugging(bio); > + > bio->bi_status = ret; > bio_endio(bio); > + if (zwplugging) > + blk_zone_write_plug_complete_request(rq); > blk_mq_free_request(rq); > return; > } > @@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio) > if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq)) > return; > > + if (bio_zone_write_plugging(bio)) > + blk_zone_write_plug_attempt_merge(rq); > + > if (plug) { > blk_add_rq_to_plug(plug, rq); > return; > diff --git a/block/blk-zoned.c b/block/blk-zoned.c > index d343e5756a9c..f6d4f511b664 100644 > --- a/block/blk-zoned.c > +++ b/block/blk-zoned.c > @@ -7,11 +7,11 @@ > * > * Copyright (c) 2016, Damien Le Moal > * Copyright (c) 2016, Western Digital > + * Copyright (c) 2024, Western Digital Corporation or its affiliates. > */ > > #include <linux/kernel.h> > #include <linux/module.h> > -#include <linux/rbtree.h> > #include <linux/blkdev.h> > #include <linux/blk-mq.h> > #include <linux/mm.h> > @@ -19,6 +19,7 @@ > #include <linux/sched/mm.h> > > #include "blk.h" > +#include "blk-mq-sched.h" > > #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name > static const char *const zone_cond_name[] = { > @@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = { > }; > #undef ZONE_COND_NAME > > +/* > + * Per-zone write plug. > + */ > +struct blk_zone_wplug { > + spinlock_t lock; > + unsigned int flags; > + struct bio_list bio_list; > + struct work_struct bio_work; > +}; > + > +/* > + * Zone write plug flags bits: > + * - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes > + * to these zones are never plugged. > + * - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged, > + * that is, that write BIOs are being throttled because a write BIO is > + * already being executed or the zone write plug BIO list is not empty. > + */ > +#define BLK_ZONE_WPLUG_CONV (1U << 0) > +#define BLK_ZONE_WPLUG_PLUGGED (1U << 1) BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so it looks like this flag isn't necessary. > + > /** > * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX. > * @zone_cond: BLK_ZONE_COND_XXX.
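One more note on struct blk_zone_wplug above: there is one instance per zone, so back-of-the-envelope on 64-bit without lock or work-struct debugging:

	sizeof(spinlock_t)		/*  4 bytes */
	+ sizeof(unsigned int)		/*  4 bytes */
	+ sizeof(struct bio_list)	/* 16 bytes (head + tail pointers) */
	+ sizeof(struct work_struct)	/* 32 bytes */

that is roughly 56-64 bytes per zone once padded, so a high-capacity SMR drive with on the order of 100K zones needs a few MB per disk for the array. Probably acceptable, just worth keeping in mind if this structure grows.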
> @@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode, > return ret; > } > > -void disk_free_zone_bitmaps(struct gendisk *disk) > +#define blk_zone_wplug_lock(zwplug, flags) \ > + spin_lock_irqsave(&zwplug->lock, flags) > + > +#define blk_zone_wplug_unlock(zwplug, flags) \ > + spin_unlock_irqrestore(&zwplug->lock, flags) > + > +static inline void blk_zone_wplug_bio_io_error(struct bio *bio) > +{ > + struct request_queue *q = bio->bi_bdev->bd_disk->queue; > + > + bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING); > + bio_io_error(bio); > + blk_queue_exit(q); > +} > + > +static int blk_zone_wplug_abort(struct gendisk *disk, > + struct blk_zone_wplug *zwplug) > +{ > + struct bio *bio; > + int nr_aborted = 0; > + > + while ((bio = bio_list_pop(&zwplug->bio_list))) { > + blk_zone_wplug_bio_io_error(bio); > + nr_aborted++; > + } > + > + return nr_aborted; > +} > + > +/* > + * Return the zone write plug for a sector in a sequential write required zone. > + * Given that conventional zones have no write ordering constraints, NULL is > + * returned for sectors in conventional zones, to indicate that zone write > + * plugging is not needed. > + */ > +static inline struct blk_zone_wplug * > +disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector) > +{ > + struct blk_zone_wplug *zwplug; > + > + if (WARN_ON_ONCE(!disk->zone_wplugs)) > + return NULL; > + > + zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)]; > + if (zwplug->flags & BLK_ZONE_WPLUG_CONV) > + return NULL; > + return zwplug; > +} > + > +static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio) > +{ > + return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk, > + bio->bi_iter.bi_sector); > +} > + > +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, > + struct bio *bio, unsigned int nr_segs) > +{ > + /* > + * Keep a reference on the BIO's request queue usage counter. This > + * reference will be dropped either if the BIO fails or after it is > + * issued and completes. > + */ > + percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter); It is fragile to take nested references on the usage counter, and the same goes for grabbing and releasing it from different contexts or even different functions; it would be much better to just let the block layer maintain it. From patch 23's change: + * Zoned block device information. Reads of this information must be + * protected with blk_queue_enter() / blk_queue_exit(). Modifying this Any time there is an in-flight bio, the block device is open, so both the gendisk and the request_queue are live; I am not sure this .q_usage_counter protection is needed. + * information is only allowed while no requests are being processed. + * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue(). */ > + > + /* > + * The BIO is being plugged and thus will have to wait for the on-going > + * write and for all other writes already plugged. So polling makes > + * no sense. > + */ > + bio_clear_polled(bio); > + > + /* > + * Reuse the poll cookie field to store the number of segments when > + * split to the hardware limits. > + */ > + bio->__bi_nr_segments = nr_segs; > + > + /* > + * We always receive BIOs after they are split and ready to be issued. > + * The block layer passes the parts of a split BIO in order, and the > + * user must also issue writes sequentially. So simply add the new BIO > + * at the tail of the list to preserve the sequential write order.
> + */ > + bio_list_add(&zwplug->bio_list, bio); > +} > + > +/* > + * Called from bio_attempt_back_merge() when a BIO was merged with a request. > + */ > +void blk_zone_write_plug_bio_merged(struct bio *bio) > +{ > + bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING); > +} > + > +/* > + * Attempt to merge plugged BIOs with a newly formed request of a BIO that went > + * through zone write plugging (either a new BIO or one that was unplugged). > + */ > +void blk_zone_write_plug_attempt_merge(struct request *req) > +{ > + struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio); > + sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req); > + struct request_queue *q = req->q; > + unsigned long flags; > + struct bio *bio; > + > + /* > + * Completion of this request needs to be handled with > + * blk_zone_write_plug_complete_request(). > + */ > + req->rq_flags |= RQF_ZONE_WRITE_PLUGGING; > + > + if (blk_queue_nomerges(q)) > + return; > + > + /* > + * Walk through the list of plugged BIOs to check if they can be merged > + * into the back of the request. > + */ > + blk_zone_wplug_lock(zwplug, flags); > + while ((bio = bio_list_peek(&zwplug->bio_list))) { > + if (bio->bi_iter.bi_sector != req_back_sector || > + !blk_rq_merge_ok(req, bio)) > + break; > + > + WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES && > + !bio->__bi_nr_segments); > + > + bio_list_pop(&zwplug->bio_list); > + if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) != > + BIO_MERGE_OK) { > + bio_list_add_head(&zwplug->bio_list, bio); > + break; > + } > + > + /* > + * Drop the extra reference on the queue usage we got when > + * plugging the BIO. > + */ > + blk_queue_exit(q); > + > + req_back_sector += bio_sectors(bio); > + } > + blk_zone_wplug_unlock(zwplug, flags); > +} > + > +static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) > +{ > + struct blk_zone_wplug *zwplug; > + unsigned long flags; > + > + /* > + * BIOs must be fully contained within a zone so that we use the correct > + * zone write plug for the entire BIO. For blk-mq devices, the block > + * layer should already have done any splitting required to ensure this > + * and this BIO should thus not be straddling zone boundaries. For > + * BIO-based devices, it is the responsibility of the driver to split > + * the bio before submitting it. > + */ > + if (WARN_ON_ONCE(bio_straddle_zones(bio))) { > + bio_io_error(bio); > + return true; > + } > + > + zwplug = bio_lookup_zone_wplug(bio); > + if (!zwplug) > + return false; > + > + blk_zone_wplug_lock(zwplug, flags); > + > + /* Indicate that this BIO is being handled using zone write plugging. */ > + bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING); > + > + /* > + * If the zone is already plugged, add the BIO to the plug BIO list. > + * Otherwise, plug and let the BIO execute. > + */ > + if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) { > + blk_zone_wplug_add_bio(zwplug, bio, nr_segs); > + blk_zone_wplug_unlock(zwplug, flags); > + return true; > + } > + > + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED; > + > + blk_zone_wplug_unlock(zwplug, flags); > + > + return false; > +} > + > +/** > + * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging > + * @bio: The BIO being submitted > + * > + * Handle write and write zeroes operations using zone write plugging. > + * Return true whenever @bio execution needs to be delayed through the zone > + * write plug. Otherwise, return false to let the submission path process > + * @bio normally.
> + */ > +bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs) > +{ > + if (!bio->bi_bdev->bd_disk->zone_wplugs) > + return false; > + > + /* > + * If the BIO already has the plugging flag set, then it was already > + * handled through this path and this is a submission from the zone > + * plug bio submit work. > + */ > + if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING)) > + return false; > + > + /* > + * We do not need to do anything special for empty flush BIOs, e.g. > + * BIOs such as those issued by blkdev_issue_flush(). This is because > + * it is the responsibility of the user to first wait for the completion > + * of write operations for flush to have any effect on the persistence > + * of the written data. > + */ > + if (op_is_flush(bio->bi_opf) && !bio_sectors(bio)) > + return false; > + > + /* > + * Regular writes and write zeroes need to be handled through the target > + * zone write plug. This includes writes with REQ_FUA | REQ_PREFLUSH > + * which may need to go through the flush machinery depending on the > + * target device capabilities. Plugging such writes is fine as the flush > + * machinery operates at the request level, below the plug, and > + * completion of the flush sequence will go through the regular BIO > + * completion, which will handle zone write plugging. > + */ > + switch (bio_op(bio)) { > + case REQ_OP_WRITE: > + case REQ_OP_WRITE_ZEROES: > + return blk_zone_wplug_handle_write(bio, nr_segs); > + default: > + return false; > + } > + > + return false; > +} > +EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio); > + > +static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug) > +{ > + unsigned long flags; > + > + blk_zone_wplug_lock(zwplug, flags); > + > + /* Schedule submission of the next plugged BIO if we have one. */ > + if (!bio_list_empty(&zwplug->bio_list)) > + kblockd_schedule_work(&zwplug->bio_work); > + else > + zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED; > + > + blk_zone_wplug_unlock(zwplug, flags); > +} > + > +void blk_zone_write_plug_bio_endio(struct bio *bio) > +{ > + /* Make sure we do not see this BIO again by clearing the plug flag. */ > + bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING); > + > + /* > + * For BIO-based devices, blk_zone_write_plug_complete_request() > + * is not called. So we need to schedule execution of the next > + * plugged BIO here. > + */ > + if (bio->bi_bdev->bd_has_submit_bio) { > + struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio); > + > + blk_zone_write_plug_unplug_bio(zwplug); > + } > +} > + > +void blk_zone_write_plug_complete_request(struct request *req) > +{ > + struct gendisk *disk = req->q->disk; > + struct blk_zone_wplug *zwplug = > + disk_lookup_zone_wplug(disk, req->__sector); > + > + req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING; > + > + blk_zone_write_plug_unplug_bio(zwplug); > +} > + > +static void blk_zone_wplug_bio_work(struct work_struct *work) > +{ > + struct blk_zone_wplug *zwplug = > + container_of(work, struct blk_zone_wplug, bio_work); > + unsigned long flags; > + struct bio *bio; > + > + /* > + * Unplug and submit the next plugged BIO. If we do not have any, clear > + * the plugged flag.
> + */ > + blk_zone_wplug_lock(zwplug, flags); > + > + bio = bio_list_pop(&zwplug->bio_list); > + if (!bio) { > + zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED; > + blk_zone_wplug_unlock(zwplug, flags); > + return; > + } > + > + blk_zone_wplug_unlock(zwplug, flags); > + > + /* > + * blk-mq devices will reuse the reference on the request queue usage > + * we took when the BIO was plugged, but the submission path for > + * BIO-based devices will not do that. So drop this reference here. > + */ > + if (bio->bi_bdev->bd_has_submit_bio) > + blk_queue_exit(bio->bi_bdev->bd_disk->queue); But I don't see where this reference is reused for blk-mq in this patch. Can you point it out?
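IIUC, reusing that reference means the resubmitted BIO must not go through the normal bio_queue_enter() again, i.e. something like this untested sketch in the submission path:

	/*
	 * A BIO unplugged from a zone write plug already holds a reference
	 * on the queue usage counter, taken when it was plugged, so do not
	 * take another one here.
	 */
	if (!bio_zone_write_plugging(bio) && unlikely(bio_queue_enter(bio)))
		return;

Or is the early 'goto new_request' in blk_mq_submit_bio() what jumps over the queue enter? If so, a comment there would help.

Thanks,
Ming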