On 2020/05/22 21:18, Coly Li wrote:
> Fortunately the zoned device management interface is designed to use
> bio operations, the bcache code just needs to do a little change to
> handle the zone management bios.
>
> Bcache driver needs to handle 5 zone management bios for now, all of
> them are handled by cached_dev_nodata() since their bi_size values
> are 0.
> - REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE, REQ_OP_ZONE_FINISH:
>   The LBA conversion is already done in cached_dev_make_request(),
>   just simply send such zone management bios to backing device is
>   enough.
> - REQ_OP_ZONE_RESET, REQ_OP_ZONE_RESET_ALL:
>   If thre is cached data (clean or dirty) on cache device covered
>   by the LBA range of resetting zone(s), such cached data must be

The first part of the sentence is a little unclear... Did you mean:

  If the cache device holds data of the target zones, cache
  invalidation is needed before forwarding...

>   invalidate from the cache device before forwarding the zone reset
>   bios to the backing device. Otherwise data inconsistency or further
>   corruption may happen from the view of bcache device.
>   The difference of REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_RESET is
>   when the zone management bio is to reset all zones, send all zones
>   number reported by the bcache device (s->d->disk->queue->nr_zones)
>   into bch_data_invalidate_zones() as parameter 'size_t nr_zones'. If
>   only reset a single zone, just set 1 as 'size_t nr_zones'.
>
> By supporting zone managememnt bios with this patch, now a bcache device

s/managememnt/management

> can be formatted by zonefs, and the zones can be reset by truncate -s 0
> on the zone files under seq/ directory. Supporting REQ_OP_ZONE_RESET_ALL
> makes the whole disk zones reset much faster. In my testing on a 14TB
> zoned SMR hard drive, 1 by 1 resetting 51632 seq zones by sending
> REQ_OP_ZONE_RESET bios takes 4m34s, by sending a single
> REQ_OP_ZONE_RESET_ALL bio takes 12s, which is 22x times faster.
>
> REQ_OP_ZONE_RESET_ALL is supported by bcache only when the backing device
> supports it. So the bcache queue flag is set QUEUE_FLAG_ZONE_RESETALL on
> only when queue of backing device has such flag, which can be checked by
> calling blk_queue_zone_resetall() on backing device's request queue.
>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Cc: Damien Le Moal <damien.lemoal@xxxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxxx>
> Cc: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> ---
> Changelog:
> v2: an improved version without any generic block layer change.
> v1: initial version depends on other generic block layer changes.
>
>  drivers/md/bcache/request.c | 99 ++++++++++++++++++++++++++++++++++++-
>  drivers/md/bcache/super.c   |  2 +
>  2 files changed, 100 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 34f63da2338d..700b8ab5dec9 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1052,18 +1052,115 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
>          continue_at(cl, cached_dev_write_complete, NULL);
>  }
>
> +/*
> + * Invalidate the LBA range on cache device which is covered by the
> + * the resetting zones.
> + */
> +static int bch_data_invalidate_zones(struct closure *cl,
> +                                     size_t zone_sector,
> +                                     size_t nr_zones)

No need for the line break after the second argument.
> +{
> +        struct search *s = container_of(cl, struct search, cl);
> +        struct data_insert_op *iop = &s->iop;
> +        int max_keys, free_keys;
> +        size_t start = zone_sector;
> +        int ret;
> +
> +        max_keys = (block_bytes(iop->c) - sizeof(struct jset)) /
> +                   sizeof(uint64_t);
> +        bch_keylist_init(&iop->insert_keys);
> +        ret = bch_keylist_realloc(&iop->insert_keys, max_keys, iop->c);
> +        if (ret < 0)
> +                return -ENOMEM;
> +
> +        while (nr_zones-- > 0) {
> +                atomic_t *journal_ref = NULL;
> +                size_t size = s->d->disk->queue->limits.chunk_sectors;
> +more_keys:
> +                bch_keylist_reset(&iop->insert_keys);
> +                free_keys = max_keys;
> +
> +                while (size > 0 && free_keys >= 2) {
> +                        size_t sectors = min_t(size_t, size,
> +                                               1U << (KEY_SIZE_BITS - 1));
> +
> +                        bch_keylist_add(&iop->insert_keys,
> +                                        &KEY(iop->inode, start, sectors));
> +                        start += sectors;
> +                        size -= sectors;
> +                        free_keys -= 2;
> +                }
> +
> +                /* Insert invalidate keys into btree */
> +                journal_ref = bch_journal(iop->c, &iop->insert_keys, NULL);
> +                if (!journal_ref) {
> +                        ret = -EIO;
> +                        break;
> +                }
> +
> +                ret = bch_btree_insert(iop->c,
> +                                       &iop->insert_keys, journal_ref, NULL);
> +                atomic_dec_bug(journal_ref);
> +                if (ret < 0)
> +                        break;
> +
> +                if (size)
> +                        goto more_keys;
> +        }
> +
> +        bch_keylist_free(&iop->insert_keys);
> +
> +        return ret;
> +}
> +
>  static void cached_dev_nodata(struct closure *cl)
>  {
>          struct search *s = container_of(cl, struct search, cl);
>          struct bio *bio = &s->bio.bio;
> +        int nr_zones = 1;
>
>          if (s->iop.flush_journal)
>                  bch_journal_meta(s->iop.c, cl);
>
> -        /* If it's a flush, we send the flush to the backing device too */
> +        if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL ||
> +            bio_op(bio) == REQ_OP_ZONE_RESET) {
> +                int err = 0;
> +                /*
> +                 * If this is REQ_OP_ZONE_RESET_ALL bio, cached data
> +                 * covered by all zones should be invalidate from the

s/invalidate/invalidated

> +                 * cache device.
> +                 */
> +                if (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)
> +                        nr_zones = s->d->disk->queue->nr_zones;

Note: a REQ_OP_ZONE_RESET BIO sent to a conventional zone will be
failed by the disk, as this is not allowed by the ZBC/ZAC specs. So
invalidating the cache for conventional zones is not really necessary.
But as an initial support, I think this is fine. This can be optimized
later.

> +
> +                err = bch_data_invalidate_zones(
> +                        cl, bio->bi_iter.bi_sector, nr_zones);
> +
> +                if (err < 0) {
> +                        s->iop.status = errno_to_blk_status(err);
> +                        /* debug, should be removed before post patch */
> +                        bio->bi_status = BLK_STS_TIMEOUT;

You did not remove it :)

> +                        /* set by bio_cnt_set() in do_bio_hook() */
> +                        bio_put(bio);
> +                        /*
> +                         * Invalidate cached data fails, don't send
> +                         * the zone reset bio to backing device and
> +                         * return failure. Otherwise potential data
> +                         * corruption on bcache device may happen.
> +                         */
> +                        goto continue_bio_complete;
> +                }
> +
> +        }
> +
> +        /*
> +         * For flush or zone management bios, of cause
> +         * they should be sent to backing device too.
> +         */
>          bio->bi_end_io = backing_request_endio;
>          closure_bio_submit(s->iop.c, bio, cl);

You cannot submit a REQ_OP_ZONE_RESET_ALL to the backing dev here, at
least not unconditionally. The problem is that if the backing dev does
not have any conventional zones at its LBA 0, REQ_OP_ZONE_RESET_ALL
will really reset all zones, including the first zone of the device
that contains the bcache super block. So you will lose/destroy the
bcache setup. You probably did not notice this because your test drive
has conventional zones at LBA 0, and reset all does not have any effect
on conventional zones...

The easy way to deal with this is to not set the
QUEUE_FLAG_ZONE_RESETALL flag for the bcache device queue. If it is
not set, the block layer will automatically issue only single zone
REQ_OP_ZONE_RESET BIOs. That is slower, yes, but that cannot be
avoided when the backend disk does not have any conventional zones.
The QUEUE_FLAG_ZONE_RESETALL flag can be kept if the backend disk's
first zone, the one containing the bcache super block, is a
conventional zone.

>
> +continue_bio_complete:
>          continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index d5da7ad5157d..70c31950ec1b 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1390,6 +1390,8 @@ static int bch_cached_dev_zone_init(struct cached_dev *dc)
>           */
>          d_q->nr_zones = b_q->nr_zones -
>                  (dc->sb.data_offset / d_q->limits.chunk_sectors);
> +        if (blk_queue_zone_resetall(b_q))
> +                blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, d_q);

See above comment.

>  }
>
>          return 0;
>

-- 
Damien Le Moal
Western Digital Research