On 2020/5/11 17:06, Damien Le Moal wrote: > On 2020/05/11 1:52, Coly Li wrote: >> This is a very basic zoned device support. With this patch, bcache >> device is able to, >> - Export zoned device attribution via sysfs >> - Response report zones request, e.g. by command 'blkzone report' >> But the bcache device is still NOT able to, >> - Response any zoned device management request or IOCTL command >> >> Here are the testings I have done, >> - read /sys/block/bcache0/queue/zoned, content is 'host-managed' >> - read /sys/block/bcache0/queue/nr_zones, content is number of zones >> including all zone types. >> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size >> in sectors. >> - run 'blkzone report /dev/bcache0', all zones information displayed. >> - run 'blkzone reset /dev/bcache0', operation is rejected with error >> information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed: >> Operation not supported" > > Weird. If report zones works, reset should also, modulo the zone size problem > for the first zone. You may get EINVAL, but not ENOTTY. > This is on purpose. Because reset the underlying zones layout may corrupt the bcache mapping between cache device and backing device. So such management commands are rejected. >> - Sequential writes by dd, I can see some zones' write pointer 'wptr' >> values updated. >> >> All of these are very basic testings, if you have better testing >> tools or cases, please offer me hint. >> >> Thanks in advance for your review and comments. >> >> Signed-off-by: Coly Li <colyli@xxxxxxx> >> CC: Hannes Reinecke <hare@xxxxxxxx> >> CC: Damien Le Moal <damien.lemoal@xxxxxxx> >> CC: Johannes Thumshirn <johannes.thumshirn@xxxxxxx> >> --- >> drivers/md/bcache/bcache.h | 10 +++++ >> drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++ >> drivers/md/bcache/super.c | 51 +++++++++++++++++++++---- >> 3 files changed, 128 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h >> index 74a9849ea164..0d298b48707f 100644 >> --- a/drivers/md/bcache/bcache.h >> +++ b/drivers/md/bcache/bcache.h >> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1); >> struct search; >> struct btree; >> struct keybuf; >> +struct bch_report_zones_args; >> >> struct keybuf_key { >> struct rb_node node; >> @@ -277,6 +278,8 @@ struct bcache_device { >> struct bio *bio, unsigned int sectors); >> int (*ioctl)(struct bcache_device *d, fmode_t mode, >> unsigned int cmd, unsigned long arg); >> + int (*report_zones)(struct bch_report_zones_args *args, >> + unsigned int nr_zones); >> }; >> >> struct io { >> @@ -748,6 +751,13 @@ struct bbio { >> struct bio bio; >> }; >> >> +struct bch_report_zones_args { >> + struct bcache_device *bcache_device; >> + sector_t sector; >> + void *orig_data; >> + report_zones_cb orig_cb; >> +}; >> + >> #define BTREE_PRIO USHRT_MAX >> #define INITIAL_PRIO 32768U >> >> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c >> index 71a90fbec314..bd50204caac7 100644 >> --- a/drivers/md/bcache/request.c >> +++ b/drivers/md/bcache/request.c >> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio) >> } >> } >> >> + /* >> + * zone management request may change the data layout and content >> + * implicitly on backing device, which introduces unacceptible > > s/unacceptible/unacceptable > >> + * inconsistency between the backing device and the cache device. >> + * Therefore all zone management related request will be denied here. >> + */ >> + if (op_is_zone_mgmt(bio->bi_opf)) { >> + pr_err_ratelimited("zoned device management request denied."); >> + bio->bi_status = BLK_STS_NOTSUPP; >> + bio_endio(bio); >> + return BLK_QC_T_NONE; > > OK. That explains the operation not supported for "blkzone reset". I do not > think this is good. With this, the drive will be writable only exactly once, > without the possibility for the user to reset any zone and rewrite them. All > zone compliant file systems will fail (f2fs, zonefs, btrfs coming). > > At the very least REQ_OP_ZONE_RESET should be allowed and trigger an > invalidation on the cache device of all blocks of the zone that are cached. > Copied. Then I should find a method to invalid the cached data on SSD in a proper way. > Note: the zone management operations will need to be remapped like report zones, > but in reverse: the BIO start sector must be increase by the zone size. > Thanks for the hint :-) >> + } >> + >> generic_start_io_acct(q, >> bio_op(bio), >> bio_sectors(bio), >> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode, >> if (dc->io_disable) >> return -EIO; >> >> + /* >> + * zone management ioctl commands may change the data layout >> + * and content implicitly on backing device, which introduces >> + * unacceptible inconsistency between the backing device and >> + * the cache device. Therefore all zone management related >> + * ioctl commands will be denied here. >> + */ >> + switch (cmd) { >> + case BLKRESETZONE: >> + case BLKOPENZONE: >> + case BLKCLOSEZONE: >> + case BLKFINISHZONE: >> + return -EOPNOTSUPP; > > Same comment as above. Of note is that only BLKRESETZONE will result in cache > inconsistency if not taken care of with an invalidation of the cached blocks of > the zone on the cache device. Open and close operations have no effect on data. > Finish zone will artificially increase the zone write pointer to the end of the > zone to make it full but without actually writing any data. So there is no need > I think to change anything on the cache device in that case. > Copied. I will handle this. One thing not cleared to me is, what is the purpose of BLKOPENZONE and BLKCLOSEZONE, is it used to update writer pointer only ? Or it is also used to set a specific zone to be offline ? >> + default: >> + /* Other commands fall through*/ >> + break; >> + } >> + >> return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg); >> } >> >> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits) >> return ret; >> } >> >> +static int cached_dev_report_zones_cb(struct blk_zone *zone, >> + unsigned int idx, >> + void *data) >> +{ >> + struct bch_report_zones_args *args = data; >> + struct bcache_device *d = args->bcache_device; >> + struct cached_dev *dc = container_of(d, struct cached_dev, disk); >> + unsigned long data_offset = dc->sb.data_offset; >> + >> + if (zone->start >= data_offset) { >> + zone->start -= data_offset; >> + zone->wp -= data_offset; > > Since the zone that contains the super block has to be ignored, the remapping of > the zone start can be done unconditionally. For the write pointer remapping, you > need to handle several cases: conventional zones, full zones and read-only zones > do not have a valid write pointer value, so no updating. You also need to skip > offline zones. > Copied, I will fix here in next version. >> + } else { >> + /* >> + * Normally the first zone is conventional zone, >> + * we don't need to worry about how to maintain >> + * the write pointer. >> + * But the zone->len is special, because the >> + * sector 0 on bcache device is 8KB offset on >> + * backing device indeed. >> + */ >> + zone->len -= data_offset; > > This should not be necessary if the first zone containing the super block is > skipped entirely. > Yes, it will be. >> + } >> + >> + return args->orig_cb(zone, idx, args->orig_data); >> +} >> + >> +static int cached_dev_report_zones(struct bch_report_zones_args *args, >> + unsigned int nr_zones) >> +{ >> + struct bcache_device *d = args->bcache_device; >> + struct cached_dev *dc = container_of(d, struct cached_dev, disk); >> + sector_t sector = args->sector + dc->sb.data_offset; >> + >> + /* sector is real LBA of backing device */ >> + return blkdev_report_zones(dc->bdev, >> + sector, >> + nr_zones, >> + cached_dev_report_zones_cb, >> + args); >> +} >> + >> void bch_cached_dev_request_init(struct cached_dev *dc) >> { >> struct gendisk *g = dc->disk.disk; >> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc) >> g->queue->backing_dev_info->congested_fn = cached_dev_congested; >> dc->disk.cache_miss = cached_dev_cache_miss; >> dc->disk.ioctl = cached_dev_ioctl; >> + dc->disk.report_zones = cached_dev_report_zones; >> } >> >> /* Flash backed devices */ >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c >> index d98354fa28e3..b7d496040cee 100644 >> --- a/drivers/md/bcache/super.c >> +++ b/drivers/md/bcache/super.c >> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode, >> return d->ioctl(d, mode, cmd, arg); >> } >> >> +static int report_zones_dev(struct gendisk *disk, >> + sector_t sector, >> + unsigned int nr_zones, >> + report_zones_cb cb, >> + void *data) >> +{ >> + struct bcache_device *d = disk->private_data; >> + struct bch_report_zones_args args = { >> + .bcache_device = d, >> + .sector = sector, >> + .orig_data = data, >> + .orig_cb = cb, >> + }; >> + >> + if (d->report_zones) >> + return d->report_zones(&args, nr_zones); > > It looks to me like this is not necessary. This function could just be > cached_dev_report_zones() and you can drop the dc->disk.report_zones method. > Yes, I will fix it. >> + >> + return -EOPNOTSUPP; >> +} >> + >> static const struct block_device_operations bcache_ops = { >> .open = open_dev, >> .release = release_dev, >> .ioctl = ioctl_dev, >> + .report_zones = report_zones_dev, >> .owner = THIS_MODULE, >> }; >> >> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d) >> closure_debug_destroy(&d->cl); >> } >> >> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size, >> - sector_t sectors, make_request_fn make_request_fn) >> +static int bcache_device_init(struct cached_dev *dc, >> + struct bcache_device *d, >> + unsigned int block_size, >> + sector_t sectors, >> + make_request_fn make_request_fn) >> + >> { >> struct request_queue *q; >> const size_t max_stripes = min_t(size_t, INT_MAX, >> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size, >> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue); >> blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue); >> >> + /* >> + * If this is for backing device registration, and it is an >> + * zoned device (e.g. host-managed S.M.R. hard drive), set >> + * up zoned device attribution properly for sysfs interface. >> + */ >> + if (dc && bdev_is_zoned(dc->bdev)) { >> + q->limits.zoned = bdev_zoned_model(dc->bdev); >> + q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk); >> + q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev); > > You need to call blk_revalidate_disk_zones() here: > > blk_revalidate_disk_zones(dc->bdev->bd_disk); > > but call it *after* setting the bc device capacity to > > get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev); > > Which I think is in fact the sectors argument to this function ? > > With that information blk_revalidate_disk_zones() will check the zones and set > q->nr_zones. > > This is something I didn't notice. Yes I should revalidate teh zones. Will do it in next version. >> + } >> + >> blk_queue_write_cache(q, true, true); >> >> return 0; >> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size) >> dc->partial_stripes_expensive = >> q->limits.raid_partial_stripes_expensive; >> >> - ret = bcache_device_init(&dc->disk, block_size, >> - dc->bdev->bd_part->nr_sects - dc->sb.data_offset, >> - cached_dev_make_request); >> + ret = bcache_device_init(dc, &dc->disk, block_size, >> + dc->bdev->bd_part->nr_sects - dc->sb.data_offset, > > dc->sb.data_offset should be the device zone size (chunk sectors) to skip the > entire first zone and preserve the "all zones have the same size" constraint. > Sure, and the bcache-tools should be fixed to recognize zoned device as well. >> + cached_dev_make_request); >> if (ret) >> return ret; >> >> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u) >> >> kobject_init(&d->kobj, &bch_flash_dev_ktype); >> >> - if (bcache_device_init(d, block_bytes(c), u->sectors, >> - flash_dev_make_request)) >> + if (bcache_device_init(NULL, d, block_bytes(c), >> + u->sectors, >> + flash_dev_make_request)) >> goto err; >> >> bcache_device_attach(d, c, u - c->uuids); >> > > Thanks for your review. I will post v2 soon. Coly Li