Re: [RFC PATCH v2] bcache: export zoned information for backing device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2020/5/11 17:06, Damien Le Moal wrote:
> On 2020/05/11 1:52, Coly Li wrote:
>> This is a very basic zoned device support. With this patch, bcache
>> device is able to,
>> - Export zoned device attribution via sysfs
>> - Response report zones request, e.g. by command 'blkzone report'
>> But the bcache device is still NOT able to,
>> - Response any zoned device management request or IOCTL command
>>
>> Here are the testings I have done,
>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>   including all zone types.
>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>   in sectors.
>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>   Operation not supported"
> 
> Weird. If report zones works, reset should also, modulo the zone size problem
> for the first zone. You may get EINVAL, but not ENOTTY.
> 

This is on purpose. Because reset the underlying zones layout may
corrupt the bcache mapping between cache device and backing device. So
such management commands are rejected.



>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>   values updated.
>>
>> All of these are very basic testings, if you have better testing
>> tools or cases, please offer me hint.
>>
>> Thanks in advance for your review and comments.
>>
>> Signed-off-by: Coly Li <colyli@xxxxxxx>
>> CC: Hannes Reinecke <hare@xxxxxxxx>
>> CC: Damien Le Moal <damien.lemoal@xxxxxxx>
>> CC: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
>> ---
>>  drivers/md/bcache/bcache.h  | 10 +++++
>>  drivers/md/bcache/request.c | 74 +++++++++++++++++++++++++++++++++++++
>>  drivers/md/bcache/super.c   | 51 +++++++++++++++++++++----
>>  3 files changed, 128 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 74a9849ea164..0d298b48707f 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -221,6 +221,7 @@ BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
>>  struct search;
>>  struct btree;
>>  struct keybuf;
>> +struct bch_report_zones_args;
>>  
>>  struct keybuf_key {
>>  	struct rb_node		node;
>> @@ -277,6 +278,8 @@ struct bcache_device {
>>  			  struct bio *bio, unsigned int sectors);
>>  	int (*ioctl)(struct bcache_device *d, fmode_t mode,
>>  		     unsigned int cmd, unsigned long arg);
>> +	int (*report_zones)(struct bch_report_zones_args *args,
>> +			    unsigned int nr_zones);
>>  };
>>  
>>  struct io {
>> @@ -748,6 +751,13 @@ struct bbio {
>>  	struct bio		bio;
>>  };
>>  
>> +struct bch_report_zones_args {
>> +	struct bcache_device *bcache_device;
>> +	sector_t sector;
>> +	void *orig_data;
>> +	report_zones_cb orig_cb;
>> +};
>> +
>>  #define BTREE_PRIO		USHRT_MAX
>>  #define INITIAL_PRIO		32768U
>>  
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 71a90fbec314..bd50204caac7 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -1190,6 +1190,19 @@ blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio)
>>  		}
>>  	}
>>  
>> +	/*
>> +	 * zone management request may change the data layout and content
>> +	 * implicitly on backing device, which introduces unacceptible
> 
> s/unacceptible/unacceptable
> 
>> +	 * inconsistency between the backing device and the cache device.
>> +	 * Therefore all zone management related request will be denied here.
>> +	 */
>> +	if (op_is_zone_mgmt(bio->bi_opf)) {
>> +		pr_err_ratelimited("zoned device management request denied.");
>> +		bio->bi_status = BLK_STS_NOTSUPP;
>> +		bio_endio(bio);
>> +		return BLK_QC_T_NONE;
> 
> OK. That explains the operation not supported for "blkzone reset". I do not
> think this is good. With this, the drive will be writable only exactly once,
> without the possibility for the user to reset any zone and rewrite them. All
> zone compliant file systems will fail (f2fs, zonefs, btrfs coming).
> 
> At the very least REQ_OP_ZONE_RESET should be allowed and trigger an
> invalidation on the cache device of all blocks of the zone that are cached.
> 

Copied. Then I should find a method to invalid the cached data on SSD in
a proper way.

> Note: the zone management operations will need to be remapped like report zones,
> but in reverse: the BIO start sector must be increase by the zone size.
> 

Thanks for the hint :-)


>> +	}
>> +
>>  	generic_start_io_acct(q,
>>  			      bio_op(bio),
>>  			      bio_sectors(bio),
>> @@ -1233,6 +1246,24 @@ static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
>>  	if (dc->io_disable)
>>  		return -EIO;
>>  
>> +	/*
>> +	 * zone management ioctl commands may change the data layout
>> +	 * and content implicitly on backing device, which introduces
>> +	 * unacceptible inconsistency between the backing device and
>> +	 * the cache device. Therefore all zone management related
>> +	 * ioctl commands will be denied here.
>> +	 */
>> +	switch (cmd) {
>> +	case BLKRESETZONE:
>> +	case BLKOPENZONE:
>> +	case BLKCLOSEZONE:
>> +	case BLKFINISHZONE:
>> +		return -EOPNOTSUPP;
> 
> Same comment as above. Of note is that only BLKRESETZONE will result in cache
> inconsistency if not taken care of with an invalidation of the cached blocks of
> the zone on the cache device. Open and close operations have no effect on data.
> Finish zone will artificially increase the zone write pointer to the end of the
> zone to make it full but without actually writing any data. So there is no need
> I think to change anything on the cache device in that case.
> 

Copied. I will handle this.

One thing not cleared to me is, what is the purpose of BLKOPENZONE and
BLKCLOSEZONE, is it used to update writer pointer only ? Or it is also
used to set a specific zone to be offline ?


>> +	default:
>> +		/* Other commands fall through*/
>> +		break;
>> +	}
>> +
>>  	return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
>>  }
>>  
>> @@ -1261,6 +1292,48 @@ static int cached_dev_congested(void *data, int bits)
>>  	return ret;
>>  }
>>  
>> +static int cached_dev_report_zones_cb(struct blk_zone *zone,
>> +				      unsigned int idx,
>> +				      void *data)
>> +{
>> +	struct bch_report_zones_args *args = data;
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	unsigned long data_offset = dc->sb.data_offset;
>> +
>> +	if (zone->start >= data_offset) {
>> +		zone->start -= data_offset;
>> +		zone->wp -= data_offset;
> 
> Since the zone that contains the super block has to be ignored, the remapping of
> the zone start can be done unconditionally. For the write pointer remapping, you
> need to handle several cases: conventional zones, full zones and read-only zones
> do not have a valid write pointer value, so no updating. You also need to skip
> offline zones.
> 

Copied, I will fix here in next version.


>> +	} else {
>> +		/*
>> +		 * Normally the first zone is conventional zone,
>> +		 * we don't need to worry about how to maintain
>> +		 * the write pointer.
>> +		 * But the zone->len is special, because the
>> +		 * sector 0 on bcache device is 8KB offset on
>> +		 * backing device indeed.
>> +		 */
>> +		zone->len -= data_offset;
> 
> This should not be necessary if the first zone containing the super block is
> skipped entirely.
> 

Yes, it will be.


>> +	}
>> +
>> +	return args->orig_cb(zone, idx, args->orig_data);
>> +}
>> +
>> +static int cached_dev_report_zones(struct bch_report_zones_args *args,
>> +				   unsigned int nr_zones)
>> +{
>> +	struct bcache_device *d = args->bcache_device;
>> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
>> +	sector_t sector = args->sector + dc->sb.data_offset;
>> +
>> +	/* sector is real LBA of backing device */
>> +	return blkdev_report_zones(dc->bdev,
>> +				   sector,
>> +				   nr_zones,
>> +				   cached_dev_report_zones_cb,
>> +				   args);
>> +}
>> +
>>  void bch_cached_dev_request_init(struct cached_dev *dc)
>>  {
>>  	struct gendisk *g = dc->disk.disk;
>> @@ -1268,6 +1341,7 @@ void bch_cached_dev_request_init(struct cached_dev *dc)
>>  	g->queue->backing_dev_info->congested_fn = cached_dev_congested;
>>  	dc->disk.cache_miss			= cached_dev_cache_miss;
>>  	dc->disk.ioctl				= cached_dev_ioctl;
>> +	dc->disk.report_zones			= cached_dev_report_zones;
>>  }
>>  
>>  /* Flash backed devices */
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d98354fa28e3..b7d496040cee 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -679,10 +679,31 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
>>  	return d->ioctl(d, mode, cmd, arg);
>>  }
>>  
>> +static int report_zones_dev(struct gendisk *disk,
>> +			    sector_t sector,
>> +			    unsigned int nr_zones,
>> +			    report_zones_cb cb,
>> +			    void *data)
>> +{
>> +	struct bcache_device *d = disk->private_data;
>> +	struct bch_report_zones_args args = {
>> +		.bcache_device = d,
>> +		.sector = sector,
>> +		.orig_data = data,
>> +		.orig_cb = cb,
>> +	};
>> +
>> +	if (d->report_zones)
>> +		return d->report_zones(&args, nr_zones);
> 
> It looks to me like this is not necessary. This function could just be
> cached_dev_report_zones() and you can drop the dc->disk.report_zones method.
> 

Yes, I will fix it.


>> +
>> +	return -EOPNOTSUPP;
>> +}
>> +
>>  static const struct block_device_operations bcache_ops = {
>>  	.open		= open_dev,
>>  	.release	= release_dev,
>>  	.ioctl		= ioctl_dev,
>> +	.report_zones	= report_zones_dev,
>>  	.owner		= THIS_MODULE,
>>  };
>>  
>> @@ -815,8 +836,12 @@ static void bcache_device_free(struct bcache_device *d)
>>  	closure_debug_destroy(&d->cl);
>>  }
>>  
>> -static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>> -			      sector_t sectors, make_request_fn make_request_fn)
>> +static int bcache_device_init(struct cached_dev *dc,
>> +			      struct bcache_device *d,
>> +			      unsigned int block_size,
>> +			      sector_t sectors,
>> +			      make_request_fn make_request_fn)
>> +
>>  {
>>  	struct request_queue *q;
>>  	const size_t max_stripes = min_t(size_t, INT_MAX,
>> @@ -886,6 +911,17 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>>  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
>>  	blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>>  
>> +	/*
>> +	 * If this is for backing device registration, and it is an
>> +	 * zoned device (e.g. host-managed S.M.R. hard drive), set
>> +	 * up zoned device attribution properly for sysfs interface.
>> +	 */
>> +	if (dc && bdev_is_zoned(dc->bdev)) {
>> +		q->limits.zoned = bdev_zoned_model(dc->bdev);
>> +		q->nr_zones = blkdev_nr_zones(dc->bdev->bd_disk);
>> +		q->limits.chunk_sectors = bdev_zone_sectors(dc->bdev);
> 
> You need to call blk_revalidate_disk_zones() here:
> 
> 	blk_revalidate_disk_zones(dc->bdev->bd_disk);
> 
> but call it *after* setting the bc device capacity to
> 
> 	get_capacity(dc->bdev->bd_disk) - bdev_zone_sectors(dc->bdev);
> 
> Which I think is in fact the sectors argument to this function ?
> 
> With that information blk_revalidate_disk_zones() will check the zones and set
> q->nr_zones.
> 
> 

This is something I didn't notice. Yes I should revalidate teh zones.
Will do it in next version.


>> +	}
>> +
>>  	blk_queue_write_cache(q, true, true);
>>  
>>  	return 0;
>> @@ -1337,9 +1373,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>>  		dc->partial_stripes_expensive =
>>  			q->limits.raid_partial_stripes_expensive;
>>  
>> -	ret = bcache_device_init(&dc->disk, block_size,
>> -			 dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
>> -			 cached_dev_make_request);
>> +	ret = bcache_device_init(dc, &dc->disk, block_size,
>> +			dc->bdev->bd_part->nr_sects - dc->sb.data_offset,
> 
> dc->sb.data_offset should be the device zone size (chunk sectors) to skip the
> entire first zone and preserve the "all zones have the same size" constraint.
> 


Sure, and the bcache-tools should be fixed to recognize zoned device as
well.

>> +			cached_dev_make_request);
>>  	if (ret)
>>  		return ret;
>>  
>> @@ -1451,8 +1487,9 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
>>  
>>  	kobject_init(&d->kobj, &bch_flash_dev_ktype);
>>  
>> -	if (bcache_device_init(d, block_bytes(c), u->sectors,
>> -			flash_dev_make_request))
>> +	if (bcache_device_init(NULL, d, block_bytes(c),
>> +			       u->sectors,
>> +			       flash_dev_make_request))
>>  		goto err;
>>  
>>  	bcache_device_attach(d, c, u - c->uuids);
>>
> 
> 

Thanks for your review. I will post v2 soon.

Coly Li




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux