Re: [PATCH 4/4] dm: Improve zone resource limits handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, May 30, 2024 at 02:40:35PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
> 
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
> 
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
> 
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertizing no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this allows indicating to the user
> that the underlying devices used have limits.
> 
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
>    evaluated as the non-zero minimum of the limits of the target device
>    and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
>    mapped sequential zones is compared to the total number of sequential
>    zones of the target device to determine the limits: if the target
>    maps all sequential write required zones of the device, then the
>    target can reliably inherit the device limits. Otherwise, the target
>    limits are set to the device limits capped by the number of mapped
>    sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
> 
> For configurations resulting in unreliable limits, a warning message is
> issued.
> 
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
> 
> To further cleanup dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
> 
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> ---
>  drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 177 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 5d66d916730e..5f8b499529cf 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
>  	}
>  }
>  
> +struct dm_device_zone_count {
> +	sector_t start;
> +	sector_t len;
> +	unsigned int total_nr_seq_zones;
> +	unsigned int target_nr_seq_zones;
> +};
> +
>  /*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
>   */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> -			     void *data)
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> +				    unsigned int idx, void *data)
>  {
> -	unsigned int *nr_conv_zones = data;
> +	struct dm_device_zone_count *zc = data;
> +
> +	if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> +		zc->total_nr_seq_zones++;
> +		if (zone->start >= zc->start &&
> +		    zone->start < zc->start + zc->len)
> +			zc->target_nr_seq_zones++;
> +	}
>  
> -	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> -		(*nr_conv_zones)++;
> +	return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> +				 struct dm_device_zone_count *zc)
> +{
> +	int ret;
> +
> +	ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
> +				  dm_device_count_zones_cb, zc);
> +	if (ret < 0)
> +		return ret;
> +	if (!ret)
> +		return -EIO;
> +	return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> +	unsigned int mapped_nr_seq_zones;
> +	unsigned int max_open_zones;
> +	unsigned int max_active_zones;
> +	bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> +					   struct dm_dev *dev, sector_t start,
> +					   sector_t len, void *data)
> +{
> +	struct dm_zone_resource_limits *zlim = data;
> +	struct gendisk *disk = dev->bdev->bd_disk;
> +	int ret;
> +	struct dm_device_zone_count zc = {
> +		.start = start,
> +		.len = len,
> +	};
> +
> +	/*
> +	 * If the target is not the whole device, the device zone resources may
> +	 * be shared between different targets. Check this by counting the
> +	 * number of mapped sequential zones: if this number is smaller than the
> +	 * total number of sequential zones of the target device, then resource
> +	 * sharing may happen and the zone limits will not be reliable.
> +	 */
> +	ret = dm_device_count_zones(dev, &zc);
> +	if (ret) {
> +		DMERR("Count device %s zones failed",
> +		      disk->disk_name);
> +		return ret;
> +	}
> +
> +	zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> +	/*
> +	 * If the target does not map any sequential zones, then we do not need
> +	 * any zone resource limits.
> +	 */
> +	if (!zc.target_nr_seq_zones)
> +		return 0;
> +
> +	/*
> +	 * If the target does not map all sequential zones, the limits
> +	 * will not be reliable.
> +	 */
> +	if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
> +		zlim->reliable_limits = false;
> +
> +	zlim->max_active_zones =
> +		min_not_zero(disk->queue->limits.max_active_zones,
> +			     zlim->max_active_zones);
> +	if (zlim->max_active_zones != UINT_MAX)
> +		zlim->max_active_zones =
> +			min(zlim->max_active_zones, zc.target_nr_seq_zones);
> +
> +	zlim->max_open_zones =
> +		min_not_zero(disk->queue->limits.max_open_zones,
> +			     zlim->max_open_zones);
> +	if (zlim->max_open_zones != UINT_MAX)
> +		zlim->max_open_zones =
> +			min3(zlim->max_open_zones, zc.target_nr_seq_zones,
> +			     zlim->max_active_zones);
> +
> +	return 0;
> +}
> +
> +static int dm_set_zone_resource_limits(struct mapped_device *md,
> +				struct dm_table *t, struct queue_limits *lim)
> +{
> +	struct gendisk *disk = md->disk;
> +	struct dm_zone_resource_limits zlim = {
> +		.max_open_zones = UINT_MAX,
> +		.max_active_zones = UINT_MAX,
> +		.reliable_limits = true,
> +	};
> +
> +	/* Get the zone resource limits from the targets. */
> +	for (unsigned int i = 0; i < t->num_targets; i++) {
> +		struct dm_target *ti = dm_table_get_target(t, i);
> +
> +		if (!ti->type->iterate_devices ||
> +		    ti->type->iterate_devices(ti,
> +				device_get_zone_resource_limits, &zlim)) {
> +			DMERR("Could not determine %s zone resource limits",
> +			      md->disk->disk_name);
> +			return -ENODEV;
> +		}
> +	}
> +
> +	/*
> +	 * If we only have conventional zones mapped, expose the mapped device
> +	 + as a regular device.
> +	 */
> +	if (!zlim.mapped_nr_seq_zones) {
> +		lim->max_open_zones = 0;
> +		lim->max_active_zones = 0;
> +		lim->max_zone_append_sectors = 0;
> +		lim->zone_write_granularity = 0;
> +		lim->zoned = false;
> +		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +		md->nr_zones = 0;
> +		disk->nr_zones = 0;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Warn if the mapped device is partially using zone resources of the
> +	 * target devices as that leads to unreliable limits, i.e. if another
> +	 * mapped device uses the same underlying devices, we cannot enforce
> +	 * zone limits to guarantee that writing will not lead to errors.
> +	 * Note that we really should return an error for such case but there is
> +	 * no easy way to find out if another mapped device uses the same
> +	 * underlying zoned devices.
> +	 */
> +	if (!zlim.reliable_limits)
> +		DMWARN("%s zone resource limits may be unreliable",
> +		       disk->disk_name);
> +
> +	if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
> +		lim->max_open_zones = 0;
> +	else
> +		lim->max_open_zones = zlim.max_open_zones;
> +
> +	if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
> +		lim->max_active_zones = 0;
> +	else
> +		lim->max_active_zones = zlim.max_active_zones;
>  
>  	return 0;
>  }
> @@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>  	int ret;
>  
>  	/* Revalidate only if something changed. */
> -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
> +	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> +		DMINFO("%s using %s zone append",
> +		       md->disk->disk_name,
> +		       queue_emulates_zone_append(disk->queue) ?
> +		       "emulated" : "native");
>  		md->nr_zones = 0;
> +	}
>  
>  	if (md->nr_zones)
>  		return 0;
> @@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		struct queue_limits *lim)
>  {
>  	struct mapped_device *md = t->md;
> -	struct gendisk *disk = md->disk;
> -	unsigned int nr_conv_zones = 0;
>  	int ret;
>  
>  	/*
> @@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		return 0;
>  
>  	/*
> -	 * Count conventional zones to check that the mapped device will indeed 
> -	 * have sequential write required zones.
> +	 * Determine the max open and max active zone limits for the mapped
> +	 * device. For a mapped device containing only conventional zones, the
> +	 * mapped device is changed to be a regular block device, so exit early
> +	 * for such case.
>  	 */
> -	md->zone_revalidate_map = t;
> -	ret = dm_blk_report_zones(disk, 0, UINT_MAX,
> -				  dm_check_zoned_cb, &nr_conv_zones);
> -	md->zone_revalidate_map = NULL;
> -	if (ret < 0) {
> -		DMERR("Check zoned failed %d", ret);
> +	ret = dm_set_zone_resource_limits(md, t, lim);
> +	if (ret)
>  		return ret;
> -	}
> -
> -	/*
> -	 * If we only have conventional zones, expose the mapped device as
> -	 * a regular device.
> -	 */
> -	if (nr_conv_zones >= ret) {
> -		lim->max_open_zones = 0;
> -		lim->max_active_zones = 0;
> -		lim->zoned = false;
> -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> -		disk->nr_zones = 0;
> +	if (!lim->zoned)
>  		return 0;
> -	}
> -
> -	if (!md->disk->nr_zones) {
> -		DMINFO("%s using %s zone append",
> -		       md->disk->disk_name,
> -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> -	}

I would have moved this print in a separate commit.

Regardless:
Reviewed-by: Niklas Cassel <cassel@xxxxxxxxxx>




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux