On Thu, May 30, 2024 at 02:40:35PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertised, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
>
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such a situation
> happens when a target maps all sequential zones of a zoned block device:
> for such a mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
>
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
>
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertising no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this allows indicating to the user
> that the underlying devices used have limits.
>
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
>    evaluated as the non-zero minimum of the limits of the target device
>    and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
>    mapped sequential zones is compared to the total number of sequential
>    zones of the target device to determine the limits: if the target
>    maps all sequential write required zones of the device, then the
>    target can reliably inherit the device limits. Otherwise, the target
>    limits are set to the device limits capped by the number of mapped
>    sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
>
> For configurations resulting in unreliable limits, a warning message is
> issued.
>
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
>
> To further clean up dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
>
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> ---
>  drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 177 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 5d66d916730e..5f8b499529cf 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
>  	}
>  }
>
> +struct dm_device_zone_count {
> +	sector_t start;
> +	sector_t len;
> +	unsigned int total_nr_seq_zones;
> +	unsigned int target_nr_seq_zones;
> +};
> +
>  /*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
>   */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> -			     void *data)
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> +				    unsigned int idx, void *data)
>  {
> -	unsigned int *nr_conv_zones = data;
> +	struct dm_device_zone_count *zc = data;
> +
> +	if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> +		zc->total_nr_seq_zones++;
> +		if (zone->start >= zc->start &&
> +		    zone->start < zc->start + zc->len)
> +			zc->target_nr_seq_zones++;
> +	}
>
> -	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> -		(*nr_conv_zones)++;
> +	return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> +				 struct dm_device_zone_count *zc)
> +{
> +	int ret;
> +
> +	ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
> +				  dm_device_count_zones_cb, zc);
> +	if (ret < 0)
> +		return ret;
> +	if (!ret)
> +		return -EIO;
> +	return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> +	unsigned int mapped_nr_seq_zones;
> +	unsigned int max_open_zones;
> +	unsigned int max_active_zones;
> +	bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> +					   struct dm_dev *dev, sector_t start,
> +					   sector_t len, void *data)
> +{
> +	struct dm_zone_resource_limits *zlim = data;
> +	struct gendisk *disk = dev->bdev->bd_disk;
> +	int ret;
> +	struct dm_device_zone_count zc = {
> +		.start = start,
> +		.len = len,
> +	};
> +
> +	/*
> +	 * If the target is not the whole device, the device zone resources may
> +	 * be shared between different targets. Check this by counting the
> +	 * number of mapped sequential zones: if this number is smaller than the
> +	 * total number of sequential zones of the target device, then resource
> +	 * sharing may happen and the zone limits will not be reliable.
> +	 */
> +	ret = dm_device_count_zones(dev, &zc);
> +	if (ret) {
> +		DMERR("Count device %s zones failed",
> +		      disk->disk_name);
> +		return ret;
> +	}
> +
> +	zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> +	/*
> +	 * If the target does not map any sequential zones, then we do not need
> +	 * any zone resource limits.
> +	 */
> +	if (!zc.target_nr_seq_zones)
> +		return 0;
> +
> +	/*
> +	 * If the target does not map all sequential zones, the limits
> +	 * will not be reliable.
> +	 */
> +	if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
> +		zlim->reliable_limits = false;
> +
> +	zlim->max_active_zones =
> +		min_not_zero(disk->queue->limits.max_active_zones,
> +			     zlim->max_active_zones);
> +	if (zlim->max_active_zones != UINT_MAX)
> +		zlim->max_active_zones =
> +			min(zlim->max_active_zones, zc.target_nr_seq_zones);
> +
> +	zlim->max_open_zones =
> +		min_not_zero(disk->queue->limits.max_open_zones,
> +			     zlim->max_open_zones);
> +	if (zlim->max_open_zones != UINT_MAX)
> +		zlim->max_open_zones =
> +			min3(zlim->max_open_zones, zc.target_nr_seq_zones,
> +			     zlim->max_active_zones);
> +
> +	return 0;
> +}
> +
> +static int dm_set_zone_resource_limits(struct mapped_device *md,
> +		struct dm_table *t, struct queue_limits *lim)
> +{
> +	struct gendisk *disk = md->disk;
> +	struct dm_zone_resource_limits zlim = {
> +		.max_open_zones = UINT_MAX,
> +		.max_active_zones = UINT_MAX,
> +		.reliable_limits = true,
> +	};
> +
> +	/* Get the zone resource limits from the targets. */
> +	for (unsigned int i = 0; i < t->num_targets; i++) {
> +		struct dm_target *ti = dm_table_get_target(t, i);
> +
> +		if (!ti->type->iterate_devices ||
> +		    ti->type->iterate_devices(ti,
> +				device_get_zone_resource_limits, &zlim)) {
> +			DMERR("Could not determine %s zone resource limits",
> +			      md->disk->disk_name);
> +			return -ENODEV;
> +		}
> +	}
> +
> +	/*
> +	 * If we only have conventional zones mapped, expose the mapped device
> +	 * as a regular device.
> +	 */
> +	if (!zlim.mapped_nr_seq_zones) {
> +		lim->max_open_zones = 0;
> +		lim->max_active_zones = 0;
> +		lim->max_zone_append_sectors = 0;
> +		lim->zone_write_granularity = 0;
> +		lim->zoned = false;
> +		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +		md->nr_zones = 0;
> +		disk->nr_zones = 0;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Warn if the mapped device is partially using zone resources of the
> +	 * target devices as that leads to unreliable limits, i.e. if another
> +	 * mapped device uses the same underlying devices, we cannot enforce
> +	 * zone limits to guarantee that writing will not lead to errors.
> +	 * Note that we really should return an error for such case but there is
> +	 * no easy way to find out if another mapped device uses the same
> +	 * underlying zoned devices.
> +	 */
> +	if (!zlim.reliable_limits)
> +		DMWARN("%s zone resource limits may be unreliable",
> +		       disk->disk_name);
> +
> +	if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
> +		lim->max_open_zones = 0;
> +	else
> +		lim->max_open_zones = zlim.max_open_zones;
> +
> +	if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
> +		lim->max_active_zones = 0;
> +	else
> +		lim->max_active_zones = zlim.max_active_zones;
>
>  	return 0;
>  }
> @@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>  	int ret;
>
>  	/* Revalidate only if something changed. */
> -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
> +	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> +		DMINFO("%s using %s zone append",
> +		       md->disk->disk_name,
> +		       queue_emulates_zone_append(disk->queue) ?
> +		       "emulated" : "native");
>  		md->nr_zones = 0;
> +	}
>
>  	if (md->nr_zones)
>  		return 0;
> @@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  			      struct queue_limits *lim)
>  {
>  	struct mapped_device *md = t->md;
> -	struct gendisk *disk = md->disk;
> -	unsigned int nr_conv_zones = 0;
>  	int ret;
>
>  	/*
> @@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		return 0;
>
>  	/*
> -	 * Count conventional zones to check that the mapped device will indeed
> -	 * have sequential write required zones.
> +	 * Determine the max open and max active zone limits for the mapped
> +	 * device. For a mapped device containing only conventional zones, the
> +	 * mapped device is changed to be a regular block device, so exit early
> +	 * for such case.
>  	 */
> -	md->zone_revalidate_map = t;
> -	ret = dm_blk_report_zones(disk, 0, UINT_MAX,
> -				  dm_check_zoned_cb, &nr_conv_zones);
> -	md->zone_revalidate_map = NULL;
> -	if (ret < 0) {
> -		DMERR("Check zoned failed %d", ret);
> +	ret = dm_set_zone_resource_limits(md, t, lim);
> +	if (ret)
>  		return ret;
> -	}
> -
> -	/*
> -	 * If we only have conventional zones, expose the mapped device as
> -	 * a regular device.
> -	 */
> -	if (nr_conv_zones >= ret) {
> -		lim->max_open_zones = 0;
> -		lim->max_active_zones = 0;
> -		lim->zoned = false;
> -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> -		disk->nr_zones = 0;
> +	if (!lim->zoned)
>  		return 0;
> -	}
> -
> -	if (!md->disk->nr_zones) {
> -		DMINFO("%s using %s zone append",
> -		       md->disk->disk_name,
> -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> -	}

I would have moved this print to a separate commit. Regardless:

Reviewed-by: Niklas Cassel <cassel@xxxxxxxxxx>