On Tue, Sep 20 2022 at 5:11P -0400,
Pankaj Raghav <p.raghav@xxxxxxxxxxx> wrote:

> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0             3M            6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0             3M  4M             7M  8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'
>
> The target does not support partial mapping of the underlying
> device as there is no use-case for it.
>
> Note:
> This target is not related to dm-zoned target, which exposes a zoned block
> device as a regular block device without any write constraint.
>
> This target only exposes a different zone size than the underlying device.
> The underlying device's other constraints will be directly exposed to the
> target.
>
> Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> Suggested-by: Damien Le Moal <damien.lemoal@xxxxxxx>
> Suggested-by: Hannes Reinecke <hare@xxxxxxx>
> ---
>  .../admin-guide/device-mapper/dm-po2zoned.rst |  79 +++++
>  .../admin-guide/device-mapper/index.rst       |   1 +
>  drivers/md/Kconfig                            |  10 +
>  drivers/md/Makefile                           |   2 +
>  drivers/md/dm-po2zoned-target.c               | 291 ++++++++++++++++++
>  5 files changed, 383 insertions(+)
>  create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
>  create mode 100644 drivers/md/dm-po2zoned-target.c
>
> diff --git a/Documentation/admin-guide/device-mapper/dm-po2zoned.rst b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> new file mode 100644
> index 000000000000..8a35eab0b714
> --- /dev/null
> +++ b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> @@ -0,0 +1,79 @@
> +===========
> +dm-po2zoned
> +===========
> +The dm-po2zoned device mapper target exposes a zoned block device with a
> +non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
> +number of sectors per zone(zone size).
> +The filesystems that support zoned block devices such as F2FS and BTRFS
> +assume po2 zone size as the kernel has traditionally only supported
> +those devices. However, as the kernel now supports zoned block devices with
> +npo2 zone sizes, the filesystems can run on top of the dm-po2zoned target before
> +adding native support.
> +
> +Partial mapping of the underlying device is not supported by this target as
> +there is no use-case for it.
> +
> +.. note::
> +   This target is **not related** to **dm-zoned target**, which exposes a
> +   zoned block device as a regular block device without any write constraint.
> +
> +   This target only exposes a different **zone size** than the underlying device.
> +   The underlying device's other **constraints** will be exposed to the target.
> +
> +Algorithm
> +=========
> +The device mapper target maps the underlying device's zone size to the
> +zone capacity and changes the zone size to the nearest po2 zone size.
> +The gap between the zone capacity and the zone size is emulated in the target.
> +E.g., a zoned block device with a zone size (and capacity) of 3M will have an
> +equivalent target layout with mapping as follows:
> +
> +::
> +
> +  0M           3M  4M        6M  8M
> +  |             |  |          |  |
> +  +x------------+--+x---------+--+x-------  Target
> +  |x            |  |x         |  |x
> +   x                x             x
> +   x               x             x
> +   x              x             x
> +   x             x             x
> +  |x            |x            |x
> +  +x------------+x------------+x----------  Device
> +  |             |             |
> +  0M           3M            6M
> +
> +A simple remap is performed for all the BIOs that do not cross the
> +emulation gap area, i.e., the area between the zone capacity and size.
> +
> +If a BIO lies in the emulation gap area, the following operations are performed:
> +
> +    Read:
> +        - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
> +        - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
> +          and remap only the BIO within the zone capacity boundary. The other part of the split BIO
> +          will be zeroed out.
> +
> +    Other operations:
> +        - Return an error
> +
> +Table parameters
> +================
> +
> +::
> +
> +  <dev path>
> +
> +Mandatory parameters:
> +
> +    <dev path>:
> +        Full pathname to the underlying block-device, or a
> +        "major:minor" device-number.
> +
> +Examples
> +========
> +
> +::
> +
> +  #!/bin/sh
> +  echo "0 `blockdev --getsz $1` po2zoned $1" | dmsetup create po2z
> diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
> index cde52cc09645..5df93711cef5 100644
> --- a/Documentation/admin-guide/device-mapper/index.rst
> +++ b/Documentation/admin-guide/device-mapper/index.rst
> @@ -23,6 +23,7 @@ Device Mapper
>      dm-service-time
>      dm-uevent
>      dm-zoned
> +    dm-po2zoned
>      era
>      kcopyd
>      linear
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdbc4e..74fdfd02ab5f 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,16 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>
> +config DM_PO2ZONED
> +	tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	  A target that converts a zoned block device with non-power-of-2(npo2)
> +	  number of sectors per zone to be power-of-2(po2). Use this target for
> +	  zoned block devices with npo2 number of sectors per zone until native
> +	  support is added to the filesystems and applications.
> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 84291e38dca8..ee05722bc637 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y += dm-era-target.o
>  dm-clone-y += dm-clone-target.o dm-clone-metadata.o
>  dm-verity-y += dm-verity-target.o
>  dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-po2zoned-y += dm-po2zoned-target.o
>
>  md-mod-y += md.o md-bitmap.o
>  raid456-y += raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
>  obj-$(CONFIG_DM_DELAY) += dm-delay.o
>  obj-$(CONFIG_DM_DUST) += dm-dust.o
>  obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
> +obj-$(CONFIG_DM_PO2ZONED) += dm-po2zoned.o
>  obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
> diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
> new file mode 100644
> index 000000000000..1d2f46a594f8
> --- /dev/null
> +++ b/drivers/md/dm-po2zoned-target.c
> @@ -0,0 +1,291 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "po2zoned"
> +
> +struct dm_po2z_target {
> +	struct dm_dev *dev;
> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
> +	unsigned int zone_size_po2_shift;
> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
> +	unsigned int nr_zones;
> +};
> +
> +static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
> +					sector_t sect)
> +{
> +	return div64_u64(sect, dmh->zone_size);
> +}
> +
> +static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
> +				       sector_t sect)
> +{
> +	return sect >> dmh->zone_size_po2_shift;
> +}
> +
> +static inline sector_t device_to_target_sect(struct dm_target *ti,
> +					     sector_t sect)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff) +
> +	       ti->begin;
> +}
> +
> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct dm_po2z_target *dmh = NULL;
> +	int ret;
> +	sector_t zone_size;
> +	sector_t dev_capacity;
> +
> +	if (argc != 1)
> +		return -EINVAL;
> +
> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> +	if (!dmh)
> +		return -ENOMEM;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &dmh->dev);
> +	if (ret) {
> +		ti->error = "Device lookup failed";
> +		goto bad;
> +	}
> +
> +	if (!bdev_is_zoned(dmh->dev->bdev)) {
> +		DMERR("%pg is not a zoned device", dmh->dev->bdev);
> +		ret = -EINVAL;
> +		goto bad;
> +	}
> +
> +	zone_size = bdev_zone_sectors(dmh->dev->bdev);
> +	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> +	if (ti->len != dev_capacity) {
> +		DMERR("%pg Partial mapping of the target is not supported",
> +		      dmh->dev->bdev);
> +		ret = -EINVAL;
> +		goto bad;
> +	}
> +
> +	if (is_power_of_2(zone_size))
> +		DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
> +		       dmh->dev->bdev);
> +
> +	dmh->zone_size = zone_size;
> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> +	dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> +	ti->private = dmh;
> +	ti->max_io_len = dmh->zone_size_po2;
> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> +	return 0;
> +
> +bad:
> +	kfree(dmh);
> +	return ret;
> +}

This error handling still isn't correct. You're using
dm_get_device(). If you return early due to an error _after_
dm_get_device() has succeeded, you need to call dm_put_device().

Basically you need a new label above "bad:" that calls dm_put_device()
and then falls through to "bad:". Or you need to explicitly call
dm_put_device() before each "goto bad;" that follows a successful
dm_get_device(), i.e. in both the if (!bdev_is_zoned(...)) and the
if (ti->len != dev_capacity) error branches.
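
To illustrate the first option (a new label above "bad:"), here is a
minimal userspace sketch of the control flow only. It is not kernel
code: struct fake_dev, struct fake_target, get_device(), put_device()
and ctr_sketch() are hypothetical stand-ins for struct dm_dev,
struct dm_po2z_target, dm_get_device(), dm_put_device() and
dm_po2z_ctr(), with a plain counter in place of the real device
reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dm_dev; counts outstanding references. */
struct fake_dev {
	int refs;
	bool zoned;
	unsigned long capacity;
};

/* Hypothetical stand-in for struct dm_po2z_target. */
struct fake_target {
	struct fake_dev *dev;
};

/* Stand-in for dm_get_device(): takes a reference on success. */
static int get_device(struct fake_dev *d, struct fake_dev **out)
{
	if (!d)
		return -ENODEV;
	d->refs++;
	*out = d;
	return 0;
}

/* Stand-in for dm_put_device(): drops the reference taken above. */
static void put_device(struct fake_dev *d)
{
	d->refs--;
}

/*
 * Constructor skeleton with two unwind labels. Failures before the
 * reference is taken jump to "bad"; failures after it jump to
 * "bad_put", which drops the reference and falls through to "bad".
 */
static int ctr_sketch(struct fake_dev *d, unsigned long len,
		      struct fake_target **out)
{
	struct fake_target *t;
	int ret;

	assert(out);

	t = malloc(sizeof(*t));
	if (!t)
		return -ENOMEM;

	ret = get_device(d, &t->dev);
	if (ret)
		goto bad;		/* no reference held yet */

	if (!t->dev->zoned) {
		ret = -EINVAL;
		goto bad_put;		/* reference held: must put */
	}

	if (len != t->dev->capacity) {
		ret = -EINVAL;
		goto bad_put;		/* reference held: must put */
	}

	*out = t;
	return 0;

bad_put:
	put_device(t->dev);
bad:
	free(t);
	return ret;
}
```

In the actual constructor the same shape would presumably mean a
"bad_put:"-style label that calls dm_put_device(ti, dmh->dev) just
above the existing "bad:" label, with the two post-lookup error
branches jumping to it instead of to "bad:".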
Mike