On Fri, Jun 11 2021 at 4:25P -0400, mwilck@xxxxxxxx <mwilck@xxxxxxxx> wrote: > From: Martin Wilck <mwilck@xxxxxxxx> > > In virtual deployments, SCSI passthrough over dm-multipath devices is a > common setup. The qemu "pr-helper" was specifically invented for it. I > believe that this is the most important real-world scenario for sending > SG_IO ioctls to device-mapper devices. > > In this configuration, guests send SCSI IO to the hypervisor in the form of > SG_IO ioctls issued by qemu. But on the device-mapper level, these SCSI > ioctls aren't treated like regular IO. Until commit 2361ae595352 ("dm mpath: > switch paths in dm_blk_ioctl() code path"), no path switching was done at > all. Worse though, if an SG_IO call fails because of a path error, > dm-multipath doesn't retry the IO on a another path; rather, the failure is > passed back to the guest, and paths are not marked as faulty. This is in > stark contrast with regular block IO of guests on dm-multipath devices, and > certainly comes as a surprise to users who switch to SCSI passthrough in > qemu. In general, users of dm-multipath devices would probably expect failover > to work at least in a basic way. > > This patch fixes this by taking a special code path for SG_IO on request- > based device mapper targets if CONFIG_DM_MULTIPATH_SG_IO is set. Rather then > just choosing a single path, sending the IO to it, and failing to the caller > if the IO on the path failed, it retries the same IO on another path for > certain error codes, using blk_path_error() to determine if a retry would > make sense for the given error code. Moreover, it sends a message to the > multipath target to mark the path as failed. > > One problem remains open: if all paths in a multipath map are failed, > normal multipath IO may switch to queueing mode (depending on > configuration). This isn't possible for SG_IO, as SG_IO requests can't > easily be queued like regular block I/O. Thus in the "no path" case, the > guest will still see an error. > > Signed-off-by: Martin Wilck <mwilck@xxxxxxxx> > --- > block/scsi_ioctl.c | 5 +- > drivers/md/Kconfig | 11 ++++ > drivers/md/Makefile | 4 ++ > drivers/md/dm-core.h | 5 ++ > drivers/md/dm-rq.h | 11 ++++ > drivers/md/dm-scsi_ioctl.c | 127 +++++++++++++++++++++++++++++++++++++ > drivers/md/dm.c | 20 +++++- > include/linux/blkdev.h | 2 + > 8 files changed, 180 insertions(+), 5 deletions(-) > create mode 100644 drivers/md/dm-scsi_ioctl.c > > diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c > index b39e0835600f..38771f4bcf18 100644 > --- a/block/scsi_ioctl.c > +++ b/block/scsi_ioctl.c > @@ -279,8 +279,8 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr, > return ret; > } > > -static int sg_io(struct request_queue *q, struct gendisk *bd_disk, > - struct sg_io_hdr *hdr, fmode_t mode) > +int sg_io(struct request_queue *q, struct gendisk *bd_disk, > + struct sg_io_hdr *hdr, fmode_t mode) > { > unsigned long start_time; > ssize_t ret = 0; > @@ -365,6 +365,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, > blk_put_request(rq); > return ret; > } > +EXPORT_SYMBOL_GPL(sg_io); > > /** > * sg_scsi_ioctl -- handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig > index f2014385d48b..f28f29e3bd11 100644 > --- a/drivers/md/Kconfig > +++ b/drivers/md/Kconfig > @@ -473,6 +473,17 @@ config DM_MULTIPATH_IOA > > If unsure, say N. > > +config DM_MULTIPATH_SG_IO > + bool "Retry SCSI generic I/O on multipath devices" > + depends on DM_MULTIPATH && BLK_SCSI_REQUEST > + help > + With this option, SCSI generic (SG) requests issued on multipath > + devices will behave similar to regular block I/O: upon failure, > + they are repeated on a different path, and the erroring device > + is marked as failed. > + > + If unsure, say N. > + Given how this is all about multipath, there is no reason to bolt this on to DM core and then play games to issuing multipath target specific DM message ("fail_path") from generic code. So the SG_IO ioctl handling code should be in dm-mpath.c and the DM target interface extended as needed. > config DM_DELAY > tristate "I/O delaying target" > depends on BLK_DEV_DM > diff --git a/drivers/md/Makefile b/drivers/md/Makefile > index ef7ddc27685c..187ea469f64a 100644 > --- a/drivers/md/Makefile > +++ b/drivers/md/Makefile > @@ -88,6 +88,10 @@ ifeq ($(CONFIG_DM_INIT),y) > dm-mod-objs += dm-init.o > endif > > +ifeq ($(CONFIG_DM_MULTIPATH_SG_IO),y) > +dm-mod-objs += dm-scsi_ioctl.o > +endif > + > ifeq ($(CONFIG_DM_UEVENT),y) > dm-mod-objs += dm-uevent.o > endif > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h > index 5953ff2bd260..8bd8a8e3916e 100644 > --- a/drivers/md/dm-core.h > +++ b/drivers/md/dm-core.h > @@ -189,4 +189,9 @@ extern atomic_t dm_global_event_nr; > extern wait_queue_head_t dm_global_eventq; > void dm_issue_global_event(void); > > +int __dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx, > + struct block_device **bdev, > + struct dm_target **target); > +void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx); > + > #endif > diff --git a/drivers/md/dm-rq.h b/drivers/md/dm-rq.h > index 1eea0da641db..c6d2853e4d1d 100644 > --- a/drivers/md/dm-rq.h > +++ b/drivers/md/dm-rq.h > @@ -44,4 +44,15 @@ ssize_t dm_attr_rq_based_seq_io_merge_deadline_show(struct mapped_device *md, ch > ssize_t dm_attr_rq_based_seq_io_merge_deadline_store(struct mapped_device *md, > const char *buf, size_t count); > > +#ifdef CONFIG_DM_MULTIPATH_SG_IO > +int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode, > + unsigned int cmd, unsigned long uarg); > +#else > +static inline int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode, > + unsigned int cmd, unsigned long uarg) > +{ > + return -ENOTTY; > +} > +#endif > + > #endif There is no reason, that I can see, why bio-based dm-multipath shouldn't handle SG_IO too. Why did you constrain it to request-based? > diff --git a/drivers/md/dm-scsi_ioctl.c b/drivers/md/dm-scsi_ioctl.c > new file mode 100644 > index 000000000000..a696e2a6557e > --- /dev/null > +++ b/drivers/md/dm-scsi_ioctl.c > @@ -0,0 +1,127 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (C) 2021 Martin Wilck, SUSE LLC > + */ > + > +#include "dm-core.h" > +#include <linux/types.h> > +#include <linux/errno.h> > +#include <linux/kernel.h> > +#include <linux/uaccess.h> > +#include <linux/blk_types.h> > +#include <linux/blkdev.h> > +#include <linux/device-mapper.h> > +#include <scsi/sg.h> > +#include <scsi/scsi_cmnd.h> > + > +#define DM_MSG_PREFIX "sg_io" > + > +int dm_sg_io_ioctl(struct block_device *bdev, fmode_t mode, > + unsigned int cmd, unsigned long uarg) > +{ > + struct mapped_device *md = bdev->bd_disk->private_data; > + struct sg_io_hdr hdr; > + void __user *arg = (void __user *)uarg; > + int rc, srcu_idx; > + char path_name[BDEVNAME_SIZE]; > + > + if (cmd != SG_IO) > + return -ENOTTY; > + > + if (copy_from_user(&hdr, arg, sizeof(hdr))) > + return -EFAULT; > + > + if (hdr.interface_id != 'S') > + return -EINVAL; > + > + if (hdr.dxfer_len > (queue_max_hw_sectors(bdev->bd_disk->queue) << 9)) > + return -EIO; > + > + for (;;) { > + struct dm_target *tgt; > + struct sg_io_hdr rhdr; > + > + rc = __dm_prepare_ioctl(md, &srcu_idx, &bdev, &tgt); > + if (rc < 0) { > + DMERR("%s: failed to get path: %d", > + __func__, rc); > + goto out; > + } > + > + rhdr = hdr; > + > + rc = sg_io(bdev->bd_disk->queue, bdev->bd_disk, &rhdr, mode); > + > + DMDEBUG("SG_IO via %s: rc = %d D%02xH%02xM%02xS%02x", > + bdevname(bdev, path_name), rc, > + rhdr.driver_status, rhdr.host_status, > + rhdr.msg_status, rhdr.status); > + > + /* > + * Errors resulting from invalid parameters shouldn't be retried > + * on another path. > + */ > + switch (rc) { > + case -ENOIOCTLCMD: > + case -EFAULT: > + case -EINVAL: > + case -EPERM: > + goto out; > + default: > + break; > + } > + > + if (rhdr.info & SG_INFO_CHECK) { > + int result; > + blk_status_t sts; > + > + result = rhdr.status | > + (rhdr.msg_status << 8) | > + (rhdr.host_status << 16) | > + (rhdr.driver_status << 24); > + > + sts = __scsi_result_to_blk_status(&result, result); > + rhdr.host_status = host_byte(result); It is the open-coded SCSI specific sg_io_hdr and result work that feels like it doesn't belong open-coded in DM. So maybe the above code from this SG_INFO_CHECK block could go into an block/scsi_ioctl.c:sg_io_info_check() method? > + > + /* See if this is a target or path error. */ > + if (sts == BLK_STS_OK) > + rc = 0; > + else if (blk_path_error(sts)) > + rc = -EIO; > + else { > + rc = blk_status_to_errno(sts); > + goto out; > + } > + } > + > + if (rc == 0) { > + /* success */ > + if (copy_to_user(arg, &rhdr, sizeof(rhdr))) > + rc = -EFAULT; > + goto out; > + } > + > + /* Failure - fail path by sending a message to the target */ > + if (!tgt->type->message) { > + DMWARN("invalid target!"); > + rc = -EIO; > + goto out; > + } else { > + char bdbuf[BDEVT_SIZE]; > + char *argv[2] = { "fail_path", bdbuf }; > + > + scnprintf(bdbuf, sizeof(bdbuf), "%u:%u", > + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); > + > + DMDEBUG("sending \"%s %s\" to target", argv[0], argv[1]); > + rc = tgt->type->message(tgt, 2, argv, NULL, 0); > + if (rc < 0) > + goto out; > + } If you factor things how I suggest below (introducing dm-mpath.c:dm_mpath_ioctl) then you'll have direct access to dm-mpath.c:fail_path() so need need to mess around with constructing DM messages from kernel code. > + > + dm_unprepare_ioctl(md, srcu_idx); > + } > +out: > + dm_unprepare_ioctl(md, srcu_idx); > + return rc; > +} > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index ca2aedd8ee7d..29b93fb3929e 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -522,8 +522,9 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector, > #define dm_blk_report_zones NULL > #endif /* CONFIG_BLK_DEV_ZONED */ > > -static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx, > - struct block_device **bdev) > +int __dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx, > + struct block_device **bdev, > + struct dm_target **target) > { > struct dm_target *tgt; > struct dm_table *map; > @@ -553,10 +554,19 @@ static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx, > goto retry; > } > > + if (r >= 0 && target) > + *target = tgt; > + > return r; > } For dm-multipath you can leverage md->immutable_target always being multipath. So after dm_blk_ioctl's dm_prepare_ioctl: if (cmd == SG_IO && md->immutable_target && md->immutable_target->ioctl) r = md->immutable_target->ioctl(bdev, mode, cmd, arg); (doing check for SG_IO here just avoids needless call into ->ioctl for now, but it could be other ioctls will need specialization in future so checking 'cmd' should be pushed down to md->immutable_target->ioctl at that time?) But I'm not following you use of a for (;;) in dm_sg_io_ioctl() -- other than to retry infinitely (you aren't checking for no paths!?, etc). Anyway, best to have md->immutable_target->ioctl return -EAGAIN and goto top of dm_blk_ioctl as needed? > > -static void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx) > +static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx, > + struct block_device **bdev) > +{ > + return __dm_prepare_ioctl(md, srcu_idx, bdev, NULL); > +} > + > +void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx) > { > dm_put_live_table(md, srcu_idx); > } > @@ -567,6 +577,10 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode, > struct mapped_device *md = bdev->bd_disk->private_data; > int r, srcu_idx; > > + if ((dm_get_md_type(md) == DM_TYPE_REQUEST_BASED) && > + ((r = dm_sg_io_ioctl(bdev, mode, cmd, arg)) != -ENOTTY)) > + return r; > + Again, bio-based multipath should work fine with SG_IO too. > r = dm_prepare_ioctl(md, &srcu_idx, &bdev); > if (r < 0) > goto out; > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 48497a77428d..b8f1d603cc7a 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -923,6 +923,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, > unsigned int, void __user *); > extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, > struct scsi_ioctl_command __user *); > +extern int sg_io(struct request_queue *, struct gendisk *, > + struct sg_io_hdr *, fmode_t); > extern int get_sg_io_hdr(struct sg_io_hdr *hdr, const void __user *argp); > extern int put_sg_io_hdr(const struct sg_io_hdr *hdr, void __user *argp); > > -- > 2.31.1 > Think adding block/scsi_ioctl.c:sg_io_info_check() and exporting it and sg_io() via blkdev.h should be done as a separate patch 2. Then patch 3 is purely DM changes to use sg_io() and sg_io_info_check() Mike