On Thu, May 22, 2014 at 11:55:29AM +0200, Alexander Graf wrote: > >On 22.05.14 10:23, Gavin Shan wrote: >>The patch adds new IOCTL commands for VFIO PCI device to support >>EEH functionality for PCI devices, which have been passed through >>from host to somebody else via VFIO. >> >>Signed-off-by: Gavin Shan <gwshan@xxxxxxxxxxxxxxxxxx> > >This already looks a *lot* more sane than the previous versions. >We're slowly getting there I think ;). > >Ben, could you please check through the exported EEH interface itself >and verify whether it does all the lockings correctly, uses the API >properly and doesn't allow for a user space program to blow up the >system? > > >>--- >> Documentation/vfio.txt | 88 ++++++++++- >> arch/powerpc/include/asm/eeh.h | 17 +++ >> arch/powerpc/kernel/eeh.c | 321 +++++++++++++++++++++++++++++++++++++++++ >> drivers/vfio/pci/vfio_pci.c | 131 ++++++++++++++++- >> include/uapi/linux/vfio.h | 53 +++++++ >> 5 files changed, 603 insertions(+), 7 deletions(-) >> >>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt >>index b9ca023..dd13db6 100644 >>--- a/Documentation/vfio.txt >>+++ b/Documentation/vfio.txt >>@@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides >> an excellent performance which has limitations such as inability to do >> locked pages accounting in real time. >>-So 3 additional ioctls have been added: >>+4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services, >>+which works on the basis of additional ioctl commands. >>+ >>+So 8 additional ioctls have been added: >> VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start >> of the DMA window on the PCI bus. >>@@ -316,6 +319,20 @@ So 3 additional ioctls have been added: >> VFIO_IOMMU_DISABLE - disables the container. >>+ VFIO_EEH_PE_SET_OPTION - enables or disables EEH functinality on the > >functionality > Typo. Will fix in next revision. >>+ specified device. Also, it can be used to remove IO or DMA >>+ stopped state on the frozen PE. >>+ >>+ VFIO_EEH_PE_GET_ADDR - retrieve the unique address of the specified >>+ PE or query PE sharing mode. > >What is PE? > will add description about that to Documentation/vfio.txt. >>+ >>+ VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state. >>+ >>+ VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for >>+ error recovering. >>+ >>+ VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's >>+ one of the major steps for error recoverying. >> The code flow from the example above should be slightly changed: >>@@ -346,6 +363,75 @@ The code flow from the example above should be slightly changed: >> ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map); >> ..... >>+Based on the initial example we have, the following piece of code could be >>+reference for EEH setup and error handling: >>+ >>+ struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) }; >>+ struct vfio_eeh_pe_get_addr addr = { .argsz = sizeof(addr) }; >>+ struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) }; >>+ struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) }; >>+ struct vfio_eeh_pe_configure config = { .argsz = sizeof(config) }; >>+ >>+ .... >>+ >>+ /* Get a file descriptor for the device */ >>+ device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0"); >>+ >>+ /* Enable the EEH functionality on the device */ >>+ option.option = EEH_OPT_ENABLE; >>+ ioctl(device, VFIO_EEH_PE_SET_OPTION, &option); >>+ >>+ /* Retrieve PE address and create and maintain PE by yourself */ >>+ addr.option = EEH_OPT_GET_PE_ADDR; >>+ ioctl(device, VFIO_EEH_PE_GET_ADDR, &addr); >>+ >>+ /* Assure EEH is supported on the PE and make PE functional */ >>+ ioctl(device, VFIO_EEH_PE_GET_STATE, &state); >>+ >>+ /* Save device's state. pci_save_state() would be good enough >>+ * as an example. >>+ */ >>+ >>+ /* Test and setup the device */ >>+ ioctl(device, VFIO_DEVICE_GET_INFO, &device_info); >>+ >>+ .... >>+ >>+ /* When 0xFF's returned from reading PCI config space or IO BARs >>+ * of the PCI device. Check the PE state to see if that has been >>+ * frozen. >>+ */ >>+ ioctl(device, VFIO_EEH_PE_GET_STATE, &state); >>+ >>+ /* Waiting for pending PCI transactions to be completed and don't >>+ * produce any more PCI traffic from/to the affected PE until >>+ * recovery is finished. >>+ */ >>+ >>+ /* Enable IO for the affected PE and collect logs. Usually, the >>+ * standard part of PCI config space, AER registers are dumped >>+ * as logs for further analysis. >>+ */ >>+ option.option = EEH_OPT_THAW_MMIO; >>+ ioctl(device, VFIO_EEH_PE_SET_OPTION, &option); >>+ >>+ /* Issue PE reset */ >>+ reset.option = EEH_RESET_HOT; >>+ ioctl(device, VFIO_EEH_PE_RESET, &reset); >>+ >>+ /* Configure the PCI bridges for the affected PE */ >>+ ioctl(device, VFIO_EEH_PE_CONFIGURE, NULL); >>+ >>+ /* Restored state we saved at initialization time. pci_restore_state() >>+ * is good enough as an example. >>+ */ >>+ >>+ /* Hopefully, error is recovered successfully. Now, you can resume to >>+ * start PCI traffic to/from the affected PE. >>+ */ >>+ >>+ .... >>+ >> ------------------------------------------------------------------------------- >> [1] VFIO was originally an acronym for "Virtual Function I/O" in its >>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h >>index 34a2d83..dd5f1cf 100644 >>--- a/arch/powerpc/include/asm/eeh.h >>+++ b/arch/powerpc/include/asm/eeh.h >>@@ -191,6 +191,11 @@ enum { >> #define EEH_OPT_ENABLE 1 /* EEH enable */ >> #define EEH_OPT_THAW_MMIO 2 /* MMIO enable */ >> #define EEH_OPT_THAW_DMA 3 /* DMA enable */ >>+#define EEH_OPT_GET_PE_ADDR 0 /* Get PE addr */ >>+#define EEH_OPT_GET_PE_MODE 1 /* Get PE mode */ >>+#define EEH_PE_MODE_NONE 0 /* Invalid mode */ >>+#define EEH_PE_MODE_NOT_SHARED 1 /* Not shared */ >>+#define EEH_PE_MODE_SHARED 2 /* Shared mode */ >> #define EEH_STATE_UNAVAILABLE (1 << 0) /* State unavailable */ >> #define EEH_STATE_NOT_SUPPORT (1 << 1) /* EEH not supported */ >> #define EEH_STATE_RESET_ACTIVE (1 << 2) /* Active reset */ >>@@ -198,6 +203,11 @@ enum { >> #define EEH_STATE_DMA_ACTIVE (1 << 4) /* Active DMA */ >> #define EEH_STATE_MMIO_ENABLED (1 << 5) /* MMIO enabled */ >> #define EEH_STATE_DMA_ENABLED (1 << 6) /* DMA enabled */ >>+#define EEH_PE_STATE_NORMAL 0 /* Normal state */ >>+#define EEH_PE_STATE_RESET 1 /* PE reset */ >>+#define EEH_PE_STATE_STOPPED_IO_DMA 2 /* Stopped */ >>+#define EEH_PE_STATE_STOPPED_DMA 4 /* Stopped DMA */ >>+#define EEH_PE_STATE_UNAVAIL 5 /* Unavailable */ >> #define EEH_RESET_DEACTIVATE 0 /* Deactivate the PE reset */ >> #define EEH_RESET_HOT 1 /* Hot reset */ >> #define EEH_RESET_FUNDAMENTAL 3 /* Fundamental reset */ >>@@ -305,6 +315,13 @@ void eeh_add_device_late(struct pci_dev *); >> void eeh_add_device_tree_late(struct pci_bus *); >> void eeh_add_sysfs_files(struct pci_bus *); >> void eeh_remove_device(struct pci_dev *); >>+int eeh_dev_open(struct pci_dev *pdev); >>+void eeh_dev_release(struct pci_dev *pdev); >>+int eeh_pe_set_option(struct pci_dev *pdev, int option); >>+int eeh_pe_get_addr(struct pci_dev *pdev, int option); >>+int eeh_pe_get_state(struct pci_dev *pdev); >>+int eeh_pe_reset(struct pci_dev *pdev, int option); >>+int eeh_pe_configure(struct pci_dev *pdev); >> /** >> * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. >>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c >>index 9c6b899..b90a474 100644 >>--- a/arch/powerpc/kernel/eeh.c >>+++ b/arch/powerpc/kernel/eeh.c >>@@ -108,6 +108,9 @@ struct eeh_ops *eeh_ops = NULL; >> /* Lock to avoid races due to multiple reports of an error */ >> DEFINE_RAW_SPINLOCK(confirm_error_lock); >>+/* Lock to protect passed flags */ >>+static DEFINE_MUTEX(eeh_dev_mutex); >>+ >> /* Buffer for reporting pci register dumps. Its here in BSS, and >> * not dynamically alloced, so that it ends up in RMO where RTAS >> * can access it. >>@@ -1098,6 +1101,324 @@ void eeh_remove_device(struct pci_dev *dev) >> edev->mode &= ~EEH_DEV_SYSFS; >> } >>+/** >>+ * eeh_dev_open - Mark EEH device and PE as passed through >>+ * @pdev: PCI device >>+ * >>+ * Mark the indicated EEH device and PE as passed through. >>+ * In the result, the EEH errors detected on the PE won't be >>+ * reported. The owner of the device will be responsible for >>+ * detection and recovery. >>+ */ >>+int eeh_dev_open(struct pci_dev *pdev) >>+{ >>+ struct eeh_dev *edev; >>+ >>+ mutex_lock(&eeh_dev_mutex); >>+ >>+ /* No PCI device ? */ >>+ if (!pdev) { >>+ mutex_unlock(&eeh_dev_mutex); >>+ return -ENODEV; >>+ } >>+ >>+ /* No EEH device ? */ >>+ edev = pci_dev_to_eeh_dev(pdev); >>+ if (!edev || !edev->pe) { >>+ mutex_unlock(&eeh_dev_mutex); >>+ return -ENODEV; >>+ } >>+ >>+ eeh_dev_set_passed(edev, true); >>+ eeh_pe_set_passed(edev->pe, true); >>+ mutex_unlock(&eeh_dev_mutex); >>+ >>+ return 0; >>+} >>+EXPORT_SYMBOL_GPL(eeh_dev_open); >>+ >>+/** >>+ * eeh_dev_release - Reclaim the ownership of EEH device >>+ * @pdev: PCI device >>+ * >>+ * Reclaim ownership of EEH device, potentially the corresponding >>+ * PE. In the result, the EEH errors detected on the PE will be >>+ * reported and handled as usual. >>+ */ >>+void eeh_dev_release(struct pci_dev *pdev) >>+{ >>+ bool release_pe = true; >>+ struct eeh_pe *pe = NULL; >>+ struct eeh_dev *tmp, *edev; >>+ >>+ mutex_lock(&eeh_dev_mutex); >>+ >>+ /* No PCI device ? */ >>+ if (!pdev) { >>+ mutex_unlock(&eeh_dev_mutex); >>+ return; >>+ } >>+ >>+ /* No EEH device ? */ >>+ edev = pci_dev_to_eeh_dev(pdev); >>+ if (!edev || !eeh_dev_passed(edev) || >>+ !edev->pe || !eeh_pe_passed(pe)) { >>+ mutex_unlock(&eeh_dev_mutex); >>+ return; >>+ } >>+ >>+ /* Release device */ >>+ pe = edev->pe; >>+ eeh_dev_set_passed(edev, false); >>+ >>+ /* Release PE */ >>+ eeh_pe_for_each_dev(pe, edev, tmp) { >>+ if (eeh_dev_passed(edev)) { >>+ release_pe = false; >>+ break; >>+ } >>+ } >>+ >>+ if (release_pe) >>+ eeh_pe_set_passed(pe, false); >>+ >>+ mutex_unlock(&eeh_dev_mutex); >>+} >>+EXPORT_SYMBOL(eeh_dev_release); >>+ >>+static int eeh_dev_check(struct pci_dev *pdev, >>+ struct eeh_dev **pedev, >>+ struct eeh_pe **ppe) >>+{ >>+ struct eeh_dev *edev; >>+ >>+ /* No device ? */ >>+ if (!pdev) >>+ return -ENODEV; >>+ >>+ edev = pci_dev_to_eeh_dev(pdev); >>+ if (!edev || !eeh_dev_passed(edev) || >>+ !edev->pe || !eeh_pe_passed(edev->pe)) >>+ return -ENODEV; >>+ >>+ if (pedev) >>+ *pedev = edev; >>+ if (ppe) >>+ *ppe = edev->pe; >>+ >>+ return 0; >>+} >>+ >>+/** >>+ * eeh_pe_set_option - Set options for the indicated PE >>+ * @pdev: PCI device >>+ * @option: requested option >>+ * >>+ * The routine is called to enable or disable EEH functionality >>+ * on the indicated PE, to enable IO or DMA for the frozen PE. >>+ */ >>+int eeh_pe_set_option(struct pci_dev *pdev, int option) >>+{ >>+ struct eeh_dev *edev; >>+ struct eeh_pe *pe; >>+ int ret = 0; >>+ >>+ /* Device existing ? */ >>+ ret = eeh_dev_check(pdev, &edev, &pe); >>+ if (ret) >>+ return ret; >>+ >>+ switch (option) { >>+ case EEH_OPT_DISABLE: >>+ case EEH_OPT_ENABLE: > >This deserves a comment > Yep. will add it, thanks. >>+ break; >>+ case EEH_OPT_THAW_MMIO: >>+ case EEH_OPT_THAW_DMA: >>+ if (!eeh_ops || !eeh_ops->set_option) { >>+ ret = -ENOENT; >>+ break; >>+ } >>+ >>+ ret = eeh_ops->set_option(pe, option); >>+ break; >>+ default: >>+ pr_debug("%s: Option %d out of range (%d, %d)\n", >>+ __func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA); >>+ ret = -EINVAL; >>+ } >>+ >>+ return ret; >>+} >>+EXPORT_SYMBOL_GPL(eeh_pe_set_option); >>+ >>+/** >>+ * eeh_pe_get_addr - Retrieve the PE address or sharing mode >>+ * @pdev: PCI device >>+ * @option: option >>+ * >>+ * Retrieve the PE address or sharing mode. >>+ */ >>+int eeh_pe_get_addr(struct pci_dev *pdev, int option) >>+{ >>+ struct pci_bus *bus; >>+ struct eeh_dev *edev; >>+ struct eeh_pe *pe; >>+ int ret = 0; >>+ >>+ /* Device existing ? */ >>+ ret = eeh_dev_check(pdev, &edev, &pe); >>+ if (ret) > >Probably safer to write this as ret < 0. Positive return values are >success now. > Ok. Will fix it. >>+ return ret; >>+ >>+ switch (option) { >>+ case EEH_OPT_GET_PE_ADDR: >>+ bus = eeh_pe_bus_get(pe); >>+ if (!bus) { >>+ ret = -ENODEV; >>+ break; >>+ } >>+ >>+ /* PE address has format "00BBSS00" */ >>+ ret = bus->number << 16; >>+ break; >>+ case EEH_OPT_GET_PE_MODE: >>+ /* Wa always have shared PE */ > >Wa? > Basically, PE could have one single PCI device (function), or a PCI bus (including subordinate PCI devices). we always have the later case currently. And it's called PE in "shared mode" and I called it as "shared PE" :-) >>+ ret = EEH_PE_MODE_SHARED; >>+ break; >>+ default: >>+ pr_debug("%s: option %d out of range (0, 1)\n", >>+ __func__, option); >>+ ret = -EINVAL; >>+ } >>+ >>+ return ret; >>+} >>+EXPORT_SYMBOL_GPL(eeh_pe_get_addr); >>+ >>+/** >>+ * eeh_pe_get_state - Retrieve PE's state >>+ * @pdev: PCI device >>+ * >>+ * Retrieve the PE's state, which includes 3 aspects: enabled >>+ * DMA, enabled IO and asserted reset. >>+ */ >>+int eeh_pe_get_state(struct pci_dev *pdev) >>+{ >>+ struct eeh_dev *edev; >>+ struct eeh_pe *pe; >>+ int result, ret = 0; >>+ >>+ /* Device existing ? */ >>+ ret = eeh_dev_check(pdev, &edev, &pe); >>+ if (ret) >>+ return ret; >>+ >>+ if (!eeh_ops || !eeh_ops->get_state) >>+ return -ENOENT; >>+ >>+ result = eeh_ops->get_state(pe, NULL); >>+ if (!(result & EEH_STATE_RESET_ACTIVE) && >>+ (result & EEH_STATE_DMA_ENABLED) && >>+ (result & EEH_STATE_MMIO_ENABLED)) >>+ ret = EEH_PE_STATE_NORMAL; >>+ else if (result & EEH_STATE_RESET_ACTIVE) >>+ ret = EEH_PE_STATE_RESET; >>+ else if (!(result & EEH_STATE_RESET_ACTIVE) && >>+ !(result & EEH_STATE_DMA_ENABLED) && >>+ !(result & EEH_STATE_MMIO_ENABLED)) >>+ ret = EEH_PE_STATE_STOPPED_IO_DMA; >>+ else if (!(result & EEH_STATE_RESET_ACTIVE) && >>+ (result & EEH_STATE_DMA_ENABLED) && >>+ !(result & EEH_STATE_MMIO_ENABLED)) >>+ ret = EEH_PE_STATE_STOPPED_DMA; >>+ else >>+ ret = EEH_PE_STATE_UNAVAIL; >>+ >>+ return ret; >>+} >>+EXPORT_SYMBOL_GPL(eeh_pe_get_state); >>+ >>+/** >>+ * eeh_pe_reset - Issue PE reset according to specified type >>+ * @pdev: PCI device >>+ * @option: reset type >>+ * >>+ * The routine is called to reset the specified PE with the >>+ * indicated type, either fundamental reset or hot reset. >>+ * PE reset is the most important part for error recovery. >>+ */ >>+int eeh_pe_reset(struct pci_dev *pdev, int option) >>+{ >>+ struct eeh_dev *edev; >>+ struct eeh_pe *pe; >>+ int ret = 0; >>+ >>+ /* Device existing ? */ >>+ ret = eeh_dev_check(pdev, &edev, &pe); >>+ if (ret) >>+ return ret; >>+ >>+ if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset) >>+ return -ENOENT; >>+ >>+ switch (option) { >>+ case EEH_RESET_DEACTIVATE: >>+ ret = eeh_ops->reset(pe, option); >>+ if (ret) >>+ break; >>+ >>+ /* >>+ * The PE is still in frozen state and we need clear > >to > Will fix, thanks. >>+ * that. It's good to clear frozen state after deassert >>+ * to avoid messy IO access during reset, which might >>+ * cause recursive frozen PE. >>+ */ >>+ ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO); >>+ if (!ret) >>+ ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA); >>+ if (!ret) >>+ eeh_pe_state_clear(pe, EEH_PE_ISOLATED); >>+ break; >>+ case EEH_RESET_HOT: >>+ case EEH_RESET_FUNDAMENTAL: >>+ ret = eeh_ops->reset(pe, option); >>+ break; >>+ default: >>+ pr_debug("%s: Unsupported option %d\n", >>+ __func__, option); >>+ ret = -EINVAL; >>+ } >>+ >>+ return ret; >>+} >>+EXPORT_SYMBOL_GPL(eeh_pe_reset); >>+ >>+/** >>+ * eeh_pe_configure - Configure PCI bridges after PE reset >>+ * @pdev: PCI device >>+ * >>+ * The routine is called to restore the PCI config space for >>+ * those PCI devices, especially PCI bridges affected by PE >>+ * reset issued previously. > >So would it make sense to combine this with the reset call? > I hope to keep it as it's one of the major steps to do error recovery. Lets keep it. >>+ */ >>+int eeh_pe_configure(struct pci_dev *pdev) >>+{ >>+ struct eeh_dev *edev; >>+ struct eeh_pe *pe; >>+ int ret = 0; >>+ >>+ /* Device existing ? */ >>+ ret = eeh_dev_check(pdev, &edev, &pe); >>+ if (ret) >>+ return ret; >>+ >>+ /* Restore config space for the affected devices */ >>+ eeh_pe_restore_bars(pe); >>+ >>+ return ret; >>+} >>+EXPORT_SYMBOL_GPL(eeh_pe_configure); >>+ >> static int proc_eeh_show(struct seq_file *m, void *v) >> { >> if (!eeh_enabled()) { >>diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c >>index 7ba0424..301ac18 100644 >>--- a/drivers/vfio/pci/vfio_pci.c >>+++ b/drivers/vfio/pci/vfio_pci.c >>@@ -25,6 +25,9 @@ >> #include <linux/types.h> >> #include <linux/uaccess.h> >> #include <linux/vfio.h> >>+#ifdef CONFIG_EEH >>+#include <asm/eeh.h> >>+#endif >> #include "vfio_pci_private.h" >>@@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) >> pci_restore_state(pdev); >> } >>+static void vfio_eeh_pci_release(struct pci_dev *pdev) >>+{ >>+#ifdef CONFIG_EEH >>+ eeh_dev_release(pdev); >>+#endif >>+} >>+ >> static void vfio_pci_release(void *device_data) >> { >> struct vfio_pci_device *vdev = device_data; >>- if (atomic_dec_and_test(&vdev->refcnt)) >>+ if (atomic_dec_and_test(&vdev->refcnt)) { >>+ vfio_eeh_pci_release(vdev->pdev); >> vfio_pci_disable(vdev); >>+ } >> module_put(THIS_MODULE); >> } >>+static int vfio_eeh_pci_open(struct pci_dev *pdev) >>+{ >>+ int ret = 0; >>+ >>+#ifdef CONFIG_EEH >>+ ret = eeh_dev_open(pdev); >>+#endif >>+ return ret; >>+} >>+ >> static int vfio_pci_open(void *device_data) >> { >> struct vfio_pci_device *vdev = device_data; >>+ int ret; >> if (!try_module_get(THIS_MODULE)) >> return -ENODEV; >> if (atomic_inc_return(&vdev->refcnt) == 1) { >>- int ret = vfio_pci_enable(vdev); >>- if (ret) { >>- module_put(THIS_MODULE); >>- return ret; >>- } >>+ ret = vfio_pci_enable(vdev); >>+ if (ret) >>+ goto error; >>+ >>+ ret = vfio_eeh_pci_open(vdev->pdev); >>+ if (ret) >>+ goto error; >> } >> return 0; >>+error: >>+ module_put(THIS_MODULE); >>+ return ret; >> } >> static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type) >>@@ -321,6 +349,91 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev, >> return walk.ret; >> } >>+static int vfio_eeh_pci_ioctl(struct pci_dev *pdev, >>+ unsigned int cmd, >>+ unsigned long arg) >>+{ >>+ unsigned long minsz; >>+ int ret = 0; >>+ >>+#ifdef CONFIG_EEH >>+ switch (cmd) { >>+ case VFIO_EEH_PE_SET_OPTION: { >>+ struct vfio_eeh_pe_set_option option; >>+ >>+ minsz = offsetofend(struct vfio_eeh_pe_set_option, option); >>+ if (copy_from_user(&option, (void __user *)arg, minsz)) >>+ return -EFAULT; >>+ if (option.argsz < minsz) >>+ return -EINVAL; >>+ >>+ ret = eeh_pe_set_option(pdev, option.option); >>+ break; >>+ } >>+ case VFIO_EEH_PE_GET_ADDR: { >>+ struct vfio_eeh_pe_get_addr addr; >>+ >>+ minsz = offsetofend(struct vfio_eeh_pe_get_addr, info); >>+ if (copy_from_user(&addr, (void __user *)arg, minsz)) >>+ return -EFAULT; >>+ if (addr.argsz < minsz) >>+ return -EINVAL; >>+ >>+ ret = eeh_pe_get_addr(pdev, addr.option); >>+ if (ret >= 0) { >>+ addr.info = ret; >>+ if (copy_to_user((void __user *)arg, &addr, minsz)) >>+ return -EFAULT; >>+ ret = 0; >>+ } >>+ >>+ break; >>+ } >>+ case VFIO_EEH_PE_GET_STATE: { >>+ struct vfio_eeh_pe_get_state state; >>+ >>+ minsz = offsetofend(struct vfio_eeh_pe_get_state, state); >>+ if (copy_from_user(&state, (void __user *)arg, minsz)) >>+ return -EFAULT; >>+ if (state.argsz < minsz) >>+ return -EINVAL; >>+ >>+ ret = eeh_pe_get_state(pdev); >>+ if (ret >= 0) { >>+ state.state = ret; >>+ if (copy_to_user((void __user *)arg, &state, minsz)) >>+ return -EFAULT; >>+ ret = 0; >>+ } >>+ break; >>+ } >>+ case VFIO_EEH_PE_RESET: { >>+ struct vfio_eeh_pe_reset reset; >>+ >>+ minsz = offsetofend(struct vfio_eeh_pe_reset, option); >>+ if (copy_from_user(&reset, (void __user *)arg, minsz)) >>+ return -EFAULT; >>+ if (reset.argsz < minsz) >>+ return -EINVAL; >>+ >>+ ret = eeh_pe_reset(pdev, reset.option); >>+ break; >>+ } >>+ case VFIO_EEH_PE_CONFIGURE: >>+ ret = eeh_pe_configure(pdev); >>+ break; >>+ default: >>+ ret = -EINVAL; >>+ pr_debug("%s: Cannot handle command %d\n", >>+ __func__, cmd); >>+ } >>+#else >>+ ret = -ENOENT; >>+#endif >>+ >>+ return ret; >>+} >>+ >> static long vfio_pci_ioctl(void *device_data, >> unsigned int cmd, unsigned long arg) >> { >>@@ -682,6 +795,12 @@ hot_reset_release: >> kfree(groups); >> return ret; >>+ } else if (cmd == VFIO_EEH_PE_SET_OPTION || >>+ cmd == VFIO_EEH_PE_GET_ADDR || >>+ cmd == VFIO_EEH_PE_GET_STATE || >>+ cmd == VFIO_EEH_PE_RESET || >>+ cmd == VFIO_EEH_PE_CONFIGURE) { >>+ return vfio_eeh_pci_ioctl(vdev->pdev, cmd, arg); >> } >> return -ENOTTY; >>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h >>index cb9023d..ef55682 100644 >>--- a/include/uapi/linux/vfio.h >>+++ b/include/uapi/linux/vfio.h >>@@ -455,6 +455,59 @@ struct vfio_iommu_spapr_tce_info { >> #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) >>+/* >>+ * EEH functionality can be enabled or disabled on one specific device. >>+ * Also, the DMA or IO frozen state can be removed from the frozen PE >>+ * if required. >>+ */ >>+struct vfio_eeh_pe_set_option { >>+ __u32 argsz; > >What is this argsz thing? Is this your way of maintaining backwards >compatibility when we introduce new fields? A new field will change >the ioctl number, so I don't think that makes a lot of sense :). > >Just make the ioctl have a u32 as incoming argument. No fancy >structs, no complicated code. > >The same applies for a number of structs below. > ok. Will do in next revision. >>+ __u32 option; >>+}; >>+ >>+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21) >>+ >>+/* >>+ * Each EEH PE should have unique address to be identified. The command >>+ * helps to retrieve the address and the sharing mode of the PE. >>+ */ >>+struct vfio_eeh_pe_get_addr { >>+ __u32 argsz; >>+ __u32 option; >>+ __u32 info; > >Any particular reason you need the info field? Can't the return value >of the ioctl hold this? Then you only have a single u32 argument left >to the ioctl again. > ok. Will do in next revision. >>+}; >>+ >>+#define VFIO_EEH_PE_GET_ADDR _IO(VFIO_TYPE, VFIO_BASE + 22) >>+ >>+/* >>+ * EEH PE might have been frozen because of PCI errors. Also, it might >>+ * be experiencing reset for error revoery. The following command helps >>+ * to get the state. >>+ */ >>+struct vfio_eeh_pe_get_state { >>+ __u32 argsz; >>+ __u32 state; >>+}; >>+ >>+#define VFIO_EEH_PE_GET_STATE _IO(VFIO_TYPE, VFIO_BASE + 23) >>+ >>+/* >>+ * Reset is the major step to recover problematic PE. The following >>+ * command helps on that. >>+ */ >>+struct vfio_eeh_pe_reset { >>+ __u32 argsz; >>+ __u32 option; >>+}; >>+ >>+#define VFIO_EEH_PE_RESET _IO(VFIO_TYPE, VFIO_BASE + 24) >>+ >>+/* >>+ * One of the steps for recovery after PE reset is to configure the >>+ * PCI bridges affected by the PE reset. >>+ */ >>+#define VFIO_EEH_PE_CONFIGURE _IO(VFIO_TYPE, VFIO_BASE + 25) >>+ >> /* ***************************************************************** */ >> #endif /* _UAPIVFIO_H */ Thanks, Gavin -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html