When a device is driven by userspace (as in the case of KVM's device
assignment), it is essential to provide proper PCI error handling support to
the corresponding driver.

Implementation
==============
The PCI error stub driver is implemented on top of the uio framework. PCI
errors are reported to userspace by signaling on an eventfd. Error codes and
error results are exposed under a 'logical' BAR that can be mmap'ed. Userspace
acknowledges the kernel using another eventfd, which is consumed internally by
the driver in a fashion similar to KVM's irqfd.

The uio implementation accepts 32-bit writes to the device file. This
interface is used to multiplex the passing of both eventfds to the driver:
the first write registers the eventfd that the kernel signals, and the second
write registers the eventfd that userspace uses to acknowledge.

The guest-visible interface is not yet defined. In theory it could be
abstracted as part of the Q35 PCIe chipset, or it could be defined as a simple
QEMU device model.

From a performance perspective, the error signaling path can reach the guest
directly using KVM's irqfd. Both the error code and the error result can be
made directly available to the guest. The guest's acknowledgment can signal
directly on an ioeventfd.

Background information
======================
Error handling happens in two phases: (1) Reporting, (2) Recovery.

Phase 1, Reporting (.error_detected callback)
=============================================
The purpose of the reporting phase is to A) notify the driver so that it can
stop all I/O access to the device quickly and B) provide an exit path from
I/O spinloops or inconsistent internal state. The kernel assumes that once the
corresponding '.error_detected' callback returns, the driver is in a quiescent
state. Ideally, interrupts from the device should also be disabled.

If the driver doesn't respond to the '.error_detected' callback in a timely
fashion, posted writes may accumulate until they choke the PCIe switch and
eventually the CPU complex.
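
To make the userspace side of the interface described under Implementation
more concrete, here is a minimal sketch of a process that registers its two
eventfds, maps the logical BAR, and answers each notification in lockstep. It
is untested against the driver and rests on assumptions: the struct layout and
enum values are copied from the uio_pci_stub.h added by this patch; the helper
names (stub_register_eventfd, stub_pick_result, stub_answer, stub_serve) and
the uio device path passed in are hypothetical; and the answers chosen in
stub_pick_result are only one sequence that the driver's callbacks accept, not
a recommended policy.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

/* Values and layout copied from uio_pci_stub.h in this patch. */
enum { ERROR_DETECTED, MMIO_ENABLED, LINK_RESET, SLOT_RESET, RESUME };
enum { RESULT_NONE = 1, RESULT_CAN_RECOVER, RESULT_NEED_RESET,
       RESULT_DISCONNECT, RESULT_RECOVERED };

struct pci_stub_bar0 {
	unsigned int err_code;
	unsigned int err_code_ack;
	unsigned int err_result;
	unsigned int vendor;
	unsigned int device;
	unsigned int busnr;
	unsigned int devnr;
	unsigned int fcnnr;
};

/* Hypothetical helper: hand one eventfd to the driver by writing its
 * descriptor number as a 32-bit value to the uio device file. */
static int stub_register_eventfd(int uio_fd, int efd)
{
	int32_t value = efd;

	return write(uio_fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Hypothetical policy: one answer per callback that the driver accepts. */
static unsigned int stub_pick_result(unsigned int err_code)
{
	switch (err_code) {
	case ERROR_DETECTED:
		return RESULT_NEED_RESET;
	case MMIO_ENABLED:
	case LINK_RESET:
	case SLOT_RESET:
		return RESULT_RECOVERED;
	default:		/* RESUME: the driver ignores the result */
		return RESULT_NONE;
	}
}

/* Publish the answer in the shared page. The ack must echo err_code,
 * otherwise the driver reports "out of sync" and takes its default path. */
static void stub_answer(volatile struct pci_stub_bar0 *bar)
{
	unsigned int code = bar->err_code;

	bar->err_result = stub_pick_result(code);
	bar->err_code_ack = code;
}

/* Service loop. It only returns on a setup or I/O failure, hence -1. */
int stub_serve(const char *uio_path)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	uint64_t count, one = 1;
	void *page = MAP_FAILED;
	int uio_fd = open(uio_path, O_RDWR);
	int efd_out = eventfd(0, 0);	/* kernel -> userspace: error reported */
	int efd_in = eventfd(0, 0);	/* userspace -> kernel: acknowledge */

	if (uio_fd < 0 || efd_out < 0 || efd_in < 0)
		goto out;
	/* Order matters: the first write is eventfd_out, the second eventfd_in. */
	if (stub_register_eventfd(uio_fd, efd_out) ||
	    stub_register_eventfd(uio_fd, efd_in))
		goto out;
	page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, uio_fd, 0);
	if (page == MAP_FAILED)
		goto out;
	/* Block until the kernel signals, answer, then acknowledge. */
	while (read(efd_out, &count, sizeof(count)) == (ssize_t)sizeof(count)) {
		stub_answer(page);
		if (write(efd_in, &one, sizeof(one)) != (ssize_t)sizeof(one))
			break;
	}
out:
	if (page != MAP_FAILED)
		munmap(page, pagesz);
	if (efd_in >= 0)
		close(efd_in);
	if (efd_out >= 0)
		close(efd_out);
	if (uio_fd >= 0)
		close(uio_fd);
	return -1;
}
```

A real consumer would presumably run something like stub_serve() from a
dedicated thread and quiesce the device before answering ERROR_DETECTED; the
sketch only shows the handshake itself.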

Phase 2, Recovery (.mmio_enabled .link_reset .slot_reset .resume)
=================================================================
The purpose of the recovery phase is to provide a mechanism for the driver to
'reconnect' with the device in a sequenced fashion. The time it takes to
perform the recovery process usually doesn't impact the host.

Reporting and recovery in userspace
===================================
Both the error reporting and error recovery phases are driven by a state
machine where the next state depends on the value returned by the current
callback. Depending on the state the state machine is in, the host will
perform various operations such as 'slot reset' or 'link reset'. Because of
this, the only way to maintain consistency between the device and its
corresponding driver in userspace is to perform both phases in lockstep with
the host.

The need for a policy; Error handling in userspace?
===================================================
When the 'policy' is set to 'paranoid', the host kernel never relies on
userspace; as soon as an error related to the device is detected, the
corresponding driver's process is terminated.

When the 'policy' is set to 'strict', the entire reporting and recovery
phases must succeed in lockstep. A timeout in the lockstep sequence terminates
the corresponding driver's process.

The 'lazy' mode is similar to the 'strict' mode except that if a timeout
happens in the lockstep sequence, the host kernel will take a default path and
keep the process up and running.

Considerations
==============
- This is work in progress. Minimal testing. Looking for comments/ideas.
- For performance reasons, the 'lazy' mode is interesting because it may very
  well operate in an asynchronous fashion without the expensive lockstep mode.
  Obviously, the driver's expected recovery sequence _must_ match the default
  state machine sequence on the host.
- Assuming that the error recovery is directed by qemu/Q35 PCIe, how do we get
  the recovery action for every error callback, given that AER handling
  'merges' the callback return codes from multiple sources? It's hard to
  determine what value should be given back to the host for the corresponding
  assigned device.
- Upon slot_reset, who should restore the device's config space? Can the
  guest's driver do it?
- Should suspend and resume for assigned devices also be part of this driver?
- The only policy implemented is 'lazy'.

thanks,
-Etienne

Signed-off-by: Etienne Martineau <etmartin@xxxxxxxxx>
---
 drivers/uio/Kconfig          |   11 +
 drivers/uio/Makefile         |    2 +
 drivers/uio/uio_pci_stub.c   |  558 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild         |    1 +
 include/linux/uio_pci_stub.h |   46 ++++
 5 files changed, 618 insertions(+), 0 deletions(-)
 create mode 100644 drivers/uio/uio_pci_stub.c
 create mode 100644 include/linux/uio_pci_stub.h

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index bb44079..d381c0e 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -93,5 +93,16 @@ config UIO_NETX
 	  To compile this driver as a module, choose M here; the module
 	  will be called uio_netx.
+
+config UIO_PCI_STUB
+	tristate "PCI error stub driver"
+	depends on PCI
+	help
+	  Say Y or M here if you want be able to reserve a PCI device
+	  when it is going to be assigned to a guest operating system and
+	  you want the option to notify userspace in case where the
+	  device report a PCI error.
+
+	  When in doubt, say N.
 endif
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 18fd818..50c600a 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -6,3 +6,5 @@ obj-$(CONFIG_UIO_AEC) += uio_aec.o
 obj-$(CONFIG_UIO_SERCOS3) += uio_sercos3.o
 obj-$(CONFIG_UIO_PCI_GENERIC) += uio_pci_generic.o
 obj-$(CONFIG_UIO_NETX) += uio_netx.o
+obj-$(CONFIG_UIO_PCI_STUB) += uio_pci_stub.o
+
diff --git a/drivers/uio/uio_pci_stub.c b/drivers/uio/uio_pci_stub.c
new file mode 100644
index 0000000..a5c4bbe
--- /dev/null
+++ b/drivers/uio/uio_pci_stub.c
@@ -0,0 +1,558 @@
+/*
+ * uio_pci_stub.c - PCI error stub driver
+ *
+ * Copyright (C) 2010 Cisco Systems
+ * Author: Etienne Martineau <etmartin@xxxxxxxxx>
+ *
+ * Derived from drivers/pci/pci-stub.c by Chris Wright,
+ * Copyright (C) 2008 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Usage is simple, allocate a new id to the uio_pci_stub driver and bind the
+ * device to it. For example:
+ *
+ * # echo "8086 10f5" > /sys/bus/pci/drivers/uio_pci_stub/new_id
+ * # echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
+ * # echo -n 0000:00:19.0 > /sys/bus/pci/drivers/uio_pci_stub/bind
+ * # ls -l /sys/bus/pci/devices/0000:00:19.0/driver
+ * .../0000:00:19.0/driver -> ../../../bus/pci/drivers/uio_pci_stub
+ *
+ * Limitation:
+ *	No support for suspend and resume.
+ *	The only policy implemented is LAZY
+ */
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/semaphore.h>
+#include <linux/pci.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/eventfd.h>
+#include <linux/uio_driver.h>
+#include <linux/uio_pci_stub.h>
+
+#define DRIVER_VERSION "0.02"
+#define DRIVER_AUTHOR "Etienne Martineau <etmartin@xxxxxxxxx>"
+#define DRIVER_DESC "PCI error stub driver"
+#define DPRINTK(fmt, args...) \
+	do{ \
+		if(debug) \
+			printk(KERN_DEBUG "%s: " fmt, __func__ , ## args); \
+} while (0)
+
+static int debug=0;
+MODULE_PARM_DESC(debug, "");
+module_param(debug, bool, 0644);
+
+static int timeout=DEFAULT_LOCKSTEP_TIMEOUT;
+MODULE_PARM_DESC(timeout, "Lock step sequence timeout in msec (default=10)");
+module_param(timeout, int, 0644);
+
+static int policy=LAZY;
+MODULE_PARM_DESC(policy, "Error handling policy (default=LAZY) "
+	"PARANOID=1, STRICT=2, LAZY=3");
+module_param(policy, int, 0644);
+
+static const char * err_to_str[]={
+	[ERROR_DETECTED] = "error_detected",
+	[MMIO_ENABLED] = "mmio_enabled",
+	[LINK_RESET] = "link_reset",
+	[SLOT_RESET] = "slot_reset",
+	[RESUME] = "resume",
+};
+
+static const char * result_to_str[]={
+	[PCI_ERS_RESULT_NONE] = "pci_ers_result_none",
+	[PCI_ERS_RESULT_CAN_RECOVER] = "pci_ers_result_can_recover",
+	[PCI_ERS_RESULT_NEED_RESET] = "pci_ers_result_need_reset",
+	[PCI_ERS_RESULT_DISCONNECT] = "pci_ers_result_disconnect",
+	[PCI_ERS_RESULT_RECOVERED] = "pci_ers_result_recovered",
+};
+
+struct pci_stub_priv {
+	atomic_t refcount;
+	spinlock_t lock;
+	poll_table pt;
+	wait_queue_t wait;
+	wait_queue_head_t remove_wait;
+	struct semaphore usersync_lock;
+	struct eventfd_ctx *eventfd_out;
+	struct eventfd_ctx *eventfd_in;
+	char name[UIO_MAX_NAME_SIZE];
+};
+
+static int pci_stub_ack(struct pci_stub_priv *priv)
+{
+	int dummy;
+
+	/* We don't want to post on the semaphore more that one time per
+	 * 'error cycle'. If the lock cannot be acquired it means that
+	 * it's ready to be signaled.
+	 * If the lock is acquired give it back */
+	dummy=down_trylock(&priv->usersync_lock);
+	up(&priv->usersync_lock);
+	return 0;
+}
+
+static int pci_stub_wakeup(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct pci_stub_priv *priv = container_of(wait,
+		struct pci_stub_priv, wait);
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLIN){
+		/* An event has been signaled */
+		pci_stub_ack(priv);
+	}
+
+	if (flags & POLLHUP) {
+		DPRINTK("POOLHUP");
+		/* The eventfd is closing */
+	}
+
+	return 0;
+}
+
+static void pci_stub_ptable(struct file *file, wait_queue_head_t *wqh,
+		poll_table *pt)
+{
+	struct pci_stub_priv *priv = container_of(pt,
+		struct pci_stub_priv, pt);
+	add_wait_queue(wqh, &priv->wait);
+}
+
+static int pci_stub_control(struct uio_info *info, s32 value)
+{
+	int err=0;
+	struct pci_stub_priv *priv = info->priv;
+	struct eventfd_ctx *eventfd;
+	struct file *file = NULL;
+	unsigned int events;
+
+	if(policy == PARANOID)
+		return -EINVAL;
+
+	if(value <=0 || value>=INR_OPEN)
+		return -EINVAL;
+
+	file = eventfd_fget(value);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	eventfd = eventfd_ctx_fileget(file);
+	if (IS_ERR(eventfd)){
+		fput(file);
+		return PTR_ERR(eventfd);
+	}
+
+	spin_lock(&priv->lock);
+
+	if(priv->eventfd_out && priv->eventfd_in){
+		eventfd_ctx_put(eventfd);
+		err = -EEXIST;
+		goto out;
+	}
+
+	/* Warning, the interface is hidden underneath a generic write.
+	 * The first write is for eventfd_out and the second is for eventfd_in */
+	if(!priv->eventfd_out){
+		DPRINTK("register eventfd_out");
+		priv->eventfd_out = eventfd;
+		err=0;
+		goto out;
+	}
+	else if(!priv->eventfd_in){
+		DPRINTK("register eventfd_in");
+		priv->eventfd_in = eventfd;
+		events = file->f_op->poll(file, &priv->pt);
+		if (events & POLLIN)
+			pci_stub_ack(priv);
+		err=0;
+		goto out;
+	}
+	else
+		err=-EBUSY;
+
+out:
+	spin_unlock(&priv->lock);
+	fput(file);
+	return err;
+}
+
+static int pci_stub_open(struct uio_info *info, struct inode *inode)
+{
+	struct pci_stub_priv *priv = info->priv;
+	atomic_inc(&priv->refcount);
+	return 0;
+}
+
+static int pci_stub_release(struct uio_info *info, struct inode *inode)
+{
+	struct pci_stub_priv *priv = info->priv;
+
+	if (!atomic_dec_and_test(&priv->refcount))
+		return 0;
+
+	spin_lock(&priv->lock);
+
+	/* Drop all we have when we reach the last reference to the UIO fd */
+	if(priv->eventfd_out){
+		DPRINTK("release eventfd_out");
+		eventfd_ctx_put(priv->eventfd_out);
+		priv->eventfd_out = NULL;
+	}
+	if(priv->eventfd_in){
+		DPRINTK("release eventfd_in");
+		eventfd_ctx_put(priv->eventfd_in);
+		priv->eventfd_in = NULL;
+	}
+
+	spin_unlock(&priv->lock);
+
+	wake_up(&priv->remove_wait);
+
+	return 0;
+}
+
+static int pci_stub_bar_setup(struct uio_info *info, int n)
+{
+	void *ptr;
+
+	ptr = (void*)get_zeroed_page(GFP_KERNEL);
+	if(!ptr)
+		return -ENOMEM;
+
+	info->mem[n].addr = (unsigned long)ptr;
+	info->mem[n].size = PAGE_SIZE;
+	info->mem[n].memtype = UIO_MEM_LOGICAL;
+	return 0;
+}
+
+static void pci_stub_bar_release(struct uio_info *info, int n)
+{
+	if(info->mem[n].addr)
+		free_pages((long unsigned int)info->mem[n].addr,0);
+}
+
+static int __devinit probe(struct pci_dev *dev,
+		const struct pci_device_id *id)
+{
+	int ret = -ENODEV;
+	struct uio_info *info;
+	struct pci_stub_priv *priv;
+
+	info = kzalloc(sizeof(struct uio_info), GFP_KERNEL);
+	if (!info){
+		ret = -ENOMEM;
+		goto bad;
+	}
+
+	priv = kzalloc(sizeof(struct pci_stub_priv), GFP_KERNEL);
+	if (!priv){
+		ret = -ENOMEM;
+		goto bad1;
+	}
+
+	ret = pci_stub_bar_setup(info, 0);
+	if(ret)
+		goto bad2;
+
+	info->priv = priv;
+	info->version = DRIVER_VERSION;
+	info->irqcontrol = pci_stub_control;
+	info->irq = UIO_IRQ_CUSTOM;
+	info->open = pci_stub_open;
+	info->release = pci_stub_release;
+
+	snprintf(priv->name, UIO_MAX_NAME_SIZE,
+		FORMAT_UIO_DEV_NAME(dev->bus->number, PCI_SLOT(dev->devfn),
+		PCI_FUNC(dev->devfn), id->vendor, id->device));
+	info->name = priv->name;
+
+	atomic_set(&priv->refcount,0);
+
+	priv->eventfd_out=NULL;
+	priv->eventfd_in=NULL;
+	spin_lock_init(&priv->lock);
+
+	sema_init(&priv->usersync_lock,0);
+
+	init_waitqueue_head(&priv->remove_wait);
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd
+	 */
+	init_waitqueue_func_entry(&priv->wait, pci_stub_wakeup);
+	init_poll_funcptr(&priv->pt, pci_stub_ptable);
+
+	pci_set_drvdata(dev, info);
+
+	ret = uio_register_device(&dev->dev, info);
+	if(ret)
+		goto bad3;
+
+	dev_printk(KERN_INFO, &dev->dev, "claimed by uio_pci_stub\n");
+	return 0;
+
+bad3:
+	pci_stub_bar_release(info, 0);
+bad2:
+	kfree(priv);
+bad1:
+	kfree(info);
+bad:
+	return ret;
+}
+
+static void remove(struct pci_dev *dev)
+{
+	struct uio_info *info = pci_get_drvdata(dev);
+	struct pci_stub_priv *priv = info->priv;
+
+	/* Don't let the driver de-instantiate while userspace still has
+	 * file reference pending otherwise bad things can happen... */
+	if (atomic_read(&priv->refcount))
+		wait_event(priv->remove_wait,atomic_read(&priv->refcount)==0);
+
+	uio_unregister_device(info);
+	pci_set_drvdata(dev, NULL);
+	pci_stub_bar_release(info, 0);
+	kfree(info->priv);
+	kfree(info);
+	dev_printk(KERN_INFO, &dev->dev, "released by uio_pci_stub\n");
+}
+
+/*
+ * For every pci error handlers invoked, userspace is notified. Error codes
+ * are publish under pci_stub_bar0.
+ *
+ * After each notification, Kernel will wait for user space to provide
+ * the pci error result also under pci_stub_bar0. Upon timeout, kernel
+ * takes default action.
+ */
+static pci_ers_result_t pci_stub_notify(u32 err_code, struct pci_dev *pdev)
+{
+	int err;
+	pci_ers_result_t pci_result;
+	struct uio_info *info = pci_get_drvdata(pdev);
+	struct pci_stub_priv *priv = info->priv;
+	struct pci_stub_bar0 *bar = (struct pci_stub_bar0 *)info->mem[0].addr;
+
+	if(!atomic_read(&priv->refcount)){
+		DPRINTK("offline");
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	spin_lock(&priv->lock);
+
+	/* Make sure we have a valid ref count to eventfd_in and eventfd_out */
+	if(!priv->eventfd_out || !priv->eventfd_in){
+		spin_unlock(&priv->lock);
+		DPRINTK("not initialized");
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	bar->err_code = err_code;
+	bar->vendor = pdev->vendor;
+	bar->device = pdev->device;
+	bar->busnr = pdev->bus->number;
+	bar->devnr = PCI_SLOT(pdev->devfn);
+	bar->fcnnr = PCI_FUNC(pdev->devfn);
+
+	/* Signal userspace */
+	eventfd_signal(priv->eventfd_out,1);
+
+	spin_unlock(&priv->lock);
+
+	DPRINTK("%s",err_to_str[err_code]);
+
+	/* Best effort to sync with userspace */
+	err = down_timeout(&priv->usersync_lock, msecs_to_jiffies(timeout));
+
+	if(err){
+		printk(KERN_INFO "sync timeout");
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	if(bar->err_code_ack != err_code){
+		printk(KERN_INFO "out of sync");
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	/* Sanity check and latch the result */
+	switch(bar->err_result){
+	case RESULT_NONE:
+		pci_result = PCI_ERS_RESULT_NONE;
+		break;
+	case RESULT_CAN_RECOVER:
+		pci_result = PCI_ERS_RESULT_CAN_RECOVER;
+		break;
+	case RESULT_NEED_RESET:
+		pci_result = PCI_ERS_RESULT_NEED_RESET;
+		break;
+	case RESULT_DISCONNECT:
+		pci_result = PCI_ERS_RESULT_DISCONNECT;
+		break;
+	case RESULT_RECOVERED:
+		pci_result = PCI_ERS_RESULT_RECOVERED;
+		break;
+	default:
+		pci_result = PCI_ERS_RESULT_NONE;
+	}
+	DPRINTK("%s",result_to_str[pci_result]);
+	return pci_result;
+}
+
+/* ------------------ PCI Error Recovery infrastructure -------------- */
+
+/**
+ * error_detected - called when PCI error is detected.
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ */
+static pci_ers_result_t error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+	pci_ers_result_t err;
+	err = pci_stub_notify(ERROR_DETECTED, pdev);
+	switch(err){
+	case PCI_ERS_RESULT_NONE:
+	case PCI_ERS_RESULT_CAN_RECOVER:
+	case PCI_ERS_RESULT_NEED_RESET:
+	case PCI_ERS_RESULT_DISCONNECT:
+		break;
+	default:
+		printk(KERN_INFO "pci_stub: default action in %s",err_to_str[ERROR_DETECTED]);
+		return PCI_ERS_RESULT_NONE;
+	}
+	return err;
+}
+
+/**
+ * mmio_enabled
+ * MMIO has been re-enabled, but not DMA
+ */
+static pci_ers_result_t mmio_enabled(struct pci_dev *pdev)
+{
+	pci_ers_result_t err;
+	err = pci_stub_notify(MMIO_ENABLED, pdev);
+	switch(err){
+	case PCI_ERS_RESULT_NONE:
+	case PCI_ERS_RESULT_NEED_RESET:
+	case PCI_ERS_RESULT_RECOVERED:
+		break;
+	default:
+		printk(KERN_INFO "pci_stub: default action in %s",err_to_str[MMIO_ENABLED]);
+		return PCI_ERS_RESULT_NONE;
+	}
+	return err;
+}
+
+/**
+ * link_reset
+ * PCI Express link has been reset
+ */
+static pci_ers_result_t link_reset(struct pci_dev *pdev)
+{
+	pci_ers_result_t err;
+	err = pci_stub_notify(LINK_RESET, pdev);
+	switch(err){
+	case PCI_ERS_RESULT_NONE:
+	case PCI_ERS_RESULT_DISCONNECT:
+	case PCI_ERS_RESULT_RECOVERED:
+		break;
+	default:
+		printk(KERN_INFO "pci_stub: default action in %s",err_to_str[LINK_RESET]);
+		return PCI_ERS_RESULT_NONE;
+	}
+	return err;
+}
+
+/**
+ * slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch.
+ */
+static pci_ers_result_t slot_reset(struct pci_dev *pdev)
+{
+	pci_ers_result_t err;
+	err = pci_stub_notify(SLOT_RESET, pdev);
+	switch(err){
+	case PCI_ERS_RESULT_NONE:
+	case PCI_ERS_RESULT_DISCONNECT:
+	case PCI_ERS_RESULT_RECOVERED:
+		break;
+	default:
+		printk(KERN_INFO "pci_stub: default action in %s",err_to_str[SLOT_RESET]);
+		return PCI_ERS_RESULT_NONE;
+	}
+	return err;
+}
+
+/**
+ * resume - resume normal operations
+ * @pdev: Pointer to PCI device
+ *
+ * Resume normal operations after an error recovery
+ * sequence has been completed.
+ */
+static void resume(struct pci_dev *pdev)
+{
+	pci_stub_notify(RESUME, pdev);
+}
+
+static struct pci_error_handlers err_handler = {
+	.error_detected = error_detected,
+	.mmio_enabled = mmio_enabled,
+	.link_reset = link_reset,
+	.slot_reset = slot_reset,
+	.resume = resume,
+};
+
+static struct pci_driver driver = {
+	.name = "uio_pci_stub",
+	.id_table = NULL, /* only dynamic id's */
+	.probe = probe,
+	.remove = remove,
+	.err_handler = &err_handler,
+};
+
+static int __init init(void)
+{
+	if(timeout > 1000)
+		return -EINVAL;
+
+	switch(policy){
+	case LAZY:
+		break;
+	case PARANOID:
+	case STRICT:
+	default:
+		return -EINVAL;
+	}
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION ": timeout=%d, policy=%s",
+		timeout, policy==PARANOID?"paranoid":policy==STRICT?"strict":"lazy");
+
+	return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+	pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 97319a8..1f494e4 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ header-y += udf_fs_i.h
 header-y += udp.h
 header-y += uinput.h
 header-y += uio.h
+header-y += uio_pci_stub.h
 header-y += ultrasound.h
 header-y += un.h
 header-y += unistd.h
diff --git a/include/linux/uio_pci_stub.h b/include/linux/uio_pci_stub.h
new file mode 100644
index 0000000..389b3dc
--- /dev/null
+++ b/include/linux/uio_pci_stub.h
@@ -0,0 +1,46 @@
+#ifndef __LINUX_UIO_PCI_STUB_H
+#define __LINUX_UIO_PCI_STUB_H
+
+#ifndef UIO_MAX_NAME_SIZE
+#define UIO_MAX_NAME_SIZE 64
+#endif
+
+#define DEFAULT_LOCKSTEP_TIMEOUT 10
+
+#define FORMAT_UIO_DEV_NAME(vendorid,deviceid,busnr,dev,fcn)\
+	"%x:%x.%x %x:%x",vendorid,deviceid,busnr,dev,fcn
+
+enum pci_stub_error_policy{
+	PARANOID=1,
+	STRICT=2,
+	LAZY=3,
+};
+
+enum pci_stub_error_code{
+	ERROR_DETECTED,
+	MMIO_ENABLED,
+	LINK_RESET,
+	SLOT_RESET,
+	RESUME,
+};
+
+enum pci_stub_error_result{
+	RESULT_NONE=1,
+	RESULT_CAN_RECOVER=2,
+	RESULT_NEED_RESET=3,
+	RESULT_DISCONNECT=4,
+	RESULT_RECOVERED=5,
+};
+
+struct pci_stub_bar0 {
+	unsigned int err_code;
+	unsigned int err_code_ack;
+	unsigned int err_result;
+	unsigned int vendor;
+	unsigned int device;
+	unsigned int busnr;
+	unsigned int devnr;
+	unsigned int fcnnr;
+};
+
+#endif
--
1.7.0.4