Hi Alex, Apologies for the delay. This was due to two parts: * I studied these code paths more and I think there may be one more flow that can be made more robust (more later), * I spent a lot of time trying to trigger all the affected code paths but here I did not have much luck. I would prefer to run tests that trigger these flows so that I can get some help from the kernel lock debugging. From what I can tell the irqfd flows can be triggered with the help of the kernel-irqchip option to Qemu but on the hardware I have access to I could only try kernel-irqchip=auto and that did not trigger the flows (the irqfd is set up but the eventfd is never signaled). I am not familiar with this area, do you perhaps have guidance on how I can test these flows or do you perhaps have access to needed environments to test this? On 2/8/2024 1:08 PM, Alex Williamson wrote: > On Wed, 7 Feb 2024 15:30:15 -0800 > Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote: >> I studied the code more and have one more observation related to this portion >> of the flow: >> From what I can tell this change makes the INTx code more robust. If I >> understand current implementation correctly it seems possible to enable >> INTx but not have interrupt allocated. In this case the interrupt context >> (ctx) will exist but ctx->trigger will be NULL. Current >> vfio_pci_set_intx_trigger()->vfio_send_intx_eventfd() only checks if >> ctx is valid. It looks like it may call eventfd_signal(NULL) where >> pointer is dereferenced. >> >> If this is correct then I think a separate fix that can easily be >> backported may be needed. Something like: > > Good find. I think it's a bit more complicated though. There are > several paths to vfio_send_intx_eventfd: > > - vfio_intx_handler > > This can only be called between request_irq() and free_irq() > where trigger is always valid. igate is not held. > > - vfio_pci_intx_unmask > > Callers hold igate, additional test of ctx->trigger makes this > safe. Two callers of vfio_pci_intx_unmask() do not seem to hold igate: vfio_basic_config_write() and vfio_pci_core_runtime_resume(). Considering this I wonder if we could add something like below to the solution you propose. On a high level the outside callers (VFIO PCI core) will keep using vfio_pci_intx_unmask() that will now take igate while the interrupt management code gets a new internal __vfio_pci_intx_unmask() that should be called with igate held. This results in: @@ -215,12 +223,20 @@ static int vfio_pci_intx_unmask_handler_virqfd(void *opaque, void *unused) return ret; } -void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev) +static void __vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev) { + lockdep_assert_held(&vdev->igate); if (vfio_pci_intx_unmask_handler(vdev, NULL) > 0) vfio_send_intx_eventfd(vdev, NULL); } +void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev) +{ + mutex_lock(&vdev->igate); + __vfio_pci_intx_unmask(vdev); + mutex_unlock(&vdev->igate); +} + static irqreturn_t vfio_intx_handler(int irq, void *dev_id) { struct vfio_pci_core_device *vdev = dev_id; @@ -581,11 +597,11 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev, return -EINVAL; if (flags & VFIO_IRQ_SET_DATA_NONE) { - vfio_pci_intx_unmask(vdev); + __vfio_pci_intx_unmask(vdev); } else if (flags & VFIO_IRQ_SET_DATA_BOOL) { uint8_t unmask = *(uint8_t *)data; if (unmask) - vfio_pci_intx_unmask(vdev); + __vfio_pci_intx_unmask(vdev); } else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) { struct vfio_pci_irq_ctx *ctx = vfio_irq_ctx_get(vdev, 0); int32_t fd = *(int32_t *)data; > > - vfio_pci_set_intx_trigger > > Same as above. > > - Through unmask eventfd (virqfd) > > Here be dragons. > > In the virqfd case, a write to the eventfd calls virqfd_wakeup() where > we'll call the handler, vfio_pci_intx_unmask_handler(), and based on > the result schedule the thread, vfio_send_intx_eventfd(). Both of > these look suspicious. They're not called under igate, so testing > ctx->trigger doesn't resolve the race. > > I think an option is to wrap the virqfd entry points in igate where we > can then do something similar to your suggestion. I don't think we > want to WARN_ON(!ctx->trigger) because that's then a user reachable > condition. Instead we can just quietly follow the same exit paths. > > I think that means we end up with something like this: > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > index 237beac83809..ace7e1dbc607 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -92,12 +92,21 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) > struct vfio_pci_irq_ctx *ctx; > > ctx = vfio_irq_ctx_get(vdev, 0); > - if (WARN_ON_ONCE(!ctx)) > + if (WARN_ON_ONCE(!ctx) || !ctx->trigger) > return; > eventfd_signal(ctx->trigger); > } > } > > +static void vfio_send_intx_eventfd_virqfd(void *opaque, void *unused) > +{ > + struct vfio_pci_core_device *vdev = opaque; > + > + mutex_lock(&vdev->igate); > + vfio_send_intx_eventfd(opaque, unused); > + mutex_unlock(&vdev->igate); > +} > + > /* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */ > bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > { > @@ -170,7 +179,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused) > } > > ctx = vfio_irq_ctx_get(vdev, 0); > - if (WARN_ON_ONCE(!ctx)) > + if (WARN_ON_ONCE(!ctx) || !ctx->trigger) > goto out_unlock; > > if (ctx->masked && !vdev->virq_disabled) { > @@ -194,6 +203,18 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused) > return ret; > } > > +static int vfio_pci_intx_unmask_handler_virqfd(void *opaque, void *unused) > +{ > + struct vfio_pci_core_device *vdev = opaque; > + int ret; > + > + mutex_lock(&vdev->igate); > + ret = vfio_pci_intx_unmask_handler(opaque, unused); > + mutex_unlock(&vdev->igate); > + > + return ret; > +} > + > void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev) > { > if (vfio_pci_intx_unmask_handler(vdev, NULL) > 0) > @@ -572,10 +593,10 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev, > if (WARN_ON_ONCE(!ctx)) > return -EINVAL; > if (fd >= 0) > - return vfio_virqfd_enable((void *) vdev, > - vfio_pci_intx_unmask_handler, > - vfio_send_intx_eventfd, NULL, > - &ctx->unmask, fd); > + return vfio_virqfd_enable((void *)vdev, > + vfio_pci_intx_unmask_handler_virqfd, > + vfio_send_intx_eventfd_virqfd, NULL, > + &ctx->unmask, fd); > > vfio_virqfd_disable(&ctx->unmask); > } > > > WDYT? This looks good to me. Thank you very much for taking the time to write it. Reinette