On Sun, Oct 17, 2021 at 05:29:39PM +0300, Yishai Hadas wrote: > On 10/16/2021 12:12 AM, Alex Williamson wrote: > > On Fri, 15 Oct 2021 17:03:28 -0300 > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > > > On Fri, Oct 15, 2021 at 01:52:37PM -0600, Alex Williamson wrote: > > > > On Wed, 13 Oct 2021 12:47:06 +0300 > > > > Yishai Hadas <yishaih@xxxxxxxxxx> wrote: > > > > > Add infrastructure to let vfio_pci_core drivers trap device RESET. > > > > > > > > > > The motivation for this is to let the underlay driver be aware that > > > > > reset was done and set its internal state accordingly. > > > > I think the intention of the uAPI here is that the migration error > > > > state is exited specifically via the reset ioctl. Maybe that should be > > > > made more clear, but variant drivers can already wrap the core ioctl > > > > for the purpose of determining that mechanism of reset has occurred. > > > It is not just recovering the error state. > > > > > > Any transition to reset changes the firmware state. Eg if userspace > > > uses one of the other emulation paths to trigger the reset after > > > putting the device off running then the driver state and FW state > > > become desynchronized. > > > > > > So all the reset paths need to be synchronized some how, either > > > blocked while in non-running states or aligning the SW state with the > > > new post-reset FW state. > > This only catches the two flavors of FLR and the RESET ioctl itself, so > > we've got gaps relative to "all the reset paths" anyway. I'm also > > concerned about adding arbitrary callbacks for every case that it gets > > too cumbersome to write a wrapper for the existing callbacks. > > > > However, why is this a vfio thing when we have the > > pci_error_handlers.reset_done callback. At best this ought to be > > redundant to that. Thanks, > > > > Alex > > > Alex, > > How about the below patch instead ? > > This will centralize the 'reset_done' notifications for drivers to one place > (i.e. pci_error_handlers.reset_done) and may close the gap that you pointed > on. > > I just followed the logic in vfio_pci_aer_err_detected() from usage and > locking point of view. > > Do we really need to take the &vdev->igate mutex as was done there ? > > The next patch from the series in mlx5 will stay as of in V1, it may just > set its ops and be called upon PCI 'reset_done'. > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c > b/drivers/vfio/pci/vfio_pci_core.c > index e581a327f90d..20bf37c00fb6 100644 > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -1925,6 +1925,27 @@ static pci_ers_result_t > vfio_pci_aer_err_detected(struct pci_dev *pdev, > return PCI_ERS_RESULT_CAN_RECOVER; > } > > +static void vfio_pci_aer_err_reset_done(struct pci_dev *pdev) > +{ > + struct vfio_pci_core_device *vdev; > + struct vfio_device *device; > + > + device = vfio_device_get_from_dev(&pdev->dev); > + if (device == NULL) > + return; Do not add new vfio_device_get_from_dev() calls, this should extract it from the pci_get_drvdata. > + > + vdev = container_of(device, struct vfio_pci_core_device, vdev); > + > + mutex_lock(&vdev->igate); > + if (vdev->ops && vdev->ops->reset_done) > + vdev->ops->reset_done(vdev); > + mutex_unlock(&vdev->igate); > + > + vfio_device_put(device); > + > + return; > +} > + > int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) > { > struct vfio_device *device; > @@ -1947,6 +1968,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure); > > const struct pci_error_handlers vfio_pci_core_err_handlers = { > .error_detected = vfio_pci_aer_err_detected, > + .reset_done = vfio_pci_aer_err_reset_done, > }; > EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers); Most likely mlx5vf should just implement a pci_error_handlers struct and install vfio_pci_aer_err_detected in it. Jason