On Sun, 9 Jul 2023 13:50:35 +0200 Eric Auger <eric.auger@xxxxxxxxxx> wrote: > Hi Alex, > > On 7/7/23 17:10, Alex Williamson wrote: > > Unlike default access to config space through sysfs, the vpd read and > > write function don't actively manage the runtime power management state > > of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move > > the unused device into low power state with runtime PM"), the vfio-pci > > driver will use runtime power management and release unused devices to > > make use of low power states. Attempting to access VPD information in > > this low power state can result in incorrect information or kernel > > crashes depending on the system behavior. > > > > Wrap the vpd read/write bin attribute handlers in runtime PM and take > > into account the potential quirk to select the correct device to wake. > > This much improved the situation as it is more difficult to hit the > issue. Unfortunately after tens of attempts I was still able to hit a > kernel crash. The console output does not mention the VPD anymore but > PCI power management events (PME). Does combining this with an extended D3hot wakeup for the device make any difference, such as a5a6dd262469 ("PCI/PM: Extend D3hot delay for NVIDIA HDA controllers")? Thanks, Alex > [ 168.616700] CPU: 0 PID: 1409 Comm: kworker/0:5 Not tainted > 6.4.0-vpd-upstream+ #56 > [ 168.624257] Hardware name: GIGABYTE R181-T90-00/MT91-FS1-00, BIOS F34 > 08/13/2020 > [ 168.631639] Workqueue: events_freezable pci_pme_list_scan > [ 168.637032] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS > BTYPE=--) > [ 168.643982] pc : pci_generic_config_read+0x64/0xd8 > [ 168.648763] lr : pci_generic_config_read+0x28/0xd8 > [ 168.653542] sp : ffff80008caebcb0 > [ 168.656844] x29: ffff80008caebcb0 x28: 0000000000000000 x27: > 0000000000000000 > [ 168.663969] x26: 0000000000000000 x25: 0000000000000000 x24: > ffff80008caebd76 > [ 168.671094] x23: ffff0008063fd800 x22: 0000000000000044 x21: > ffff80008232d4c8 > [ 168.678218] x20: ffff80008caebd24 x19: 0000000000000002 x18: > 00000000000040fc > [ 168.685342] x17: 00000000000040f8 x16: 0000000000000000 x15: > 0000000000000001 > [ 168.692466] x14: ffffffffffffffff x13: 0000000000000000 x12: > 0101010101010101 > [ 168.699590] x11: 7f7f7f7f7f7f7f7f x10: fefefefefefefeff x9 : > ffff8000807b3938 > [ 168.706714] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : > 000000000000007f > [ 168.713838] x5 : 0000000000000000 x4 : ffff800090000000 x3 : > 0000000000000000 > [ 168.720962] x2 : 0000000000000044 x1 : 0000000000c00044 x0 : > ffff800090c00044 > [ 168.728086] Call trace: > [ 168.730520] pci_generic_config_read+0x64/0xd8 > [ 168.734953] pci_bus_read_config_word+0x84/0xe8 > [ 168.739471] pci_read_config_word+0x30/0x50 > [ 168.743642] pci_check_pme_status+0x70/0xa8 > [ 168.747813] pci_pme_list_scan+0x84/0x158 > [ 168.751811] process_one_work+0x1ec/0x488 > [ 168.755810] worker_thread+0x48/0x400 > [ 168.759461] kthread+0x10c/0x120 > [ 168.762678] ret_from_fork+0x10/0x20 > [ 168.766245] Code: 52800000 a94153f3 a8c27bfd d65f03c0 (79400000) > [ 168.772326] ---[ end trace 0000000000000000 ]--- > [ 168.776931] Kernel panic - not syncing: synchronous external abort: > Fatal exception > [ 168.784574] SMP: stopping secondary CPUs > [ 169.831001] SMP: failed to stop secondary CPUs 0,199 > [ 169.835955] Kernel Offset: 0x190000 from 0xffff800080000000 > > > Thanks > > Eric > > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > --- > > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > > 1 file changed, 32 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > > index a4fc4d0690fe..81217dd4789f 100644 > > --- a/drivers/pci/vpd.c > > +++ b/drivers/pci/vpd.c > > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, > > size_t count) > > { > > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > > + struct pci_dev *vpd_dev = dev; > > + ssize_t ret; > > + > > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > > + vpd_dev = pci_get_func0_dev(dev); > > + if (!vpd_dev) > > + return -ENODEV; > > + } > > + > > + pci_config_pm_runtime_get(vpd_dev); > > + ret = pci_read_vpd(vpd_dev, off, count, buf); > > + pci_config_pm_runtime_put(vpd_dev); > > + > > + if (dev != vpd_dev) > > + pci_dev_put(vpd_dev); > > > > - return pci_read_vpd(dev, off, count, buf); > > + return ret; > > } > > > > static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > > @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > > size_t count) > > { > > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > > + struct pci_dev *vpd_dev = dev; > > + ssize_t ret; > > + > > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > > + vpd_dev = pci_get_func0_dev(dev); > > + if (!vpd_dev) > > + return -ENODEV; > > + } > > + > > + pci_config_pm_runtime_get(vpd_dev); > > + ret = pci_write_vpd(vpd_dev, off, count, buf); > > + pci_config_pm_runtime_put(vpd_dev); > > + > > + if (dev != vpd_dev) > > + pci_dev_put(vpd_dev); > > > > - return pci_write_vpd(dev, off, count, buf); > > + return ret; > > } > > static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0); > > >