On 6/27/24 00:15, Bjorn Helgaas wrote: >>> Yes, I am talking about the PCI "Function Level Reset" >>> >>>> FLR and disk/controller suspend execution timing are unrelated. FLR can be >>>> triggered at any time through sysfs. So please give details here. Why is FLR >>>> done when the system is being suspended ? >>> >>> Yes, it is because FLR can be triggered at any time that we are testing the >>> reliability of executing FLR commands after disk/controller suspended. >> >> "can be triggered" ? FLR is not a random asynchronous event. It is an action >> that is *issued* by a user with sys admin rights. And such users can do a lot >> of things that can break a machine... >> >> I fail to see the point of doing a function reset while the device is >> suspended. But granted, I guess the device should comeback up in such case, >> though I would like to hear what the PCI guys have to say about this. >> >> Bjorn, >> >> Is reseting a suspended PCI device something that should be/is supported ? > > I doubt it. The PCI core should be preserving all the generic PCI > state across suspend/resume. The driver should only need to > save/restore device-specific things the PCI core doesn't know about. > > A reset will clear out most state, and the driver doesn't know the > reset happened, so it will expect most device state to have been > preserved. That is what I suspected. However, checking the code, reset_store() in pci-sysfs.c does: pm_runtime_get_sync(dev); result = pci_reset_function(pdev); pm_runtime_put(dev); and pm_runtime_get_sync() calls __pm_runtime_resume() which will resume a suspended device. So while I still think it is not a good idea to reset a suspended device, things should still work as execpected and not cause any problem with the device state, right ? Yihang, I think that the issue at hand here is that once the reset finishes, the controller goes back to suspended state, and I suspect that is because of the "auto" setting for its power/control. That triggers because the FLR is done after the controller resumed but *before* the revalidation of the drives connected to it completes. So FLR makes the revalidation fail (scsi scan/revalidation is asynchronous...). This seems to me to be the expected behavior for what you are doing and I fail to see how that ever worked correctly, even before 0c76106cb975 and 626b13f015e0. Could you try this: add a call to msleep(30000) at the end of _resume_v3_hw(). I.e.: diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index feda9b54b443..54224568d749 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -5104,6 +5104,8 @@ static int _resume_v3_hw(struct device *device) dev_warn(dev, "end of resuming controller\n"); + msleep(30000); + return 0; } To see if it makes any difference to actually wait for the connected disks to resume correctly before doing the FLR. -- Damien Le Moal Western Digital Research