On 2024/7/1 11:03, Damien Le Moal wrote:
> On 6/24/24 21:10, Yihang Li wrote:
>>> Thank you for the explanation, but as Niklas said, it would be a lot easier for
>>> me to recreate the issue if you send the exact commands you execute to trigger
>>> the issue. E.g. "suspend all disks" in step a can have a lot of different
>>> meanings depending on which type of suspend you are using... So please send the
>>> exact commands you use.
>>> is what exactly ? autosuspend ? or something else ?
>
> I am failing to recreate the exact same issue. I do see a lot of bad things
> happening though, but that is not looking like what you sent. I do end up with
> the 4 drives connected on my HBA being disabled by libata as revalidate/IDENTIFY
> fails. And even worse: I hit a deadlock on dev->mutex when I try to do "rmmod
> pm80xx" after running your test.
>
> I am using a pm80xx adapter as that is the only libsas adapter I have.
>
> I think your test just kicked a big can of worms... There seem to be a lot of
> wrong things going on, but I now need to sort out if the problems are with the
> pm80xx driver, libsas, libata or sd. Probably a combination of all.
>
> ATA device suspend/resume has been a constant source of issues since the scsi layer
> switched to doing PM operations asynchronously. Your issue is the latest one.
> This will take a while to debug.
>
>> In step a, I suspend all disks by issuing the following command to all disks
>> attached to the SAS controller 0000:b4:02.0:
>> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/control
>> [root@localhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/autosuspend_delay_ms
>> ...
>> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/control
>> [root@localhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/autosuspend_delay_ms
>
> This works as expected on my system and I see my drives going to sleep after 5s.
>
>> Step b, Suspend the SAS controller:
>> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/power/control
>
> This has no effect for me. Can you confirm that your controller is actually
> sleeping ? I.e., what do the following show ?
>
> cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_active_kids
> cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_status

I don't have a sysfs node for runtime_active_kids on my system.
My controller's runtime_status changed to "suspended" after step b.
[root@localhost ~]# cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_status
suspended

>
> ?
>
>> At this point, the SAS controller is suspended. Next, step c is to trigger a PCI FLR.
>> [root@localhost ~]# echo 1 > /sys/bus/pci/devices/0000:b4:02.0/reset
>
> What does
>
> cat /sys/bus/pci/devices/0000:b4:02.0/reset_method
>
> show on your system ?
>
> Mine is "bus" only.

The results on my system are as follows:
[root@localhost ~]# cat /sys/devices/pci0000:b4/0000:b4:02.0/reset_method
acpi flr pm

>
>>>> The issue 2:
>>>> a. Suspend all disks on controller B.
>>>> b. Suspend controller B.
>>>> c. Resuming all disks on controller B.
>>>> d. Run the "lsmod" command to check the driver reference counting.
>
> What is the reference count before you do step (a), after you run step (b) and
> at step (d) ?

Before step a, the hisi_sas driver reference count is 0.
After step b, the driver reference count is 0.
At step d, the reference count is 2405 (this value is not the same every time).
hisi_sas_v3_hw         77824  2405
hisi_sas_main          45056  1 hisi_sas_v3_hw
libsas                 98304  2 hisi_sas_v3_hw,hisi_sas_main

>
> For my system using the pm80xx driver, I get:
>
> pm80xx                352256  0
> libsas                155648  1 pm80xx
>
> before and after, and that is all normal. But there is the difference that
> suspending the pm80xx controller does not seem to be supported and does nothing.
>
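
To make the before/after comparison easier to repeat, here is a minimal sketch
of helpers that record the module reference count and the controller's runtime
PM status at each step. The function names and the SYSFS_ROOT override are
assumptions made for illustration and are not commands used earlier in this
thread. The sketch relies on /sys/module/<name>/refcnt, which reports the same
count as the third column of lsmod when module unloading is enabled.

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_ROOT are illustrative assumptions.
# SYSFS_ROOT can be pointed at another directory for dry runs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the reference count of a loaded module, or "n/a" if the
# module is not loaded or the refcnt attribute is not available.
module_refcnt() {
    refcnt_file="$SYSFS_ROOT/module/$1/refcnt"
    if [ -r "$refcnt_file" ]; then
        cat "$refcnt_file"
    else
        echo "n/a"
    fi
}

# Print the runtime PM status of a PCI device given its address
# (e.g. 0000:b4:02.0), or "n/a" if the attribute cannot be read.
pci_runtime_status() {
    status_file="$SYSFS_ROOT/bus/pci/devices/$1/power/runtime_status"
    if [ -r "$status_file" ]; then
        cat "$status_file"
    else
        echo "n/a"
    fi
}

# Print one line for a test step.
# $1 = step label, $2 = module name, $3 = PCI address
snapshot() {
    printf '%s: refcnt(%s)=%s runtime_status(%s)=%s\n' \
        "$1" "$2" "$(module_refcnt "$2")" "$3" "$(pci_runtime_status "$3")"
}
```

After sourcing the helpers, a call such as
    snapshot "before step a" hisi_sas_v3_hw 0000:b4:02.0
could be repeated after step b and at step d, so that the three values are
collected the same way each time.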