On 6/15/22 12:47, Bjorn Helgaas wrote:
> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
>> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
>>> Shinichiro,
>>>
>>> On 6/13/22 19:23, Keith Busch wrote:
>>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
>>>>> (CC+: linux-pci)
>>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
>>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> And I am not even sure this is real. I don't know yet why
>>>>>>> this is showing up only now, but this should fix it:
>>>>>>
>>>>>> Hi Keith
>>>>>>
>>>>>> Confirmed the WARNING issue was fixed with the change, here is
>>>>>> the log:
>>>>>
>>>>> Thanks. I also confirmed that Keith's change to add
>>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the WARN, on
>>>>> v5.19-rc2.
>>>>>
>>>>> I took a closer look into this issue and found that the deadlock
>>>>> WARN can be recreated with the following two commands:
>>>>>
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
>>>>>
>>>>> And it can be recreated with PCI devices other than an NVMe
>>>>> controller, such as a SCSI controller or a VGA controller. So
>>>>> this is not a storage sub-system issue.
>>>>>
>>>>> I checked function call stacks of the two commands above. As
>>>>> shown below, it looks like an ABBA deadlock possibility is
>>>>> detected and warned.
>>>>
>>>> Yeah, I was mistaken on this report, so my proposal to suppress
>>>> the warning is definitely not right. If I run both 'echo'
>>>> commands in parallel, I see it deadlock frequently. I'm not
>>>> familiar enough with this code to have any good ideas on how to
>>>> fix it, but I agree this is a generic pci issue.
>>>
>>> I think it is worth adding a testcase to blktests to make sure
>>> that future releases will test this.
>>
>> Yeah, this WARN is confusing for us, so it would be valuable to
>> test it with blktests so that it does not recur.
>> One point I wonder about is: which test group will the test case
>> fall in? The nvme group is probably the one to add it to.
>>

Since this issue was discovered with nvme rescan and remove, it should
be added to the nvme category.

>> Another point I wonder about is kernel test suites other than
>> blktests. Don't we have a more appropriate test suite to check the
>> PCI device rescan/remove race? Such a test sounds more like a PCI
>> bus sub-system test than a block/storage test.

I don't think so. With such a test we could have caught this a long
time ago, but we clearly did not.

>
> I'm not aware of such a test, but it would be nice to have one.
>
> Can you share your qemu config so I can reproduce this locally?
>
> Thanks for finding and reporting this!
>
> Bjorn

-ck