On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni <chaitanyak@xxxxxxxxxx> wrote:
>
> On 6/15/22 12:47, Bjorn Helgaas wrote:
> > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
> >>> Shinichiro,
> >>>
> >>> On 6/13/22 19:23, Keith Busch wrote:
> >>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
> >>>>> (CC+: linux-pci)
> >>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> >>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> And I am not even sure this is real. I don't know yet why
> >>>>>>> this is showing up only now, but this should fix it:
> >>>>>>
> >>>>>> Hi Keith
> >>>>>>
> >>>>>> Confirmed the WARNING issue was fixed with the change, here is
> >>>>>> the log:
> >>>>>
> >>>>> Thanks. I also confirmed that Keith's change to add
> >>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the WARN, on
> >>>>> v5.19-rc2.
> >>>>>
> >>>>> I took a closer look into this issue and found that the deadlock
> >>>>> WARN can be recreated with the following two commands:
> >>>>>
> >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
> >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
> >>>>>
> >>>>> And it can be recreated with PCI devices other than an NVMe
> >>>>> controller, such as a SCSI controller or a VGA controller. So
> >>>>> this is not a storage sub-system issue.
> >>>>>
> >>>>> I checked the function call stacks of the two commands above. As
> >>>>> shown below, it looks like an ABBA deadlock possibility is
> >>>>> detected and warned about.
> >>>>
> >>>> Yeah, I was mistaken on this report, so my proposal to suppress
> >>>> the warning is definitely not right. If I run both 'echo'
> >>>> commands in parallel, I see it deadlock frequently. I'm not
> >>>> familiar enough with this code to have any good ideas on how to
> >>>> fix it, but I agree this is a generic pci issue.
> >>>
> >>> I think it is worth adding a testcase to blktests to make sure
> >>> that future releases will test this.
> >>
> >> Yeah, this WARN is confusing for us, so it would be valuable to
> >> test it with blktests so that it does not recur. One point I wonder
> >> about is: which test group will the test case fall in? The nvme
> >> group is probably the one to add it to.
> >>
>
> Since this issue has been discovered with nvme rescan and remove,
> it should be added to the nvme category.

We already have nvme/032, which tests nvme rescan/reset/remove, and the
issue was reported by running it. Do we still need one more?

>
> >> Another point I wonder about is kernel test suites other than
> >> blktests. Don't we have a more appropriate test suite to check the
> >> PCI device rescan/remove race? Such a test sounds more like a PCI
> >> bus sub-system test than a block/storage test.
>
> I don't think so; we could have caught it a long time back,
> but we clearly did not.
>
> >
> > I'm not aware of such a test, but it would be nice to have one.
> >
> > Can you share your qemu config so I can reproduce this locally?
> >
> > Thanks for finding and reporting this!
> >
> > Bjorn
>
> -ck
>
>

--
Best Regards,
  Yi Zhang