On Tue, Apr 7, 2020 at 4:35 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Apr 07, 2020 at 02:39:42PM +0200, Dmitry Vyukov wrote:
> > On Tue, Apr 7, 2020 at 1:55 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Tue, Apr 07, 2020 at 11:56:30AM +0200, Dmitry Vyukov wrote:
> > > > > I'm not sure what could be done wrong here to elicit this:
> > > > >
> > > > > sysfs group 'power' not found for kobject 'umad1'
> > > > >
> > > > > ??
> > > > >
> > > > > I've seen another similar sysfs related trigger that we couldn't
> > > > > figure out.
> > > > >
> > > > > Hard to investigate without a reproducer.
> > > >
> > > > Based on all of the sysfs-related bugs I've seen, my bet would be on
> > > > some races. E.g. one thread registers devices, while another
> > > > unregisters these.
> > >
> > > I did check that the naming is ordered right, at least we won't be
> > > concurrently creating and destroying umadX sysfs of the same names.
> > >
> > > I'm also fairly sure we can't be destroying the parent at the same
> > > time as this child.
> > >
> > > Do you see the above commonly? Could it be some driver core thing? Or
> > > is it more likely something wrong in umad?
> >
> > Mmmm... I can't say, I am looking at some bugs very briefly. I've
> > noticed that sysfs comes up periodically (or was it some other similar
> > fs?).
>
> Hmm..
>
> Looking at the git history I see several cases where there are
> ordering problems. I wonder if the rdma parent device is being
> destroyed before the rdma devices complete destruction?
>
> I see the syzkaller is creating a bunch of virtual net devices, and I
> assume it has created a software rdma device on one of these virtual
> devices.
>
> So I'm guessing that it is also destroying a parent? But I can't guess
> which.. Some simple tests with veth suggest it is OK because the
> parent is virtual. But maybe bond or bridge or something?
>
> The issue in rdma is that unregistering a netdev triggers an async
> destruction of the RDMA devices. This has to be async because the
> netdev notification is delivered with RTNL held, and a rdma device
> cannot be destroyed while holding RTNL.
>
> So there is a race, I suppose, where the netdev can complete
> destruction while rdma continues, and if someone deletes the sysfs
> holding the netdev before rdma completes, I'm going to guess, that we
> hit this warning?
>
> Could it be? I would love to know what netdev the rdma device was
> created on, but it doesn't seem to show in the trace :\
>
> This theory could be made more likely by adding a sleep to
> ib_unregister_work() to increase the race window - is there some way
> to get syzkaller to search for a reproducer with that patch?

Too bad it happened in kthread context. Otherwise it's usually
possible to pinpoint the test based on the process name.

The syz-repro utility will run the reproduction process with any
kernel you give it:
https://github.com/google/syzkaller/blob/master/docs/reproducing_crashes.md

Or it's possible to run individual programs, or the whole log, with
the syz-execprog utility:
https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md

Or maybe you could pinpoint the guilty test program by hand in the log
(it's probably somewhere closer to the end):
https://syzkaller.appspot.com/x/log.txt?x=119dd16de00000
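
To make the suspected ordering concrete, here is a small userspace
model of the theory quoted above. It is only a sketch, not kernel
code: struct model, remove_parent_sysfs(), child_teardown() and
race_hits_warning() are hypothetical names made up for illustration,
and the model assumes the warning fires exactly when the async
teardown runs after the parent's sysfs directory is gone, which is
still an unconfirmed guess.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model {
	bool parent_sysfs_present; /* sysfs directory of the parent netdev */
	bool child_registered;     /* umadX-style child under that parent */
	bool warned;               /* teardown found the parent already gone */
};

/* Models whoever deletes the sysfs directory holding the netdev. */
void remove_parent_sysfs(struct model *m)
{
	m->parent_sysfs_present = false;
}

/* Models the async RDMA teardown that runs later from a work item. */
void child_teardown(struct model *m)
{
	if (!m->child_registered)
		return;
	/* Assumption: a missing parent is what produces the warning. */
	if (!m->parent_sysfs_present)
		m->warned = true;
	m->child_registered = false;
}

/*
 * Replays one unregister sequence on a simulated clock (milliseconds).
 * The teardown is queued at t=0 and runs at t=work_delay_ms; the parent
 * sysfs directory is removed at t=parent_removal_ms. A tie is modeled
 * as the removal happening first. Returns true if the warning is hit.
 */
bool race_hits_warning(int work_delay_ms, int parent_removal_ms)
{
	struct model m = {
		.parent_sysfs_present = true,
		.child_registered = true,
		.warned = false,
	};

	assert(work_delay_ms >= 0 && parent_removal_ms >= 0);

	if (work_delay_ms < parent_removal_ms) {
		child_teardown(&m);
		remove_parent_sysfs(&m);
	} else {
		remove_parent_sysfs(&m);
		child_teardown(&m);
	}
	return m.warned;
}
```

In this model race_hits_warning(0, 50) returns false and
race_hits_warning(200, 50) returns true, which is the effect an added
sleep in the work function is meant to have: it pushes the teardown
past the parent removal so the bad ordering becomes the common case.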