On Tue, Apr 07, 2020 at 02:39:42PM +0200, Dmitry Vyukov wrote:
> On Tue, Apr 7, 2020 at 1:55 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >
> > On Tue, Apr 07, 2020 at 11:56:30AM +0200, Dmitry Vyukov wrote:
> > > > I'm not sure what could be done wrong here to elicit this:
> > > >
> > > >  sysfs group 'power' not found for kobject 'umad1'
> > > >
> > > > ??
> > > >
> > > > I've seen another similar sysfs related trigger that we couldn't
> > > > figure out.
> > > >
> > > > Hard to investigate without a reproducer.
> > >
> > > Based on all of the sysfs-related bugs I've seen, my bet would be on
> > > some races. E.g. one thread registers devices, while another
> > > unregisters these.
> >
> > I did check that the naming is ordered right, at least we won't be
> > concurrently creating and destroying umadX sysfs of the same names.
> >
> > I'm also fairly sure we can't be destroying the parent at the same
> > time as this child.
> >
> > Do you see the above commonly? Could it be some driver core thing? Or
> > is it more likely something wrong in umad?
>
> Mmmm... I can't say, I am looking at some bugs very briefly. I've
> noticed that sysfs comes up periodically (or was it some other similar
> fs?).

Hmm..

Looking at the git history I see several cases where there are
ordering problems. I wonder if the rdma parent device is being
destroyed before the rdma devices complete destruction?

I see the syzkaller is creating a bunch of virtual net devices, and I
assume it has created a software rdma device on one of these virtual
devices.

So I'm guessing that it is also destroying a parent? But I can't guess
which.. Some simple tests with veth suggest it is OK because the
parent is virtual. But maybe bond or bridge or something?

The issue in rdma is that unregistering a netdev triggers an async
destruction of the RDMA devices. This has to be async because the
netdev notification is delivered with RTNL held, and a rdma device
cannot be destroyed while holding RTNL.

So there is a race, I suppose, where the netdev can complete
destruction while rdma continues, and if someone deletes the sysfs
holding the netdev before rdma completes, my guess is that we hit
this warning? Could it be?

I would love to know what netdev the rdma device was created on, but
it doesn't seem to show in the trace :\

This theory could be tested by adding a sleep to ib_unregister_work()
to widen the race window - is there some way to get syzkaller to
search for a reproducer with that patch?

Jason