On Thu, Jan 03, 2019 at 08:16:24PM +0200, Leon Romanovsky wrote: > On Thu, Jan 03, 2019 at 11:03:58AM -0700, Jason Gunthorpe wrote: > > On Thu, Jan 03, 2019 at 10:40:34AM +0200, Leon Romanovsky wrote: > > > On Wed, Jan 02, 2019 at 01:07:30PM -0700, Jason Gunthorpe wrote: > > > > On Wed, Jan 02, 2019 at 09:30:24PM +0200, Leon Romanovsky wrote: > > > > > > > > > > struct ib_umad_file { > > > > > > struct mutex mutex; > > > > > > struct ib_umad_port *port; > > > > > > struct list_head recv_list; > > > > > > struct list_head send_list; > > > > > > struct list_head port_list; > > > > > > spinlock_t send_lock; > > > > > > wait_queue_head_t recv_wait; > > > > > > struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; > > > > > > int agents_dead; > > > > > > u8 use_pkey_index; > > > > > > u8 already_used; > > > > > > }; > > > > > > > > > > > > No 'ib_umad_device' in there. > > > > > > > > > > At the beginning of ib_umad_close(), there is a line: > > > > > struct ib_umad_device *dev = file->port->umad_dev; > > > > > > > > > > Because "dev" is not used before ib_umad_dev_put(), compiler probably > > > > > defers evaluation of dev pointer till the end, after "kfree(file);". > > > > > > > > C does not allow that kind of re-ordering. > > > > > > Do you have specific section in mind? > > > > Anthing about sequence points :) > > > > Any function call can change any non-local value, so the compiler is > > not permitted to re-order loads across function calls unless it can > > prove that the value is local or cannot be changed by the function. > > > > > > > I don't know what about others, but for us this patch gave two weeks > > > > > of clean runs without traces, which is not the case without it. > > > > > > > > Come up with an explanation why it works then? > > > > > > We always have bug in that area and my kfree() added the "old" delay? > > > > Maybe.. Or maybe syzkaller is just being sketchy? Was there a reliable > > reproducer? > > It was observed in verification runs and not in syzkaller setup. > > We didn't find reliable reproducer and it didn't occur on single > machine, so only in massive runs (timing ???). I think there is an unbalanced ib_umad_dev_put() someplace in this code, or memory corruption. The trace showed that ib_umad_dev_put() in ib_umad_close() freed the ib_umad_device, which is not allowed, the cdev core code should be holding a kref on it via the kref it holds on the struct ib_umad_port.. I don't see any unbalanced puts, so I'm wondering about the memory corruption possibility.. I'd say drop the redundant put like I suggested and see if we can get a kasn hit on the actual free point? Jason