On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
> On 7/31/23 13:32, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:17, Jason Gunthorpe wrote:
> >>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> >>>> Network interruptions may cause long delays in the processing of
> >>>> send packets during which time the rxe driver may be unloaded.
> >>>> This will cause seg faults when the packet is ultimately freed as
> >>>> it calls the destructor function in the rxe driver. This has been
> >>>> observed in cable pull fail over fail back testing.
> >>>
> >>> No, module reference counts are only for code that is touching
> >>> function pointers.
> >>
> >> this is exactly the case here. it is the skb destructor function that
> >> is carried by the skb.
> >
> > It can't possibly call it correctly without also having the rxe
> > ib_device reference too though??
>
> Nope. This was causing seg faults in testing when there was a long network
> hang and the admin tried to reload the rxe driver. The skb code doesn't care
> about the ib device at all.

I don't get it, there aren't globals in rxe, so WTF is it doing if it
isn't somehow tracing back to memory that is under the ib_device
lifetime?

Jason