On 8/1/23 17:56, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:32, Jason Gunthorpe wrote:
>>> On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
>>>> On 7/31/23 13:17, Jason Gunthorpe wrote:
>>>>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
>>>>>> Network interruptions may cause long delays in the processing of
>>>>>> send packets during which time the rxe driver may be unloaded.
>>>>>> This will cause seg faults when the packet is ultimately freed as
>>>>>> it calls the destructor function in the rxe driver. This has been
>>>>>> observed in cable pull fail over fail back testing.
>>>>>
>>>>> No, module reference counts are only for code that is touching
>>>>> function pointers.
>>>>
>>>> this is exactly the case here. it is the skb destructor function that
>>>> is carried by the skb.
>>>
>>> It can't possibly call it correctly without also having the rxe
>>> ib_device reference too though??
>>
>> Nope. This was causing seg faults in testing when there was a long network
>> hang and the admin tried to reload the rxe driver. The skb code doesn't care
>> about the ib device at all.
>
> I don't get it, there aren't globals in rxe, so WTF is it doing if it
> isn't somehow tracing back to memory that is under the ib_device
> lifetime?
>
> Jason

When the rxe driver builds a send packet, it stores the address of its
destructor function in the skb before calling ip_local_out() to send it.
From that point on, a pointer into the driver's code is held by an skb
that the driver no longer controls. If the module exit routine is not
delayed until all of those skbs have been freed, the destructor call can
land in unloaded code and cause a seg fault. The only way to trigger this
is to run rmmod on the driver too early, but people do this occasionally
and report it as a bug.

Bob
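
The lifetime problem described above can be sketched as a small userspace
model in C. This is only an illustration, not rxe or kernel code: the types
fake_module and fake_skb and the drv_*/net_* functions are hypothetical
names invented for the sketch. The reference count loosely mirrors what
try_module_get()/module_put() would provide in the kernel, and the assert
in the destructor stands in for the seg fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a loadable module: a refcount and a loaded flag. */
struct fake_module {
    int refcount;
    bool loaded;
};

/* Hypothetical stand-in for an skb that carries a destructor pointer. */
struct fake_skb {
    void (*destructor)(struct fake_skb *skb);
    struct fake_module *owner;
};

/* "Driver" code: only safe to call while the owning module is loaded. */
static void drv_skb_destructor(struct fake_skb *skb)
{
    /* In the real failure this would be a jump into unloaded code. */
    assert(skb->owner->loaded);
    skb->owner->refcount--;     /* drop the reference taken at build time */
}

/* "Driver" builds a packet and stores its destructor address in it. */
struct fake_skb *drv_build_packet(struct fake_module *mod)
{
    struct fake_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    mod->refcount++;            /* pin the module while the packet is in flight */
    skb->destructor = drv_skb_destructor;
    skb->owner = mod;
    return skb;
}

/* "Network stack" frees the packet, possibly long after it was sent. */
void net_free_skb(struct fake_skb *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
    free(skb);
}

/* "rmmod": refuses to unload while any packet still points at driver code. */
bool drv_try_unload(struct fake_module *mod)
{
    if (mod->refcount > 0)
        return false;
    mod->loaded = false;
    return true;
}
```

In this model, calling drv_try_unload() while a packet is still outstanding
returns false, and it succeeds only after net_free_skb() has run the
destructor. Without the reference count, the unload would succeed early and
the later destructor call would trip the assert.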