On 7/31/23 13:23, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:20:35PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:12, Jason Gunthorpe wrote:
>>> On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
>>>> In cable pull testing some NICs can hold a send packet long enough
>>>> to allow ulp protocol stacks to destroy the qp and the cleanup
>>>> routines to timeout waiting for all qp references to be released.
>>>> When the NIC driver finally frees the SKB the qp pointer is no longer
>>>> valid and causes a seg fault in rxe_skb_tx_dtor().
>>>>
>>>> This patch passes the qp index instead of the qp to the skb destructor
>>>> callback function. The call back is required to lookup the qp from the
>>>> index and if it has been destroyed the lookup will return NULL and the
>>>> qp will not be referenced avoiding the seg fault.
>>>
>>> And what if it is a different QP returned?
>>>
>>> Jason
>>
>> Since we are using xarray cyclic alloc you would have to create 16M QPs before the
>> index was reused. This is as good as it gets I think.
>
> Sounds terrible, why can't you store the QP pointer instead and hold a
> refcount on it?

The goal here was to make packet send semantics 'fire and forget', i.e. once
we send the packet we do not have any dependencies hanging around. But we still
wanted to count the pending packets to avoid overrunning the send queue. This
allows Lustre to do its normal error recovery: destroy the qp and try to create
a new one when it times out.

Bob

>
> Jason
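
For reference, here is a rough userspace sketch of the index-based scheme discussed above. It is not the rxe code: the pool is a small fixed array standing in for the xarray, and every name in it (qp_pool, pool_alloc_cyclic, send_packet, packet_dtor) is hypothetical. It only illustrates the idea that the packet carries the qp index, and the destructor looks the qp up and skips the accounting if the lookup returns NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny index space for illustration; the thread mentions roughly 16M. */
#define POOL_SIZE 8u

/* Hypothetical stand-in for a queue pair. */
struct qp {
	unsigned int index;	/* index stored with each sent packet */
	int pending;		/* packets sent but not yet freed */
};

/* Hypothetical stand-in for the index-to-qp mapping. */
struct qp_pool {
	struct qp *slot[POOL_SIZE];
	unsigned int next;	/* where the next cyclic search starts */
};

/* Cyclic allocation: search from pool->next and wrap around.
 * Returns 0 on success, -1 if every slot is in use. */
static int pool_alloc_cyclic(struct qp_pool *pool, struct qp *qp)
{
	for (unsigned int i = 0; i < POOL_SIZE; i++) {
		unsigned int idx = (pool->next + i) % POOL_SIZE;

		if (pool->slot[idx] == NULL) {
			pool->slot[idx] = qp;
			qp->index = idx;
			qp->pending = 0;
			pool->next = (idx + 1) % POOL_SIZE;
			return 0;
		}
	}
	return -1;
}

/* Returns the qp at index, or NULL if it is out of range or destroyed. */
static struct qp *pool_lookup(const struct qp_pool *pool, unsigned int index)
{
	return index < POOL_SIZE ? pool->slot[index] : NULL;
}

/* Destroy does not wait for pending packets ("fire and forget"). */
static void pool_destroy(struct qp_pool *pool, struct qp *qp)
{
	assert(pool->slot[qp->index] == qp);
	pool->slot[qp->index] = NULL;
}

/* "Send" a packet: count it and return the index the packet carries. */
static unsigned int send_packet(struct qp *qp)
{
	qp->pending++;
	return qp->index;
}

/* Packet destructor: look the qp up by index.
 * Returns 1 if the qp was found and its count dropped, 0 if it is gone. */
static int packet_dtor(struct qp_pool *pool, unsigned int index)
{
	struct qp *qp = pool_lookup(pool, index);

	if (qp == NULL)
		return 0;	/* qp already destroyed, nothing to touch */
	qp->pending--;
	return 1;
}
```

In this sketch a freed index is handed out again after at most POOL_SIZE further allocations. A destructor that runs after that point would find a different qp and decrement its counter, which is the case the "different QP returned" question is about.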