On 4/8/22 12:52, Jason Gunthorpe wrote:
> On Mon, Apr 04, 2022 at 04:50:53PM -0500, Bob Pearson wrote:
>> In the tasklets (completer, responder, and requester) check the
>> return value from rxe_get() to detect failures to get a reference.
>> This only occurs if the qp has had its reference count drop to
>> zero which indicates that it no longer should be used. This is
>> in preparation to an upcoming change that will move the qp cleanup
>> code to rxe_qp_cleanup().
>
> These need some comments explaining how this is safe..
>
> It looks to me like it works because the 0 ref keeps the memory alive
> while a work queue triggers rxe_cleanup_task() (though who fences the
> responder task?)
>
> At least after the next patch, I'm a little unclear how this works
> at this moment..
>
> Jason

I started writing the comment (here)

	If rxe_get() fails, the qp is not going to be around for long
	because its ref count has gone to zero and rxe_complete() is
	cleaning up and returning to rdma-core, which will free the qp.
	However, rxe_do_qp_cleanup() has to finish first, and it will
	wait for the tasklets to finish running. This fixes a
	hard-to-solve bug: the code calling rxe_run_task() holds a
	valid reference on the qp, but the tasklet can be deferred
	until later and that reference may be gone when the tasklet
	starts.

but I realized that at the end of the day there isn't a problem, because
complete/wait_for_completion together with the qp cleanup code shutting
down the tasklets means that the race won't happen once the series is
all in place.

So I will just drop that patch.

Bob
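
For readers following the thread, here is a minimal user-space sketch of the "take a reference only if the count is still nonzero" check that the dropped patch relied on. It is not the rxe code: struct demo_qp, demo_get(), demo_put() and demo_task() are hypothetical names invented for this illustration, and C11 atomics stand in for the kernel's refcounting (the idea is the same as kref_get_unless_zero()). The assumption is that a deferred task must bail out without touching the object once the last reference has been dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a qp; not the real rxe structure. */
struct demo_qp {
	atomic_int ref;   /* reference count; 0 means teardown has begun */
	int work_done;    /* how many times the deferred task did its work */
};

/* Take a reference only if the count is still nonzero. */
static bool demo_get(struct demo_qp *qp)
{
	int old = atomic_load(&qp->ref);

	while (old != 0) {
		/* On failure 'old' is refreshed and the zero check repeats. */
		if (atomic_compare_exchange_weak(&qp->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool demo_put(struct demo_qp *qp)
{
	int old = atomic_fetch_sub(&qp->ref, 1);

	assert(old > 0);
	return old == 1;
}

/* Deferred task body: skip the work if no reference can be taken. */
static bool demo_task(struct demo_qp *qp)
{
	if (!demo_get(qp))
		return false;

	qp->work_done++;
	demo_put(qp);
	return true;
}
```

With the count at 1, demo_task() does its work and leaves the count unchanged. After the final demo_put() brings the count to zero, demo_task() returns false and does nothing. As noted above, this check only covers a task that starts late; something else (in the series, complete/wait_for_completion plus shutting down the tasklets) still has to keep the memory valid until any running task has finished.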