On Wed, Nov 04, 2020 at 05:08:59PM -0500, Ryan Stone wrote:

> If anybody could give input on my analysis and the proposed solution
> I'd really appreciate it.

Yikes, this whole thing is just wrong..

1) We can't migrate QPs across devices. So av and alt_av must be in
   the same cm_dev, and we never check this when forming the AV and
   alt AVs during LAP. Wee

2) cm_remove_one needs to remove all the cm_devs because it is going
   to kfree them. Using altr_send_port_not_ready is foolish because
   what we really want is to NULL the port pointer (we are freeing
   that too)

3) Touching the AV after cm_remove_one(), e.g. for
   rdma_destroy_ah_attr(), is wrong. The AV is part of the cm_dev and
   has to be cleaned up before cm_remove_one can return.

4) The flush_workqueue() in cm_remove_one is wishful thinking, there
   are many places still using the mad_agent that are not on that
   workqueue. A proper 'av_lock' rwsem is going to be needed here

Which is another example of why every time I see some idiotic
'is_closed' flag it is just a sign of wrong, wrong, wrong.

Fixing it requires a full audit of all the places using the AV :\

Jason