On 25.11.24 11:00, Wen Gu wrote: >> I wonder if this can deadlock, when you take lock_sock so far down in the callchain. >> example: >> smc_connect will first take lock_sock(sk) and then mutex_lock(&smc_server_lgr_pending); (e.g. in smc_connect_ism()) >> wheras >> smc_listen_work() will take mutex_lock(&smc_server_lgr_pending); and then lock_sock(sk) (in your __smc_conn_abort(,,false)) >> >> I am not sure whether this can be called on the same socket, but it looks suspicious to me. >> > > IMHO this two paths can not occur on the same sk. > >> >> All callers of smc_conn_abort() without socklock seem to originate from smc_listen_work(). >> That makes me think whether smc_listen_work() should do lock_sock(sk) on a higher level. >> > > Yes, I also think about this question, I guess it is because the new smc sock will be > accepted by userspace only after smc_listen_work() is completed. Before that, no userspace > operation occurs synchronously with it, so it is not protected by sock lock. But I am not > sure if there are other reasons, so I did not aggressively protect the entire smc_listen_work > with sock lock, but chose a conservative approach. > >> Do you have an example which function could collide with smc_listen_work()? >> i.e. have you found a way to reproduce this? >> > > We discovered this during our fault injection testing where the rdma driver was rmmod/insmod > sporadically during the nginx/wrk 1K connections test. > > e.g. > > __smc_lgr_terminate | smc_listen_decline > (caused by rmmod mlx5_ib) | (caused by e.g. reg mr fail) > -------------------------------------------------------------- > lock_sock | > smc_conn_kill | smc_conn_abort > \- smc_conn_free | \- smc_conn_free > release_sock | Thank you for the explanations. So the most suspicious scenario is smc_listen_work() colliding with __smc_lgr_terminate() -> smc_conn_kill() of the conn and smc socket that is just under construction by smc_listen_work() (without socklock). I am wondering, if other parts of smc_listen_work() are allowed to run in parallel with smc_conn_kill() of this smc socket?? My impression would be that the whole smc_listen_work() should be protected against smc_conn_kill(), not only smc_conn_free.