On 11/01/2022 17:34, Wen Gu wrote: > Thanks for your reply. > > On 2022/1/11 6:03 pm, Karsten Graul wrote: >> On 10/01/2022 10:38, Wen Gu wrote: >>> We encountered a crash in smc_setsockopt() and it is caused by >>> accessing smc->clcsock after clcsock was released. >>> >> >> In the switch() the function smc_switch_to_fallback() might be called which also >> accesses smc->clcsock without further checking. This should also be protected then? >> Also from all callers of smc_switch_to_fallback() ? >> >> There are more uses of smc->clcsock (e.g. smc_bind(), ...), so why does this problem >> happen in setsockopt() for you only? I suspect it depends on the test case. >> > > Yes, it depends on the test case. The crash described here only happens one time when > I run a stress test of nginx/wrk, accompanied with frequent RNIC up/down operations. > > Considering accessing smc->clcsock after its release is an uncommon, low probability > issue and only happens in setsockopt() in my test, I choce an simple way to fix it, using > the existing clcsock_release_lock, and only check in smc_setsockopt() and smc_getsockopt(). > >> I wonder if it makes sense to check and protect smc->clcsock at all places in the code where >> it is used... as of now we had no such races like you encountered. But I see that in theory >> this problem could also happen in other code areas. >> > > But inspired by your questions, I think maybe we should treat the race as a general problem? > Do you think it is necessary to find all the potential race related to the clcsock release and > fix them in a unified approach? like define smc->clcsock as RCU pointer, hold rcu read lock > before accessing smc->clcsock and call synchronize_rcu() before resetting smc->clcsock? just a rough idea :) > > Or we should decide it later, do some more tests to see the probability of recurrence of this problem? I like the idea to use RCU with rcu_assign_pointer() to protect that pointer! Lets go with your initial patch (improved to address the access in smc_switch_to_fallback()) for now because it solves your current problem. I put that RCU thing on our list, but if either of us here starts working on that please let the others know so we don't end up doing parallel work on this. But I doubt that we will be able to start working on that soon. Thanks for the good idea!