On Wed, Dec 29, 2021 at 01:51:27PM +0100, Karsten Graul wrote:
>On 28/12/2021 16:13, Wen Gu wrote:
>> We encountered some crashes caused by the race between SMC-R
>> link access and link clear triggered by link group termination
>> in abnormal case, like port error.
>
>Without to dig deeper into this, there is already a refcount for links, see smc_wr_tx_link_hold().
>In smc_wr_free_link() there are waits for the refcounts to become zero.
>
>Why do you need to introduce another refcounting instead of using the existing?
>And if you have a good reason, do we still need the existing refcounting with your new
>implementation?
>
>Maybe its enough to use the existing refcounting in the other functions like smc_llc_flow_initiate()?
>
>Btw: it is interesting what kind of crashes you see, we never met them in our setup.

We are trying to use SMC + RDMA to boost application performance. We now
have a cloud product called ERDMA that can be used inside virtual
machines. We are testing SMC with link down/up events under short-flow
workloads, because in a cloud environment the RDMA device may be plugged
in and out frequently, and among the many different applications some
have very short flows.

>Its great to see you evaluating SMC in a cloud environment!

Thanks! We are trying to use SMC to boost performance for cloud
applications, and we hope SMC can become more generic and widely used.