Thanks for your reply.
On 2021/12/29 8:51 pm, Karsten Graul wrote:
> On 28/12/2021 16:13, Wen Gu wrote:
>> We encountered some crashes caused by the race between SMC-R
>> link access and link clear triggered by link group termination
>> in abnormal cases, like port errors.
>
> Without digging deeper into this, there is already a refcount for links, see smc_wr_tx_link_hold().
> In smc_wr_free_link() there are waits for the refcounts to become zero.
Thanks for the reminder. We also noticed link->wr_tx_refcnt when trying to fix this issue.
In my humble opinion, link->wr_tx_refcnt is used to fix the race between the waiters for a
tx work request buffer (mainly LLC/CDC msgs) and the link down processing that finally clears the
link.
But the issue in this patch is about the race between the access to the link by the connections
above it (like in listen or connect processing) and the link clear processing that memsets the link
to zero and releases its resources. So it seems that the two should not share the same reference count?
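To make the distinction concrete, here is a minimal user-space sketch (all names and the structure are hypothetical, not the actual kernel code) of how two independent refcounts can guard different critical sections of the same link:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two counts discussed above:
 * wr_tx_refcnt is held only around a tx WR buffer wait, while
 * conn_refcnt (the new count in this patch) is held for as long
 * as a connection references the link. */
struct demo_link {
	atomic_int wr_tx_refcnt;
	atomic_int conn_refcnt;
	bool cleared;
};

static void demo_wr_tx_link_hold(struct demo_link *lnk)
{
	atomic_fetch_add(&lnk->wr_tx_refcnt, 1);
}

static void demo_wr_tx_link_put(struct demo_link *lnk)
{
	atomic_fetch_sub(&lnk->wr_tx_refcnt, 1);
}

static void demo_conn_link_hold(struct demo_link *lnk)
{
	atomic_fetch_add(&lnk->conn_refcnt, 1);
}

static void demo_conn_link_put(struct demo_link *lnk)
{
	atomic_fetch_sub(&lnk->conn_refcnt, 1);
}

/* The real smc_wr_free_link() sleeps until the count reaches zero;
 * this sketch simply refuses to clear a link that is still held. */
static bool demo_link_try_clear(struct demo_link *lnk)
{
	if (atomic_load(&lnk->wr_tx_refcnt) ||
	    atomic_load(&lnk->conn_refcnt))
		return false;
	lnk->cleared = true;
	return true;
}
```

The point of keeping the counts separate is that they protect windows of different lengths: a tx hold spans one message send, while a connection hold spans the whole listen/connect processing.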
> Why do you need to introduce another refcount instead of using the existing one?
> And if you have a good reason, do we still need the existing refcounting with your new
> implementation?
Yes, we still need it.
In my humble opinion, link->wr_tx_refcnt can ensure that CDC/LLC message sends won't wait on
an already cleared link. And LLC messages may be triggered by underlying events like net device
port add/error.
But this patch's implementation only ensures that the access to the link by connections is safe and
that SMC connections won't touch something that was already freed during their life cycle, like in
listen/connect processing. It can't cover the link access by LLC messages, which may be triggered by
underlying events.
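The connection-side count described above follows the usual release-on-last-put pattern; a hedged sketch (hypothetical names, not the patch's actual code) of that idea:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: each connection holds a reference for its
 * whole listen/connect processing, and the teardown only runs once
 * the last reference is dropped. */
struct demo_link {
	atomic_int refcnt;
	bool freed;
};

static void demo_link_hold(struct demo_link *lnk)
{
	atomic_fetch_add(&lnk->refcnt, 1);
}

static void demo_link_release(struct demo_link *lnk)
{
	/* In the kernel this is where the link would be memset to
	 * zero and its resources released. */
	lnk->freed = true;
}

static void demo_link_put(struct demo_link *lnk)
{
	/* fetch_sub returns the previous value, so 1 means this
	 * was the last reference. */
	if (atomic_fetch_sub(&lnk->refcnt, 1) == 1)
		demo_link_release(lnk);
}
```

An LLC send triggered by a device event takes no such connection-side reference, which is why this count alone cannot protect that path.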
> Maybe it's enough to use the existing refcounting in the other functions like smc_llc_flow_initiate()?
> Btw: it is interesting what kind of crashes you see, we never met them in our setup.
These crashes, as well as the link group free crashes mentioned in the [1/2] patch, can be reproduced
by bringing the net device up and down frequently during testing.
> It's great to see you evaluating SMC in a cloud environment!
Thanks! Hope that SMC will be widely used. It is an amazing protocol!
Cheers,
Wen Gu