On 2013-11-20T12:34:43, David Teigland <teigland@xxxxxxxxxx> wrote:

> > (We can't reconnect while the {src ip, port;dst ip, port} is still
> > around.)
>
> I'm not sure, but I think I'm worried about a different problem: messages
> are sent through the old connection, a node restarts, a new connection is
> quickly created, and the *old messages* are received through the new
> connection.

I'm a bit unclear how this could happen, or how this SO_LINGER patch could
make it worse. After all, the patch only makes connection teardown faster -
that is, once we've already called close() on the socket. That happens
after we've gotten a node down event and fenced the node.

And: TCP, which apparently restarts faster anyway, would already have this
problem, and it doesn't seem to have it.

The patch basically helps if there's traffic stuck in flight while a node
crashes. I don't know why that's more likely to happen with SCTP; perhaps
because the damn thing is slower, or because it buffers more due to its
ability to resend over different channels ...

> The dlm tries to detect and discard stale/old messages, but if they
> get through, they can cause problems. I'd like to know whether the
> LINGER change could make this more likely. If so, then we may want
> this change to be a configuration option.

I don't really think it could.

> > pretty realistic and, alas, unavoidable to me. You can hit the same by
> > powering off the node, too.
>
> Right, I wanted to know it was not *only* the simulation case being
> affected.

I wish it was. It took the team quite a while to track down. It's a really
annoying bug, since it can't always be reproduced.


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster