On Thu, Oct 14, 2010 at 3:26 AM, Mark Lehrer <mark@xxxxxxx> wrote:
...
> The simplest would be for the clients to reconnect to the new server and
> re-establish communications.

This is how almost all initiators work today, and it is quite transparent to applications, as long as the failover to the "new" server is reasonably quick and completes before the initiators run out of writeback cache. Once you get "Delayed Write Failed" dialogs, that is when the pain starts. For blocking reads, applications only see pauses of milliseconds to a few seconds during the session failure and reconnect.

(If you use ctdb for failover and its "tcptickle" feature, this can short-circuit any TCP sessions hung inside retransmission timeouts and speed up recovery. During failover of clustered applications it is very common for clients to be stuck inside TCP retransmission backoff, sitting for 10, 20, even 40 seconds before TCP detects the session failure, which greatly increases the recovery time. There are certain tricks in TCP that ctdb uses, and other applications could too, to short-circuit the client timeouts and trigger session recovery immediately. I.e. most of the time spent paused during failover is not actually spent in failover at all, but rather waiting for the TCP stack on the clients to detect the session failure. TCP retransmission backoff is not your friend here, but it can usually be short-circuited from the server with clever TCP hacks.)

> However, how painful would it be for the new
> server to keep the same sockets open for a truly seamless failover? Again,
> I am only concerned about the tgtd internal states at this point - assume
> that the block device mirroring as well as the
> keepalived/heartbeat/iptables/fencing/etc issues are handled already (though
> there would obviously be a good bit of integration work there!).
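As an aside, the tickle trick mentioned above can be sketched roughly as follows. The idea is that the server sends the client an ACK carrying a deliberately bogus sequence number; the client's TCP stack replies with a correct ACK, and since the new server has no matching connection it answers with a RST, tearing the stale session down immediately instead of waiting out retransmission backoff. This is only an illustrative sketch, not ctdb's actual implementation: it builds the 20-byte TCP header per RFC 793 but leaves the checksum zero and does not send anything (sending requires a raw socket and root), and the port numbers are made up for the example.

```python
import struct

def build_tickle_ack(src_port, dst_port):
    """Build a minimal 20-byte TCP header for a "tickle" ACK.

    seq is deliberately bogus (0) so the client's TCP stack replies
    with an ACK carrying its real sequence numbers; the new server,
    holding no such connection, responds with a RST that kills the
    stale session at once.
    """
    seq = 0                            # bogus on purpose
    ack = 0
    offset_flags = (5 << 12) | 0x010   # data offset = 5 words, ACK flag set
    window = 0
    checksum = 0                       # real code must compute this
    urgent = 0
    return struct.pack("!HHIIHHHH",
                       src_port, dst_port, seq, ack,
                       offset_flags, window, checksum, urgent)

# Example: a tickle from the iSCSI target port to a hypothetical client port.
hdr = build_tickle_ack(3260, 54321)
```

In real deployments ctdb handles all of this for you; the point of the sketch is just that the packet is trivial to construct, and the payoff is skipping tens of seconds of client-side backoff.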
Keeping application state and kernel state (TCP state) in sync across nodes is horribly complex and difficult, and making it into transparent failover is very hard. I personally do not think that is required for the iSCSI protocol, since
*) there is so little state required in iSCSI
*) all initiators quickly reconnect and quickly rebuild all required state in almost all situations anyway

regards
ronnie sahlberg
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html