--- On Tue, 12/30/08, Basavanagowda Kanur <gowda@xxxxxxxxxxxxx> wrote:

> If server is down for transport-timeout time, then client
> returns all the calls with 'Transport Endpoint not connected'
> error.

Yes, this is exactly what I do not want. I want reads/writes to simply block while the server is down, and the blocked calls to complete when the server returns. My applications should see only a delay, never an error. Without this it is not possible to recover gracefully from a server/network failure. While we are at it, what unit is the timeout in: seconds or milliseconds?

I have been trying to understand what it would take to implement this feature in the client protocol translator. At first thought, there seem to be two main cases to deal with:

1) requests which have not yet hit the wire, and fail when they attempt to, and
2) requests which have already hit the wire but have not been responded to by the server.

Possibly there is a third, more complex case:

3) requests which hit the server and were responded to, but whose response was never received by the client.

The simplest case seems to be #1: simply wait for the connection to reestablish itself, then retry submitting the request to the wire. I hacked a simple implementation of this (looping in protocol_client_xfer, without holding the lock, until the connection is reestablished) which seems to work, but I have no clue if it is correct. ;) I will attach it below.

For #2, it looks like the client protocol keeps a list of outstanding requests in the saved_frames list. Is there any reason this list could not be resubmitted when the connection is reestablished, instead of being purged when the connection fails (apart from the problems associated with corner case #3)? Is all the required data still in the frame at this point (before protocol_client_cleanup is called)?

Corner case #3 seems like it would require the server to keep track of responses it knows did not reach the client.
If it can resend these responses to the client when the connection is reestablished, the client could process those requests without resending them.

This is my simplistic understanding of the problem. Am I overlooking something major that would prevent this from working? Is this something you would consider implementing or accepting patches for if I can get it to work (although it might be way beyond my abilities)? Am I way off and wasting my time? :(

Thanks,

-Martin

--- xlators/protocol/client/src/client-protocol.c	2008-12-30 17:24:34.000000000 -0700
+++ xlators/protocol/client/src/client-protocol.c.orig	2008-12-30 13:23:26.000000000 -0700
@@ -388,7 +388,6 @@
         gf_hdr_common_t rsphdr = {0, };
         client_forget_t forget = {0, };
         uint8_t send_forget = 0;
-        uint8_t reconnect = 1;
 
         priv = this->private;
         trans = priv->transport;
@@ -431,32 +430,14 @@
                 hdr->req.pid = hton32 (frame->root->pid);
         }
 
-        if(type == GF_OP_TYPE_MOP_REQUEST &&
-           op == GF_MOP_SETVOLUME)
-                reconnect = 0;
-
-        while(1) {
-                if (cprivate->connected == 0)
-                        transport_connect (trans);
-
-                if (cprivate->connected ||
-                    ((type == GF_OP_TYPE_MOP_REQUEST) &&
-                     (op == GF_MOP_SETVOLUME))) {
-                        ret = transport_submit (trans, (char *)hdr, hdrlen,
-                                                vector, count, refs);
-                }
-
-                if (!reconnect || ret >= 0 || cprivate->connected > 0)
-                        break;
+        if (cprivate->connected == 0)
+                transport_connect (trans);
 
-                pthread_mutex_unlock (&cprivate->lock);
-                while (cprivate->connected <= 0) {
-                        gf_log (this->name, GF_LOG_DEBUG,
-                                "protocol_client_xfer waiting for connection(%i)",
-                                cprivate->connected);
-                        sleep(1);
-                }
-                pthread_mutex_lock (&cprivate->lock);
+        if (cprivate->connected ||
+            ((type == GF_OP_TYPE_MOP_REQUEST) &&
+             (op == GF_MOP_SETVOLUME))) {
+                ret = transport_submit (trans, (char *)hdr, hdrlen,
+                                        vector, count, refs);
         }
 
         if ((ret >= 0) && frame) {
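
To make cases #1 and #2 a little more concrete, here is a toy model of the bookkeeping they seem to need. It is only a sketch under assumptions, not GlusterFS code: struct pending_req, struct client_conn, conn_submit, conn_down, conn_up, conn_reply and MAX_PENDING are all hypothetical names invented for illustration, the table merely stands in for the idea of the saved_frames list, and "sending" is modelled by a counter rather than a real transport. The point is the policy: a request submitted while the link is down stays queued instead of failing, a disconnect keeps outstanding requests instead of purging them, and a reconnect resends whatever is still unanswered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64   /* hypothetical fixed capacity, for brevity */

/* Hypothetical stand-in for one outstanding request. */
struct pending_req {
    uint64_t callid;   /* identifies the request on the wire */
    int      in_use;   /* 1 until a reply has been seen */
    int      sends;    /* how many times it was put on the wire */
};

/* Hypothetical stand-in for the client's per-connection state. */
struct client_conn {
    int                connected;
    struct pending_req pending[MAX_PENDING];
};

/* Record a request. If the link is up it is "sent" at once; if not, it
 * just stays queued instead of failing (case #1). Returns -1 if full. */
int conn_submit(struct client_conn *c, uint64_t callid)
{
    assert(c != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_req *p = &c->pending[i];
        if (!p->in_use) {
            p->callid = callid;
            p->in_use = 1;
            p->sends  = c->connected ? 1 : 0;
            return 0;
        }
    }
    return -1;
}

/* Link dropped: keep every outstanding request instead of purging it. */
void conn_down(struct client_conn *c)
{
    assert(c != NULL);
    c->connected = 0;
}

/* Link restored: resend everything still unanswered (case #2).
 * Returns the number of requests put back on the wire. */
int conn_up(struct client_conn *c)
{
    int resent = 0;
    assert(c != NULL);
    c->connected = 1;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (c->pending[i].in_use) {
            c->pending[i].sends++;
            resent++;
        }
    }
    return resent;
}

/* A reply arrived: complete the matching request. Returns -1 for an
 * unknown or already completed call ID, so such a reply is dropped. */
int conn_reply(struct client_conn *c, uint64_t callid)
{
    assert(c != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_req *p = &c->pending[i];
        if (p->in_use && p->callid == callid) {
            p->in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

Note that this resends blindly, so it does nothing for corner case #3: a request the server already executed would be executed a second time. That would presumably be harmless only for operations that are safe to repeat, which is why #3 seems to need help from the server side.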