+gluster-devel

----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Wednesday, January 10, 2018 11:47:31 AM
> Subject: Re: Exact purpose of network.ping-timeout
>
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > Cc: gluster-users@xxxxxxxxxxx
> > Sent: Wednesday, January 10, 2018 10:56:21 AM
> > Subject: Re: Exact purpose of network.ping-timeout
> >
> > Sorry about the delayed response. I had to dig into the history to
> > answer the various "why"s.
> >
> > ----- Original Message -----
> > > From: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > > To: gluster-users@xxxxxxxxxxx
> > > Sent: Tuesday, December 26, 2017 6:41:48 PM
> > > Subject: Exact purpose of network.ping-timeout
> > >
> > > Hi,
> > >
> > > I have a question regarding the "ping-timeout" option. I have been
> > > researching its purpose for a few days and it is not completely clear
> > > to me, especially since the Gluster community apparently strongly
> > > discourages changing, or at least decreasing, this value!
> > >
> > > Assuming that I set ping-timeout to 10 seconds (instead of the default
> > > 42), this would mean that if I have a network outage of 11 seconds
> > > then Gluster internally would have to re-allocate some resources that
> > > it freed after the 10 seconds, correct? But apart from that there are
> > > no negative implications, are there? For instance, if I'm copying
> > > files during the network outage then those files will continue copying
> > > after those 11 seconds.
> > >
> > > This means that the only purpose of ping-timeout is to save those
> > > extra resources that are used up by "short" network outages. Is that
> > > correct?
> >
> > The basic purpose of the ping timer/heartbeat is to identify an
> > unresponsive brick. Unresponsiveness can have various causes, such as:
> >
> > * A deadlocked server. We no longer see too many instances of
> >   deadlocked bricks/servers.
> >
> > * Slow execution of fops in the brick stack, e.g.,
> >   - due to lock contention. There have been some efforts to fix lock
> >     contention in the brick stack.
> >   - a bad backend OS/filesystem. The posix health checker was an effort
> >     to address this.
> >   - not enough threads for execution, etc.
> >   Note that ideally it's not the job of the ping framework to identify
> >   this scenario, and following the same thought process we've shielded
> >   the processing of ping requests on bricks from the cost of executing
> >   requests to the Glusterfs program.
> >
> > * Ungraceful shutdown of network connections, e.g.,
> >   - hard shutdown of the machine/container/VM running the brick
> >   - physically pulling out the network cable
> >   Basically all those scenarios where TCP/IP doesn't get a chance to
> >   inform the other end that it is going down. Note that some of these
> >   ungraceful network shutdown scenarios can be identified using
> >   TCP_KEEPALIVE and TCP_USER_TIMEOUT [1]. However, at the time the
> >   heartbeat mechanism was introduced in Glusterfs, TCP_KEEPALIVE
> >   couldn't identify all the ungraceful network shutdown scenarios and
> >   TCP_USER_TIMEOUT was yet to be implemented in the Linux kernel. One
> >   scenario which TCP_KEEPALIVE could
>
> s/could/couldn't/
>
> >   identify was the exact scenario TCP_USER_TIMEOUT aims to solve -
> >   identifying a hard network shutdown while data is in transit.
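(For illustration, here is a minimal sketch - written for this mail, not
taken from the Glusterfs sources - of how a Linux program could arm both of
those kernel mechanisms on an already-connected TCP socket. The helper name
and the timeout values are made up for the example; keepalive probes only
fire on an otherwise idle connection, while TCP_USER_TIMEOUT bounds how
long unacknowledged in-flight data may stay outstanding before the kernel
declares the peer dead.)

/* Hypothetical helper, not part of the Glusterfs source tree. */
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_USER_TIMEOUT (Linux-specific) */
#include <sys/socket.h>

static int enable_dead_peer_detection(int fd)
{
    int on = 1;
    int idle = 10;                 /* seconds of idle time before the first probe */
    int interval = 5;              /* seconds between probes */
    int count = 3;                 /* failed probes before the peer is declared dead */
    unsigned int utimeout = 30000; /* ms that unacked data may stay outstanding */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) ||
        /* TCP_USER_TIMEOUT needs Linux >= 2.6.37 and headers that define it */
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &utimeout, sizeof(utimeout)))
        return -1;                 /* errno describes the failing option */

    return 0;
}

(With values like these, an idle connection to a silently dead peer would
be torn down after roughly idle + interval * count seconds, and a
connection with data stuck in flight after utimeout milliseconds.)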
> > However, there might be other limitations of TCP_KEEPALIVE which we
> > need to test before retiring the heartbeat mechanism in favor of
> > TCP_KEEPALIVE and TCP_USER_TIMEOUT.
> >
> > The next interesting question is why we need to identify an
> > unresponsive brick at all. The reasons include:
> >
> > * To replace/fix whatever problems the brick might have.
> >
> > * Almost all of the cluster translators - DHT, AFR, EC - wait for a
> >   response from all of their children - either success or failure -
> >   before sending the response back to the application. This means that
> >   one or more slow/unresponsive bricks can increase the latency of
> >   fops/syscalls even though the other bricks are responsive and
> >   healthy. However, there are ongoing efforts to minimize the effect of
> >   a few slow/unresponsive bricks [2]. I think the principles of [2] can
> >   be applied to DHT and AFR too.
> >
> > Some recent discussion on the necessity of the ping framework in
> > glusterfs can be found at [3].
> >
> > Having given all the above reasons for the existence of the ping
> > framework, it's also important that the ping framework doesn't bring
> > down an otherwise healthy connection (false positives). The reasons
> > are:
> >
> > * As pointed out by Joe Julian in another mail on this thread, each
> >   connection carries some state on the bricks, like locks and open fds,
> >   which is cleaned up on a disconnect. So disconnects (even those
> >   followed by quick reconnects) are not completely transparent to the
> >   application. Though the presence of HA layers like EC/AFR mitigates
> >   this problem to some extent, we still don't have a lock-healing
> >   implementation in place. So, once a quorum number of AFR/EC children
> >   have gone down (though perhaps not all at once), locks are no longer
> >   held on the bricks.
> >
> > * All the fops in transit in the window between the disconnect and a
> >   successful reconnect are failed by the rpc/transport layer. So,
> >   depending on the volume configuration (whether AFR/EC/DHT shield the
> >   application from these errors), this *may* result in the application
> >   seeing an error.
> >
> > IOW, disconnects are not lightweight and we need to avoid them whenever
> > possible. Since the action on ping-timer expiry is to disconnect the
> > connection, we suggest not using very low values, to avoid spurious
> > disconnections.

I forgot to touch upon why we disconnect the transport on ping-timer
expiry. To answer this we need to go back to the need to identify
unresponsive bricks. One requirement was to fail the ongoing operations on
the unresponsive brick, so that the syscall from the application completes.
Disconnecting the connection/transport is a clean way of doing this, as the
ongoing fops on a connection are maintained in its state, and on
disconnection these stored fops are failed. Also, once disconnected, the
brick can't submit a response to a fop which the client has already failed,
since the connection is broken, and hence the problem of duplicate
responses doesn't arise.

> > [1] http://man7.org/linux/man-pages/man7/tcp.7.html
> > [2] https://github.com/gluster/glusterfs/issues/366
> > [3] http://lists.gluster.org/pipermail/gluster-devel/2017-January/051938.html
> >
> > > If I am confident that my network will not have many 11-second
> > > outages, and if they do occur I am willing to incur those extra costs
> > > due to resource allocation, is there any reason not to set
> > > ping-timeout to 10 seconds?
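(Side note for anyone who wants to experiment: network.ping-timeout is a
per-volume option set through the gluster CLI. A sketch, assuming a
reasonably recent release - "volume get" is not available on very old
ones - and with <volname> as a placeholder:

    # show the value currently in effect
    gluster volume get <volname> network.ping-timeout

    # lower it to 10 seconds, keeping the caveats discussed above in mind
    gluster volume set <volname> network.ping-timeout 10

Whether a low value such as 10 seconds is safe depends on how often your
network really has multi-second hiccups.)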
> > > The problem I have with a long ping-timeout is that the Windows Samba
> > > client disconnects after 25 seconds. So if one of the nodes of a
> > > Gluster cluster shuts down ungracefully, the Samba client disconnects
> > > and the file that was being copied is left incomplete on the server.
> > > These "costs" seem much higher than the potential costs of those
> > > Gluster resource re-allocations. But it is hard to estimate, because
> > > there is no clear documentation of what exactly those Gluster costs
> > > are.
> > >
> > > In general I would be very interested in a comprehensive explanation
> > > of ping-timeout and the upsides and downsides of setting high or low
> > > values for it.
> > >
> > > Kind regards,
> > > Omar

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users