----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Wednesday, January 10, 2018 10:56:21 AM
> Subject: Re: Exact purpose of network.ping-timeout
>
> Sorry about the delayed response. Had to dig into the history to answer
> various "why"s.
>
> ----- Original Message -----
> > From: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > To: gluster-users@xxxxxxxxxxx
> > Sent: Tuesday, December 26, 2017 6:41:48 PM
> > Subject: Exact purpose of network.ping-timeout
> >
> > Hi,
> >
> > I have a question regarding the "ping-timeout" option. I have been
> > researching its purpose for a few days and it is not completely clear to
> > me, especially since the Gluster community apparently strongly
> > discourages changing, or at least decreasing, this value!
> >
> > Assuming that I set ping-timeout to 10 seconds (instead of the default 42)
> > this would mean that if I have a network outage of 11 seconds then Gluster
> > internally would have to re-allocate some resources that it freed after the
> > 10 seconds, correct? But apart from that there are no negative
> > implications, are there? For instance if I'm copying files during the
> > network outage then those files will continue copying after those 11
> > seconds.
> >
> > This means that the only purpose of ping-timeout is to save those extra
> > resources that are used by "short" network outages. Is that correct?
>
> The basic purpose of the ping-timer/heartbeat is to identify an unresponsive
> brick. Unresponsiveness can have various causes, such as:
> * A deadlocked server. We no longer see too many instances of deadlocked
>   bricks/servers.
> * Slow execution of fops in the brick stack. For example,
>   - due to lock contention. There have been some efforts to fix the lock
>     contention on the brick stack.
>   - bad backend OS/filesystem. Posix health checker was an effort to fix
>     this.
>   - Not enough threads for execution etc.
>   Note that ideally it's not the job of the ping framework to identify this
>   scenario and, following the same thought process, we've shielded the
>   processing of ping requests on bricks from the costs of execution of
>   requests to Glusterfs Program.
>
> * Ungraceful shutdown of network connections. For example,
>   - hard shutdown of the machine/container/VM running the brick
>   - physically pulling out the network cable
>   Basically all those different scenarios where TCP/IP doesn't get a chance
>   to inform the other end that it is going down. Note that some of the
>   scenarios of ungraceful network shutdown can be identified using
>   TCP_KEEPALIVE and TCP_USER_TIMEOUT [1]. However, at the time when the
>   heartbeat mechanism was introduced in Glusterfs, TCP_KEEPALIVE couldn't
>   identify all the ungraceful network shutdown scenarios and
>   TCP_USER_TIMEOUT was yet to be implemented in the Linux kernel. One
>   scenario which TCP_KEEPALIVE could
s/could/couldn't/
>   identify was the exact scenario TCP_USER_TIMEOUT aims to solve -
>   identifying a hard network shutdown when data is in transit. However
>   there might be other limitations in TCP_KEEPALIVE which we need to test
>   out before retiring the heartbeat mechanism in favor of TCP_KEEPALIVE and
>   TCP_USER_TIMEOUT.
>
> The next interesting question would be why we need to identify an
> unresponsive brick. Various reasons why we need to do that would be:
> * To replace/fix any problems the brick might have
> * Almost all of the cluster translators - DHT, AFR, EC - wait for a response
>   from all of their children - either success or failure - before sending
>   the response back to the application. This means one or more
>   slow/unresponsive bricks can increase the latencies of fops/syscalls even
>   though the other bricks are responsive and healthy. However there are
>   ongoing efforts to minimize the effect of a few slow/unresponsive bricks
>   [2]. I think the principles of [2] can be applied to DHT and AFR too.
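For readers unfamiliar with the two socket options discussed above, here is
a minimal Python sketch of enabling them on a client socket. It is only an
illustration, not how Glusterfs configures its own transport: the function
name enable_dead_peer_detection and all timeout values are made up for this
example. Note that the keepalive switch itself is spelled SO_KEEPALIVE at
the socket level, and that TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT and
TCP_USER_TIMEOUT are the Linux names, which may be missing on other
platforms, hence the hasattr checks.

```python
import socket


def enable_dead_peer_detection(sock, idle_s=10, interval_s=5, probes=3,
                               user_timeout_ms=20000):
    """Enable TCP keepalive and, where available, TCP_USER_TIMEOUT.

    All default values are illustrative assumptions, not Gluster settings.
    """
    # Keepalive probes let the kernel notice a dead peer on an idle
    # connection, i.e. when no application data is being exchanged.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket keepalive tuning is platform specific, so each option is
    # applied only if this Python build exposes it.
    for name, value in (("TCP_KEEPIDLE", idle_s),      # idle time before probing
                        ("TCP_KEEPINTVL", interval_s),  # gap between probes
                        ("TCP_KEEPCNT", probes)):       # failed probes allowed
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

    # TCP_USER_TIMEOUT (milliseconds) bounds how long transmitted data may
    # remain unacknowledged before the kernel drops the connection. This is
    # the "data in transit" case described in the mail above.
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                        user_timeout_ms)
    return sock
```

A caller would apply this to a freshly created socket before connecting,
for example enable_dead_peer_detection(socket.socket(socket.AF_INET,
socket.SOCK_STREAM)). Whether these options cover every case the heartbeat
handles today is exactly the open question raised above.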
>
> Some recent discussions on the necessity of the ping framework in glusterfs
> can be found at [3].
>
> Having given all the above reasons for the existence of the ping framework,
> it's also important that the ping framework doesn't bring down an otherwise
> healthy connection (false positives). Reasons are:
> * As pointed out by Joe Julian in another mail on this thread, each
>   connection carries some state on bricks, like locks/open-fds, which is
>   cleaned up on a disconnect. So, disconnects (even those followed by quick
>   reconnects) are not completely transparent to the application. Though the
>   presence of HA layers like EC/AFR mitigates this problem to some extent,
>   we still don't have a lock healing implementation in place. So, once a
>   quorum number of AFR/EC children go down (though maybe not all at once),
>   locks are no longer held on bricks.
> * All the fops that are in transit in the time window starting from the time
>   of disconnect till a successful reconnect are failed by the rpc/transport
>   layer. So, based on the configuration of volumes (whether AFR/EC/DHT
>   prevent these errors from being seen by the application), this *may*
>   result in the application seeing the error.
>
> IOW, disconnects are not lightweight and we need to avoid them whenever
> possible. Since the action on ping-timer expiry is to disconnect the
> connection, we suggest not setting very low values, to avoid spurious
> disconnections.
>
> [1] http://man7.org/linux/man-pages/man7/tcp.7.html
> [2] https://github.com/gluster/glusterfs/issues/366
> [3] http://lists.gluster.org/pipermail/gluster-devel/2017-January/051938.html
>
> >
> > If I am confident that my network will not have many 11 second outages and
> > if they do occur I am willing to incur those extra costs due to resource
> > allocation, is there any reason not to set ping-timeout to 10 seconds?
> >
> > The problem I have with a long ping-timeout is that the Windows Samba
> > Client disconnects after 25 seconds.
> > So if one of the nodes of a Gluster cluster shuts down ungracefully then
> > the Samba Client disconnects and the file that was being copied is
> > incomplete on the server. These "costs" seem to be much higher than the
> > potential costs of those Gluster resource re-allocations. But it is hard
> > to estimate because there is no clear documentation of what exactly those
> > Gluster costs are.
> >
> > In general I would be very interested in a comprehensive explanation of
> > ping-timeout and the up- and downsides of setting high or low values for
> > it.
> >
> > Kind regards,
> > Omar
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users@xxxxxxxxxxx
> > http://lists.gluster.org/mailman/listinfo/gluster-users