Re: Exact purpose of network.ping-timeout

Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> · Wed, 10 Jan 2018 01:17:31 -0500 (EST)



----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Wednesday, January 10, 2018 10:56:21 AM
> Subject: Re:  Exact purpose of network.ping-timeout
> 
> Sorry about the delayed response. Had to dig into the history to answer
> various "why"s.
> 
> ----- Original Message -----
> > From: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > To: gluster-users@xxxxxxxxxxx
> > Sent: Tuesday, December 26, 2017 6:41:48 PM
> > Subject:  Exact purpose of network.ping-timeout
> > 
> > Hi,
> > 
> > I have a question regarding the "ping-timeout" option. I have been
> > researching its purpose for a few days and it is not completely clear to
> > me.
> > Especially that it is apparently strongly encouraged by the Gluster
> > community not to change or at least decrease this value!
> > 
> > Assuming that I set ping-timeout to 10 seconds (instead of the default 42)
> > this would mean that if I have a network outage of 11 seconds then Gluster
> > internally would have to re-allocate some resources that it freed after the
> > 10 seconds, correct? But apart from that there are no negative
> > implications,
> > are there? For instance if I'm copying files during the network outage then
> > those files will continue copying after those 11 seconds.
> > 
> > This means that the only purpose of ping-timeout is to save those extra
> > resources that are used by "short" network outages. Is that correct?
> 
> Basic purpose of ping-timer/heartbeat is to identify an unresponsive brick.
> Unresponsiveness can be caused due to various reasons like:
> * A deadlocked server. We no longer see too many instances of deadlocked
> bricks/server
> * Slow execution of fops in brick stack. For eg.,
>     - due to lock contention. There have been some efforts to fix the lock
>     contention on brick stack.
>     - bad backend OS/filesystem. Posix health checker was an effort to fix
>     this.
>     - Not enough threads for execution etc
>   Note that ideally its not the job of ping framework to identify this
>   scenario and following the same thought process we've shielded the
>   processing of ping requests on bricks from the costs of execution of
>   requests to Glusterfs Program.
> 
> * Ungraceful shutdown of network connections. For eg.,
>     - hard shutdown of machine/container/VM running the brick
>     - physically pulling out the network cable
>   Basically all those different scenarios where TCP/IP doesn't get a chance
>   to inform the other end that it is going down. Note that some of the
>   scenarios of ungraceful network shutdown can be identified using
>   TCP_KEEPALIVE and TCP_USERTIMEOUT [1]. However, at the time when heartbeat
>   mechanism was introduced in Glusterfs, TCP_KEEPALIVE couldn't identify all
>   the ungraceful network shutdown scenarios and TCP_USER_TIMEOUT was yet to
>   be implemented in Linux kernel. One scenario which TCP_KEEPALIVE could

s/could/couldn't/

>   identify was the exact scenario TCP_USER_TIMEOUT aims to solve -
>   identifying an hard network shutdown when data is in transit. However
>   there might be other limitations in TCP_KEEPALIVE which we need to test
>   out before retiring heart beat mechanism in favor of TCP_KEEPALIVE and
>   TCP_USER_TIMEOUT.
> 
> The next interesting question would be why we need to identify an
> unresponsive brick. Various reasons why we need to do that would be:
> * To replace/fix any problems the brick might have
> * Almost all of the cluster translators - DHT, AFR, EC - wait for a response
> from all of their children - either successful or failure - before sending
> the response back to application. This means one or more slow/unresponsive
> brick can increase the latencies of fops/syscalls even though other bricks
> are responsive and healthy. However there are ongoing efforts to minimize
> the effect of few slow/unresponsive bricks [2]. I think principles of [2]
> can applied to DHT and AFR too.
> 
> Some recent discussions on the necessity of ping framework in glusterfs can
> be found at [3].
> 
> Having given all the above reasons for the existence of ping framework, its
> also important that ping-framework shouldn't bring down an otherwise healthy
> connection (False positives). Reasons are:
> * As pointed out by Joe Julian in another mail on this thread, each
> connection carries some state on bricks like locks/open-fds which is cleaned
> up on a disconnect. So, disconnects (even those followed by quick
> reconnects) are not completely transient to application. Though presence of
> HA layers like EC/AFR mitigates this problem to some extent, we still don't
> have a lock healing implementation in place. So, once Quorum number of
> AFR/EC children go down (though may not be all at once), locks are no longer
> held on bricks.
> * All the fops that are in transit in the time window starting from the time
> of disconnect till a successful reconnect are failed by rpc/transport layer.
> So, based on the configuration of volumes (whether AFR/EC/DHT prevent these
> errors from being seen by application), this *may* result in application
> seeing the error.
> 
> IOW, disconnects are not lightweight and we need to avoid them whenever
> possible. Since the action on ping-timer expiry is to disconnect the
> connection, we suggest not have very low values to avoid spurious
> disconnections.
> 
> [1] http://man7.org/linux/man-pages/man7/tcp.7.html
> [2] https://github.com/gluster/glusterfs/issues/366
> [3] http://lists.gluster.org/pipermail/gluster-devel/2017-January/051938.html
> 
> > 
> > If I am confident that my network will not have many 11 second outages and
> > if
> > they do occur I am willing to incur those extra costs due to resource
> > allocation is there any reason not to set ping-timeout to 10 seconds?
> > 
> > The problem I have with a long ping-timeout is that the Windows Samba
> > Client
> > disconnects after 25 seconds. So if one of the nodes of a Gluster cluster
> > shuts down ungracefully then the Samba Client disconnects and the file that
> > was being copied is incomplete on the server. These "costs" seem to be much
> > higher than the potential costs of those Gluster resource re-allocations.
> > But it is hard to estimate because there is not clear documentation what
> > exactly those Gluster costs are.
> > 
> > In general I would be very interested in a comprehensive explanation of
> > ping-timeout and the up- and downsides of setting high or low values for
> > it.
> > 
> > Kinds regards,
> > Omar
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users@xxxxxxxxxxx
> > http://lists.gluster.org/mailman/listinfo/gluster-users
> > 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-users
> 
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users