+gluster-devel

----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Wednesday, January 10, 2018 11:47:31 AM
> Subject: Re: Exact purpose of network.ping-timeout
>
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > To: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > Cc: gluster-users@xxxxxxxxxxx
> > Sent: Wednesday, January 10, 2018 10:56:21 AM
> > Subject: Re: Exact purpose of network.ping-timeout
> >
> > Sorry about the delayed response. I had to dig into the history to
> > answer the various "why"s.
> >
> > ----- Original Message -----
> > > From: "Omar Kohl" <omar.kohl@xxxxxxxxxxxx>
> > > To: gluster-users@xxxxxxxxxxx
> > > Sent: Tuesday, December 26, 2017 6:41:48 PM
> > > Subject: Exact purpose of network.ping-timeout
> > >
> > > Hi,
> > >
> > > I have a question regarding the "ping-timeout" option. I have been
> > > researching its purpose for a few days and it is not completely clear
> > > to me, especially since the Gluster community apparently strongly
> > > discourages changing, or at least decreasing, this value!
> > >
> > > Assuming that I set ping-timeout to 10 seconds (instead of the default
> > > 42), this would mean that if I have a network outage of 11 seconds
> > > then Gluster internally would have to re-allocate some resources that
> > > it freed after the 10 seconds, correct? But apart from that there are
> > > no negative implications, are there? For instance, if I'm copying
> > > files during the network outage then those files will continue copying
> > > after those 11 seconds.
> > >
> > > This means that the only purpose of ping-timeout is to save those
> > > extra resources that are used up by "short" network outages. Is that
> > > correct?
> >
> > The basic purpose of the ping timer/heartbeat is to identify an
> > unresponsive brick. Unresponsiveness can have various causes, such as:
> >
> > * A deadlocked server. We no longer see too many instances of
> >   deadlocked bricks/servers.
> >
> > * Slow execution of fops in the brick stack, e.g.,
> >   - due to lock contention. There have been some efforts to fix lock
> >     contention in the brick stack.
> >   - a bad backend OS/filesystem. The posix health checker was an effort
> >     to address this.
> >   - not enough threads for execution, etc.
> >   Note that ideally it's not the job of the ping framework to identify
> >   this scenario, and following the same thought process we've shielded
> >   the processing of ping requests on bricks from the cost of executing
> >   requests to the Glusterfs program.
> >
> > * Ungraceful shutdown of network connections, e.g.,
> >   - hard shutdown of the machine/container/VM running the brick
> >   - physically pulling out the network cable
> >   Basically all those scenarios where TCP/IP doesn't get a chance to
> >   inform the other end that it is going down. Note that some of these
> >   ungraceful network shutdown scenarios can be identified using
> >   TCP_KEEPALIVE and TCP_USER_TIMEOUT [1]. However, at the time the
> >   heartbeat mechanism was introduced in Glusterfs, TCP_KEEPALIVE
> >   couldn't identify all the ungraceful network shutdown scenarios and
> >   TCP_USER_TIMEOUT was yet to be implemented in the Linux kernel. One
> >   scenario which TCP_KEEPALIVE could
>
> s/could/couldn't/
>
> >   identify was the exact scenario TCP_USER_TIMEOUT aims to solve -
> >   identifying a hard network shutdown while data is in transit.
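(For illustration, here is a minimal sketch - written for this mail, not
taken from the Glusterfs sources - of how a Linux program could arm both of
those kernel mechanisms on an already-connected TCP socket. The helper name
and the timeout values are made up for the example; keepalive probes only
fire on an otherwise idle connection, while TCP_USER_TIMEOUT bounds how
long unacknowledged in-flight data may stay outstanding before the kernel
declares the peer dead.)

/* Hypothetical helper, not part of the Glusterfs source tree. */
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_USER_TIMEOUT (Linux-specific) */
#include <sys/socket.h>

static int enable_dead_peer_detection(int fd)
{
    int on = 1;
    int idle = 10;                 /* seconds of idle time before the first probe */
    int interval = 5;              /* seconds between probes */
    int count = 3;                 /* failed probes before the peer is declared dead */
    unsigned int utimeout = 30000; /* ms that unacked data may stay outstanding */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) ||
        /* TCP_USER_TIMEOUT needs Linux >= 2.6.37 and headers that define it */
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &utimeout, sizeof(utimeout)))
        return -1;                 /* errno describes the failing option */

    return 0;
}

(With values like these, an idle connection to a silently dead peer would
be torn down after roughly idle + interval * count seconds, and a
connection with data stuck in flight after utimeout milliseconds.)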
> > However, there might be other limitations of TCP_KEEPALIVE which we
> > need to test before retiring the heartbeat mechanism in favor of
> > TCP_KEEPALIVE and TCP_USER_TIMEOUT.
> >
> > The next interesting question is why we need to identify an
> > unresponsive brick at all. The reasons include:
> >
> > * To replace/fix whatever problems the brick might have.
> >
> > * Almost all of the cluster translators - DHT, AFR, EC - wait for a
> >   response from all of their children - either success or failure -
> >   before sending the response back to the application. This means that
> >   one or more slow/unresponsive bricks can increase the latency of
> >   fops/syscalls even though the other bricks are responsive and
> >   healthy. However, there are ongoing efforts to minimize the effect of
> >   a few slow/unresponsive bricks [2]. I think the principles of [2] can
> >   be applied to DHT and AFR too.
> >
> > Some recent discussion on the necessity of the ping framework in
> > glusterfs can be found at [3].
> >
> > Having given all the above reasons for the existence of the ping
> > framework, it's also important that the ping framework doesn't bring
> > down an otherwise healthy connection (false positives). The reasons
> > are:
> >
> > * As pointed out by Joe Julian in another mail on this thread, each
> >   connection carries some state on the bricks, like locks and open fds,
> >   which is cleaned up on a disconnect. So disconnects (even those
> >   followed by quick reconnects) are not completely transparent to the
> >   application. Though the presence of HA layers like EC/AFR mitigates
> >   this problem to some extent, we still don't have a lock-healing
> >   implementation in place. So, once a quorum number of AFR/EC children
> >   have gone down (though perhaps not all at once), locks are no longer
> >   held on the bricks.
> >
> > * All the fops in transit in the window between the disconnect and a
> >   successful reconnect are failed by the rpc/transport layer. So,
> >   depending on the volume configuration (whether AFR/EC/DHT shield the
> >   application from these errors), this *may* result in the application
> >   seeing an error.
> >
> > IOW, disconnects are not lightweight and we need to avoid them whenever
> > possible. Since the action on ping-timer expiry is to disconnect the
> > connection, we suggest not using very low values, to avoid spurious
> > disconnections.

I forgot to touch upon why we disconnect the transport on ping-timer
expiry. To answer this we need to go back to the need to identify
unresponsive bricks. One requirement was to fail the ongoing operations on
the unresponsive brick, so that the syscall from the application completes.
Disconnecting the connection/transport is a clean way of doing this, as the
ongoing fops on a connection are maintained in its state, and on
disconnection these stored fops are failed. Also, once disconnected, the
brick can't submit a response to a fop which the client has already failed,
since the connection is broken, and hence the problem of duplicate
responses doesn't arise.

> > [1] http://man7.org/linux/man-pages/man7/tcp.7.html
> > [2] https://github.com/gluster/glusterfs/issues/366
> > [3] http://lists.gluster.org/pipermail/gluster-devel/2017-January/051938.html
> >
> > > If I am confident that my network will not have many 11-second
> > > outages, and if they do occur I am willing to incur those extra costs
> > > due to resource allocation, is there any reason not to set
> > > ping-timeout to 10 seconds?
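(Side note for anyone who wants to experiment: network.ping-timeout is a
per-volume option set through the gluster CLI. A sketch, assuming a
reasonably recent release - "volume get" is not available on very old
ones - and with <volname> as a placeholder:

    # show the value currently in effect
    gluster volume get <volname> network.ping-timeout

    # lower it to 10 seconds, keeping the caveats discussed above in mind
    gluster volume set <volname> network.ping-timeout 10

Whether a low value such as 10 seconds is safe depends on how often your
network really has multi-second hiccups.)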
> > > The problem I have with a long ping-timeout is that the Windows Samba
> > > client disconnects after 25 seconds. So if one of the nodes of a
> > > Gluster cluster shuts down ungracefully, the Samba client disconnects
> > > and the file that was being copied is left incomplete on the server.
> > > These "costs" seem much higher than the potential costs of those
> > > Gluster resource re-allocations. But it is hard to estimate, because
> > > there is no clear documentation of what exactly those Gluster costs
> > > are.
> > >
> > > In general I would be very interested in a comprehensive explanation
> > > of ping-timeout and the upsides and downsides of setting high or low
> > > values for it.
> > >
> > > Kind regards,
> > > Omar

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users