Hi,

NTP has a step threshold: if the time difference is greater than the threshold, it steps the clock rather than slewing it (gradually speeding it up or slowing it down). So even using ntpd can cause clock steps (especially in our test environment, where our crappy overloaded NTP servers sometimes lose 30 seconds).

On some VMware test hosts I did manage to make the cluster fence some nodes by changing the time backwards and forwards, but I could not reproduce the effect on physical hosts. I was hoping that the fencing was caused by a combination of clock changes and VM guest timing flakiness, but from your description it sounds like this might be a real risk on physical servers too. I had better do some more testing.

Thanks for the input.

regards,
Martin

> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx]
> On Behalf Of Kaloyan Kovachev
> Sent: 16 July 2010 15:36
> To: linux clustering
> Subject: Re: NTP time steps causes cluster reconfiguration
>
> Hi,
> I can confirm that time steps do cause reconfiguration. Not sure if this
> was the reason, but one of my nodes was fenced from time to time
> (previously) after several reconfigurations, and it also caused some
> problems with GFS being withdrawn.
> ntpdate running as a cron job does make step changes, but ntpd should not:
> it should instead speed up or slow down the clock until it is synchronized.
> However, with the -g option you may ask for the clock to jump once at
> ntpd start-up.
> I have configured all cluster nodes to synchronize from each other via
> ntpd (configured as peers), and each from one (different) additional
> (stratum 1 or 2) source as a server. Since then I don't see reconfigurations
> in the logs.
>
> On Fri, 16 Jul 2010 14:18:22 +0100, "Martin Waite"
> <Martin.Waite@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > During testing, I noticed that a time step caused by ntpd caused the
> > cluster to drop into GATHER state:
> >
> > Jun 16 12:13:16 cp1edidbm001 ntpd[30917]: time reset -16.332117 s
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering GATHER state from 12.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Creating commit token because I am the rep.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Saving state aru 9e high seq received 9e
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Storing new sequence id for ring 328
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering COMMIT state.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering RECOVERY state.
> > ...
> >
> > This is easily repeatable by setting the clock forwards by 20 seconds
> > using /bin/date. This probably causes comms timeouts to expire
> > prematurely, and almost every time it causes the cluster to reconfigure -
> > luckily without affecting running services.
> >
> > Stepping the clock backwards also causes a similar disruption, but there
> > is a long lag between changing the time and the cluster reconfiguring:
> > perhaps this extends a timeout or sleep on the affected node, causing
> > genuine timeouts on the other nodes.
> >
> > All I am looking for is some reassurance that clock changes are not
> > going to crash the cluster. Is anyone able to confirm this please?
> > regards,
> > Martin

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
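
For anyone wanting to try this, a minimal /etc/ntp.conf sketch combining the two ideas in the thread - slew-only operation via the step threshold, and peering the cluster nodes with one upstream source each - might look like the following. The hostnames (node2.example.com, node3.example.com, upstream-ntp.example.com) are placeholders rather than names from the thread, and this is only a sketch, not either poster's actual configuration:

    # /etc/ntp.conf (sketch) - same file on every node, changing only the
    # upstream server and the peer list so a node does not peer with itself.
    driftfile /var/lib/ntp/drift

    # Never step the clock: a step threshold of 0 makes ntpd slew only,
    # and a panic threshold of 0 stops ntpd exiting on very large offsets.
    tinker step 0
    tinker panic 0

    # One upstream (stratum 1 or 2) source, different on each node.
    server upstream-ntp.example.com iburst

    # The other cluster nodes as peers, so they stay in step with each other
    # even if the upstream sources disagree or become unreachable.
    peer node2.example.com iburst
    peer node3.example.com iburst

The usual caveat with disabling steps is that ntpd then corrects errors only by slewing, at a maximum rate of 500 ppm, so clearing a 30-second error takes many hours; whether that is acceptable depends on how quickly the test servers drift.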