Hi,

NTP has a step threshold: if the time difference is greater than the threshold, it steps the clock rather than slewing it (gradually speeding it up or slowing it down). So even using ntpd can cause clock steps (especially in our test environment, where our crappy overloaded NTP servers sometimes lose 30 seconds).

On some VMware test hosts I did manage to make the cluster fence some nodes by changing the time backwards and forwards, but I could not reproduce the effect on physical hosts. I was hoping that the fencing was caused by a combination of clock changes and VM guest timing flakiness, but from your description it sounds like this might be a real risk on physical servers too. I had better do some more testing.

Thanks for the input.

regards,
Martin

> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx]
> On Behalf Of Kaloyan Kovachev
> Sent: 16 July 2010 15:36
> To: linux clustering
> Subject: Re: NTP time steps causes cluster reconfiguration
>
> Hi,
> I can confirm that time steps do cause reconfiguration. Not sure if this
> was the reason, but one of my nodes was fenced from time to time
> (previously) after several reconfigurations, and it also caused some
> problems with GFS being withdrawn.
> ntpdate running as a cron job does make step changes, but ntpd should not:
> it should instead speed up or slow down the clock until it is synchronized.
> However, with the -g option you may ask for the clock to jump once at
> ntpd start-up.
> I have configured all cluster nodes to synchronize from each other via
> ntpd (configured as peers), and each from one (different) additional
> (stratum 1 or 2) source as a server. Since then I don't see reconfigurations
> in the logs.
>
> On Fri, 16 Jul 2010 14:18:22 +0100, "Martin Waite"
> <Martin.Waite@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > During testing, I noticed that a time step caused by ntpd caused the
> > cluster to drop into GATHER state:
> >
> > Jun 16 12:13:16 cp1edidbm001 ntpd[30917]: time reset -16.332117 s
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering GATHER state from 12.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Creating commit token because I am the rep.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Saving state aru 9e high seq received 9e
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Storing new sequence id for ring 328
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering COMMIT state.
> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering RECOVERY state.
> > ...
> >
> > This is easily repeatable by setting the clock forwards by 20 seconds
> > using /bin/date. This probably causes comms timeouts to expire
> > prematurely, and almost every time it causes the cluster to reconfigure -
> > luckily without affecting running services.
> >
> > Stepping the clock backwards also causes a similar disruption, but there
> > is a long lag between changing the time and the cluster reconfiguring:
> > perhaps this extends a timeout or sleep on the affected node, causing
> > genuine timeouts on the other nodes.
> >
> > All I am looking for is some reassurance that clock changes are not
> > going to crash the cluster. Is anyone able to confirm this please?
> > regards,
> > Martin

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
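
For anyone wanting to try this, a minimal /etc/ntp.conf sketch combining the two ideas in the thread - slew-only operation via the step threshold, and peering the cluster nodes with one upstream source each - might look like the following. The hostnames (node2.example.com, node3.example.com, upstream-ntp.example.com) are placeholders rather than names from the thread, and this is only a sketch, not either poster's actual configuration:

    # /etc/ntp.conf (sketch) - same file on every node, changing only the
    # upstream server and the peer list so a node does not peer with itself.
    driftfile /var/lib/ntp/drift

    # Never step the clock: a step threshold of 0 makes ntpd slew only,
    # and a panic threshold of 0 stops ntpd exiting on very large offsets.
    tinker step 0
    tinker panic 0

    # One upstream (stratum 1 or 2) source, different on each node.
    server upstream-ntp.example.com iburst

    # The other cluster nodes as peers, so they stay in step with each other
    # even if the upstream sources disagree or become unreachable.
    peer node2.example.com iburst
    peer node3.example.com iburst

The usual caveat with disabling steps is that ntpd then corrects errors only by slewing, at a maximum rate of 500 ppm, so clearing a 30-second error takes many hours; whether that is acceptable depends on how quickly the test servers drift.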