Cluster downtime due to unsynchronized clocks

Mark Schouten <mark@xxxxxxxx> · Thu, 23 Sep 2021 09:49:44 +0200

Hi,

Last night we had downtime on a simple three-node cluster. Here’s
what happened:
2021-09-23 00:18:48.331528 mon.node01 (mon.0) 834384 : cluster [WRN] message from mon.2 was stamped 8.401927s in the future, clocks not synchronized
2021-09-23 00:18:57.783437 mon.node01 (mon.0) 834386 : cluster [WRN] 1 clock skew 8.40163s > max 0.05s
2021-09-23 00:18:57.783486 mon.node01 (mon.0) 834387 : cluster [WRN] 2 clock skew 8.40146s > max 0.05s
2021-09-23 00:18:59.843444 mon.node01 (mon.0) 834388 : cluster [WRN] Health check failed: clock skew detected on mon.node02, mon.node03 (MON_CLOCK_SKEW)
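
To put the numbers in these warnings in perspective, here is a minimal
Python sketch that pulls the skew and the limit out of such a line. It
assumes the warning text keeps the "clock skew Xs > max Ys" shape seen
in the log above; the function name parse_skew is made up for
illustration and is not part of Ceph.

```python
import re

# Matches the "clock skew <seconds>s > max <seconds>s" part of the warning.
SKEW_RE = re.compile(r"clock skew ([0-9.]+)s > max ([0-9.]+)s")

def parse_skew(line):
    """Return (skew, limit) in seconds, or None if the line has no skew report."""
    m = SKEW_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

# One of the warnings from the log above, without the timestamp prefix.
line = "1 clock skew 8.40163s > max 0.05s"
skew, limit = parse_skew(line)
print(f"skew is {skew / limit:.0f}x the allowed maximum")
```

For the line above this reports a skew of roughly 168 times the 0.05s
limit.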

The cause of this time shift is the terrible way that systemd-timesyncd
works: it depends on a single NTP server. If that server goes haywire,
systemd-timesyncd does not check with others, but just sets the clock on
your machine incorrectly. We will fix this by switching to chrony.
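
To illustrate why several time sources help, here is a minimal Python
sketch. It is not chrony's actual source selection algorithm, only a
simplified stand-in that takes the median of the offsets reported by
the sources. The function name pick_offset and the sample offsets are
made up for illustration.

```python
from statistics import median

def pick_offset(offsets):
    """Return the median of the reported clock offsets (in seconds).

    With one source there is nothing to compare against, so a bad
    source wins. With several sources, a single outlier cannot move
    the median far.
    """
    if not offsets:
        raise ValueError("need at least one time source")
    return median(offsets)

# Hypothetical offsets; the first source is off by about 8.4 seconds.
single = pick_offset([8.4])
several = pick_offset([8.4, 0.002, -0.001, 0.003])

print(f"single source:   {single:.4f}s")
print(f"several sources: {several:.4f}s")
```

With only the faulty source the chosen offset is the full 8.4s; with
three sane sources next to it the result stays within a few
milliseconds.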

However, what I don’t understand is why the cluster does not see the
single monitor as incorrect, but instead flags the two correct machines
as incorrect. Is this because one of the three is master-ish?

Obviously we will fix the time issues, but I would like to understand
why Ceph stops functioning when one monitor has an incorrect time.

Thanks!

--
Mark Schouten
CTO, Tuxis B.V. | https://www.tuxis.nl/
<mark@xxxxxxxx> | +31 318 200208