Re: Cluster downtime due to unsynchronized clocks

胡玮文 <huww98@xxxxxxxxxxx> · Thu, 23 Sep 2021 08:14:12 +0000

> On Sep 23, 2021, at 15:50, Mark Schouten <mark@xxxxxxxx> wrote:
>
> Hi,
>
> Last night we had downtime on a simple three-node cluster. Here’s
> what happened:
> 2021-09-23 00:18:48.331528 mon.node01 (mon.0) 834384 : cluster [WRN]
> message from mon.2 was stamped 8.401927s in the future, clocks not
> synchronized
> 2021-09-23 00:18:57.783437 mon.node01 (mon.0) 834386 : cluster [WRN] 1
> clock skew 8.40163s > max 0.05s
> 2021-09-23 00:18:57.783486 mon.node01 (mon.0) 834387 : cluster [WRN] 2
> clock skew 8.40146s > max 0.05s
> 2021-09-23 00:18:59.843444 mon.node01 (mon.0) 834388 : cluster [WRN]
> Health check failed: clock skew detected on mon.node02, mon.node03
> (MON_CLOCK_SKEW)
>
> The cause of this time shift is the terrible way that systemd-timesyncd
> works, depending on a single NTP server. If that one goes haywire,
> systemd-timesyncd does not check with others, but just sets the clock on
> your machine incorrectly. We will fix this with chrony.
>
> However, what I don’t understand is why the cluster does not see
> the single monitor as incorrect, but the two correct machines as
> incorrect. Is this because one of the three is master-ish?

I believe so. “ceph mon stat” will tell you which one is the leader.
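
To illustrate why the two correct nodes get flagged, here is a minimal sketch. It assumes the skew check compares each peer against the leader's clock rather than taking a majority vote. The function name and the offsets are hypothetical; only the 0.05 s limit and the roughly 8.4 s skew come from the log above.

```python
# Sketch only: assumes skew is measured relative to the leader's clock.
MAX_SKEW = 0.05  # seconds, the "max 0.05s" limit shown in the log


def skewed_peers(offsets, leader):
    """Return peers whose clock differs from the leader's by more than MAX_SKEW."""
    ref = offsets[leader]
    return {
        name: round(off - ref, 3)
        for name, off in offsets.items()
        if name != leader and abs(off - ref) > MAX_SKEW
    }


# Hypothetical offsets from true time: node01 is about 8.4 s behind.
offsets = {"node01": -8.4, "node02": 0.0, "node03": 0.0}

print(skewed_peers(offsets, "node01"))  # leader has the bad clock: both peers flagged
print(skewed_peers(offsets, "node02"))  # a correct leader: only node01 flagged
```

With node01 as the leader, node02 and node03 both appear about 8.4 s ahead, which matches the MON_CLOCK_SKEW warning in the log.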

> Obviously we will fix the time issues, but I would like to understand
> why Ceph stops functioning when one monitor has
> incorrect time.
>
> Thanks!
>
> --
> Mark Schouten
> CTO, Tuxis B.V. | https://www.tuxis.nl/
> <mark@xxxxxxxx> <mailto:mark@xxxxxxxx> | +31 318 200208
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx