Re: Ceph not warning about clock skew on an OSD-only host?

My understanding is that the existing mon_clock_drift_allowed value (default 50 ms) exists so that Paxos among the mon quorum can function.  So OSDs (and mgrs, clients, etc.) are out of scope of that existing code.
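
For reference, one way to inspect that setting on a live cluster (assuming a Mimic-or-later release with the centralized config store; the mon ID in the second command is a placeholder, and that command has to run on the mon host itself):

    # show the allowed mon-to-mon clock drift (default 0.05 s = 50 ms)
    ceph config get mon mon_clock_drift_allowed

    # on older releases, query a mon's admin socket on the mon host instead
    ceph daemon mon.<id> config show | grep clock_drift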

Things like this are why I like to ensure that the OS does `ntpdate -b` or equivalent at boot time, *before* starting ntpd / chrony (and any other daemons).  A sketch of one way to arrange that follows.
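
A minimal sketch using systemd and chrony (the drop-in path and file name are hypothetical, and systemd-time-wait-sync needs a reasonably recent systemd):

    # in /etc/chrony/chrony.conf: step the clock at boot rather than
    # slewing a large offset (step if off by >1 s, first 3 updates only)
    makestep 1.0 3

    # make time-sync.target actually wait for a synchronized clock
    systemctl enable systemd-time-wait-sync.service

    # hypothetical drop-in, e.g.
    # /etc/systemd/system/ceph-osd@.service.d/wait-for-clock.conf,
    # so OSDs start only once the clock is synced
    [Unit]
    After=time-sync.target
    Wants=time-sync.target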

Now, as to why Ceph doesn't have analogous code to complain about other daemons / clients - I've wondered that for some time myself.  Perhaps the idea is that one's monitoring infrastructure should detect that, but that's a guess.

> Yesterday, one of our OSD-only hosts came up with its clock about 8 hours wrong(!), having been out of the cluster for a week or so. Initially, ceph seemed entirely happy, and then after an hour or so it all went south (OSDs started logging about bad authenticators, I/O paused, general sadness).
> 
> I know clock sync is important to Ceph, so "one system is 8 hours out, Ceph becomes sad" is not a surprise. It is perhaps a surprise that the OSDs were allowed in at all...
> 
> What _is_ a surprise, though, is that at no point in all this did Ceph raise a peep about clock skew. Normally it's pretty sensitive to this - our test cluster has had clock skew complaints when a mon is only slightly out, and here we had a node 8 hours wrong.
> 
> Is there some oddity like Ceph not warning on clock skew for OSD-only hosts? or an upper bound on how high a discrepancy it will WARN about?
> 
> Regards,
> 
> Matthew
> 
> example output from mid-outage:
> 
> root@sto-3-1:~#  ceph -s
>  cluster:
>    id:     049fc780-8998-45a8-be12-d3b8b6f30e69
>    health: HEALTH_ERR
>            40755436/2702185683 objects misplaced (1.508%)
>            Reduced data availability: 20 pgs inactive, 20 pgs peering
>            Degraded data redundancy: 367431/2702185683 objects degraded (0.014%), 4549 pgs degraded
>            481 slow requests are blocked > 32 sec. Implicated osds 188,284,795,1278,1981,2061,2648,2697
>            644 stuck requests are blocked > 4096 sec. Implicated osds 22,31,33,35,101,116,120,130,132,140,150,159,201,211,228,263,327,541,561,566,585,589,636,643,649,654,743,785,790,806,865,1037,1040,1090,1100,1104,1115,1134,1135,1166,1193,1275,1277,1292,1494,1523,1598,1638,1746,2055,2069,2191,2210,2358,2399,2486,2487,2562,2589,2613,2627,2656,2713,2720,2837,2839,2863,2888,2908,2920,2928,2929,2947,2948,2963,2969,2972
> 