mgrs stop responding, dropping out of cluster with _check_auth_rotating

Welby McRoberts <w-ceph-users@xxxxxxxxx> · Thu, 10 Dec 2020 15:43:56 +0000

Hi Folks

We've noticed that in a cluster of 21 nodes (5 mgrs and mons, plus 504 OSDs
at 24 per node), the mgrs drop out of the cluster after an unpredictable
period of time. The logs show only the following:

debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)

The _check_auth_rotating message repeats approximately every second. All
instances sync their time via NTP and show no clock problems. Restarting
the mgr fixes the issue.
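
For anyone comparing their own mgr logs against the entries above, here is a
minimal Python sketch that pulls the two timestamps out of each
_check_auth_rotating line and reports how far the "before" cutoff sits behind
the log time, plus the spacing between consecutive warnings. The regular
expression and the helper names (parse_ts, warning_gaps) are my own, written
against the two lines quoted here; they are not part of any Ceph tooling, and
the log format may differ on other releases.

```python
import re
from datetime import datetime

# The two warning entries quoted in this post, rejoined onto single lines.
LOG_LINES = [
    "debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way "
    "too early (before 2020-12-10T02:20:59.226159+0000)",
    "debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way "
    "too early (before 2020-12-10T02:21:00.226310+0000)",
]

# Assumed layout: "debug <timestamp> <thread id> <level> monclient: ...".
PATTERN = re.compile(
    r"^debug (?P<logged>\S+) \S+\s+-?\d+\s+monclient: _check_auth_rotating .*"
    r"\(before (?P<cutoff>[^)]+)\)"
)


def parse_ts(text):
    """Parse a timestamp such as 2020-12-10T03:20:59.223+0000."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def warning_gaps(lines):
    """Return (logged, cutoff, gap_seconds) for each matching warning line."""
    results = []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # not a _check_auth_rotating warning
        logged = parse_ts(match.group("logged"))
        cutoff = parse_ts(match.group("cutoff"))
        results.append((logged, cutoff, (logged - cutoff).total_seconds()))
    return results


gaps = warning_gaps(LOG_LINES)
for logged, cutoff, seconds in gaps:
    print(f"{logged.isoformat()}  cutoff is {seconds:.6f} s behind log time")

# Spacing between consecutive warnings, in seconds.
intervals = [
    (later[0] - earlier[0]).total_seconds()
    for earlier, later in zip(gaps, gaps[1:])
]
print("seconds between warnings:", intervals)
```

On the two entries quoted here, the cutoff is a few milliseconds short of one
hour behind the log timestamp, and the warnings are one second apart. Whether
that one-hour figure corresponds to a configured key or ticket lifetime is an
assumption on my part and not something these logs confirm.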

This may be related to https://tracker.ceph.com/issues/39264.
The suggestion there seems to be to disable Prometheus metrics, but that
isn't realistic for a production environment where metrics are critical
for operations.

Please let us know what additional information we can provide to assist in
resolving this critical issue.

Cheers
Welby