Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

Wido den Hollander <wido@xxxxxxxx> · Fri, 11 Dec 2020 15:10:31 +0100

On 11/12/2020 00:12, David Orman wrote:
Hi Janek,

We realize this; we referenced that issue in our initial email. We do want
the metrics exposed by Ceph internally, and would prefer to work towards a
fix upstream. We appreciate the suggestion for a workaround, however!

Again, we're happy to provide whatever information we can that would be of
assistance. If there's some debug setting that is preferred, we are happy
to implement it, as this is currently a test cluster for us to work through
issues such as this one.

Have you tried disabling Prometheus just to see if this also fixes the 
issue for you?

Wido

David

On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:

Do you have the prometheus module enabled? Turn that off; it's causing
issues. I replaced it with another Ceph exporter from GitHub and almost
forgot about it.

Here's the relevant issue report:
https://tracker.ceph.com/issues/39264#change-179946

On 10/12/2020 16:43, Welby McRoberts wrote:
Hi Folks

We've noticed that in a cluster of 21 nodes (5 mgrs and mons, and 504 OSDs
at 24 per node) the mgrs are dropping out of the cluster after a
non-specific period of time. The logs show only the following:

debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)

The _check_auth_rotating message repeats approximately every second. The
instances are all syncing their time with NTP and have no issues on that
front. A restart of the mgr fixes the issue.
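
For anyone comparing these entries across mgrs, here is a minimal Python
sketch that extracts the log timestamp and the "before" cutoff from a
_check_auth_rotating entry and reports the gap between them in seconds. The
regular expression and the helper names (parse_ts, rotating_key_gap) are
assumptions based only on the log format shown above; they are not part of
Ceph.

```python
import re
from datetime import datetime

# Hypothetical pattern, derived only from the log entries quoted above.
PATTERN = re.compile(
    r"debug (?P<logged>\S+) \S+ +-?\d+ monclient: _check_auth_rotating .*"
    r"\(before (?P<cutoff>[^)]+)\)"
)


def parse_ts(text):
    # Timestamps look like 2020-12-10T03:20:59.223+0000
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def rotating_key_gap(line):
    """Return seconds between the log timestamp and the 'before' cutoff.

    Returns None when the line is not a _check_auth_rotating entry.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    delta = parse_ts(match["logged"]) - parse_ts(match["cutoff"])
    return delta.total_seconds()


if __name__ == "__main__":
    sample = (
        "debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: "
        "_check_auth_rotating possible clock skew, rotating keys expired "
        "way too early (before 2020-12-10T02:20:59.226159+0000)"
    )
    print(rotating_key_gap(sample))
```

For the first monclient entry above, the gap comes out just under 3600
seconds.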

It appears that this may be related to https://tracker.ceph.com/issues/39264.
The suggestion seems to be to disable Prometheus metrics; however, this
obviously isn't realistic for a production environment where metrics are
critical for operations.

Please let us know what additional information we can provide to assist in
resolving this critical issue.

Cheers
Welby
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx