Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

David Orman <ormandj@xxxxxxxxxxxx> · Thu, 10 Dec 2020 17:12:03 -0600

Hi Janek,

We realize this; we referenced that issue in our initial email. We do want
the metrics that Ceph exposes internally, and would prefer to work towards
a fix upstream. We appreciate the suggestion for a workaround, however!

Again, we're happy to provide whatever information we can that would be of
assistance. If there's some debug setting that is preferred, we are happy
to implement it, as this is currently a test cluster for us to work through
issues such as this one.
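
In case it is useful while we collect more data, here is a minimal Python
sketch for measuring how far the "before" cutoff in each
_check_auth_rotating message sits from the time the message was logged. It
assumes each log entry has been rejoined onto a single line (the copies
quoted below are wrapped by the mail client), and the names
rotating_key_gaps, PATTERN and sample are made up for illustration; nothing
here is part of Ceph.

```python
import re
from datetime import datetime

# Timestamp shape used in the mgr log lines, e.g. 2020-12-10T03:20:59.223+0000
TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}"
PATTERN = re.compile(
    rf"debug (?P<logged>{TS}) .*?_check_auth_rotating.*?\(before (?P<cutoff>{TS})\)"
)
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def rotating_key_gaps(log_text):
    """Return (logged_time, gap_seconds) for each _check_auth_rotating entry.

    gap_seconds is the logged time minus the 'before' cutoff in the message.
    Assumes one log entry per line (hypothetical helper, not a Ceph API).
    """
    results = []
    for match in PATTERN.finditer(log_text):
        logged = datetime.strptime(match["logged"], FMT)
        cutoff = datetime.strptime(match["cutoff"], FMT)
        results.append((logged, (logged - cutoff).total_seconds()))
    return results


# The two entries from the original report, rejoined onto single lines.
sample = (
    "debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way too "
    "early (before 2020-12-10T02:20:59.226159+0000)\n"
    "debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: "
    "_check_auth_rotating possible clock skew, rotating keys expired way too "
    "early (before 2020-12-10T02:21:00.226310+0000)\n"
)

for logged, gap in rotating_key_gaps(sample):
    print(logged.isoformat(), f"{gap:.3f}s")
```

For the two entries quoted below, the gap comes out just under one hour in
both cases. We have not drawn any conclusion from that, but it may be worth
checking whether the value stays constant across mgrs and over time.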

David

On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <
janek.bevendorff@xxxxxxxxxxxxx> wrote:

> Do you have the prometheus module enabled? Turn that off; it's causing
> issues. I replaced it with another Ceph exporter from GitHub and almost
> forgot about it.
>
> Here's the relevant issue report:
> https://tracker.ceph.com/issues/39264#change-179946
>
> On 10/12/2020 16:43, Welby McRoberts wrote:
> > Hi Folks
> >
> > We've noticed that in a cluster of 21 nodes (5 mgrs & mons and 504 OSDs,
> > 24 per node), the mgrs are, after an unspecified period of time, dropping
> > out of the cluster. The logs only show the following:
> >
> > debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster)
> > log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB
> > used, 6.3 PiB / 6.3 PiB avail
> > debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient:
> > _check_auth_rotating possible clock skew, rotating keys expired way too
> > early (before 2020-12-10T02:20:59.226159+0000)
> > debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient:
> > _check_auth_rotating possible clock skew, rotating keys expired way too
> > early (before 2020-12-10T02:21:00.226310+0000)
> >
> > The _check_auth_rotating message repeats approximately every second. The
> > instances are all syncing their time with NTP and have no issues on that
> > front. A restart of the mgr fixes the issue.
> >
> > It appears that this may be related to
> > https://tracker.ceph.com/issues/39264.
> > The suggestion seems to be to disable prometheus metrics, but this isn't
> > realistic for a production environment where metrics are critical for
> > operations.
> >
> > Please let us know what additional information we can provide to assist
> > in resolving this critical issue.
> >
> > Cheers
> > Welby
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx