Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

No. The responses we've seen on the mailing lists and in the bug report(s)
indicate that disabling the module fixes the situation, so we didn't proceed
down that path ourselves (it seemed highly probable it would resolve things).
If it's of additional value, we can disable the module temporarily to confirm
the problem no longer presents itself, but our intent would not be to leave
the module disabled indefinitely; we'd rather work towards a resolution of
the underlying issue.

Let us know if disabling this module would assist in troubleshooting, and
we're happy to do so.
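
For reference, the temporary test would just use the standard mgr module
commands (this sketch assumes the built-in exporter module, named
prometheus, is the one in question):

```shell
# List mgr modules to confirm prometheus is currently enabled
ceph mgr module ls

# Temporarily disable the built-in Prometheus exporter
ceph mgr module disable prometheus

# ...observe whether the mgrs still drop out of the cluster...

# Re-enable the exporter once the test is complete
ceph mgr module enable prometheus
```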

FWIW - we've also built a container with all of the debuginfo packages and
gdb set up to inspect the unresponsive ceph-mgr process, but our
understanding of Ceph's internals is not deep enough to determine why it
appears to be deadlocking. That said, we welcome any requests for additional
information we can provide to help determine the cause and implement a
solution.
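
In case it's useful to anyone else chasing this, the inspection boils down
to attaching gdb to the hung daemon and dumping all thread backtraces to
look for threads blocked on locks (this assumes the debuginfo packages
matching the running Ceph version are installed):

```shell
# Attach to the unresponsive ceph-mgr non-interactively and dump a
# backtrace of every thread; threads stuck in mutex/condvar waits
# are the candidates for the deadlock
gdb --batch -p "$(pidof ceph-mgr)" -ex "thread apply all bt" > mgr-backtraces.txt
```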

David

On Fri, Dec 11, 2020 at 8:10 AM Wido den Hollander <wido@xxxxxxxx> wrote:

>
>
> On 11/12/2020 00:12, David Orman wrote:
> > Hi Janek,
> >
> > We realize this; we referenced that issue in our initial email. We do
> > want the metrics exposed by Ceph internally, and would prefer to work
> > towards a fix upstream. We appreciate the suggestion for a workaround,
> > however!
> >
> > Again, we're happy to provide whatever information we can that would be
> > of assistance. If there's some preferred debug setting, we are happy to
> > enable it, as this is currently a test cluster for us to work through
> > issues such as this one.
> >
>
> Have you tried disabling Prometheus just to see if this also fixes the
> issue for you?
>
> Wido
>
> > David
> >
> > On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <
> > janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >
> >> Do you have the prometheus module enabled? Turn that off, it's causing
> >> issues. I replaced it with another ceph exporter from Github and almost
> >> forgot about it.
> >>
> >> Here's the relevant issue report:
> >> https://tracker.ceph.com/issues/39264#change-179946
> >>
> >> On 10/12/2020 16:43, Welby McRoberts wrote:
> >>> Hi Folks
> >>>
> >>> We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504
> >>> OSDs with 24 per node), the mgrs are, after a non-specific period of
> >>> time, dropping out of the cluster. The logs only show the following:
> >>>
> >>> debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster)
> >>> log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data,
> >>> 31 TiB used, 6.3 PiB / 6.3 PiB avail
> >>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient:
> >>> _check_auth_rotating possible clock skew, rotating keys expired way too
> >>> early (before 2020-12-10T02:20:59.226159+0000)
> >>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient:
> >>> _check_auth_rotating possible clock skew, rotating keys expired way too
> >>> early (before 2020-12-10T02:21:00.226310+0000)
> >>>
> >>> The _check_auth_rotating message repeats approximately every second.
> >>> The instances are all syncing their time with NTP and have no issues
> >>> on that front. A restart of the mgr fixes the issue.
> >>>
> >>> It appears that this may be related to
> >>> https://tracker.ceph.com/issues/39264. The suggestion seems to be to
> >>> disable prometheus metrics; however, this obviously isn't realistic
> >>> for a production environment where metrics are critical for operations.
> >>>
> >>> Please let us know what additional information we can provide to
> >>> assist in resolving this critical issue.
> >>>
> >>> Cheers
> >>> Welby
> >>
> >
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


