We have a PR open to fix this, and we've validated that it resolves the issue in our larger clusters. We could use some help getting it moved forward, since it seems to impact a number of users: https://github.com/ceph/ceph/pull/38677

On Fri, Dec 11, 2020 at 9:10 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> No; since the responses we've seen on the mailing lists and in the bug report(s) indicated that disabling it fixed the situation, we didn't proceed down that path (it seemed highly probable it would resolve things). If it's of additional value, we can disable the module temporarily to see whether the problem no longer presents itself, but our intent would not be to leave the module disabled permanently; we'd rather work towards a resolution of the issue at hand.
>
> Let us know if disabling this module would assist in troubleshooting, and we're happy to do so.
>
> FWIW, we've also built a container with all of the debuginfo packages and gdb set up to inspect the unresponsive ceph-mgr process, but our understanding of Ceph's internal workings is not deep enough to determine why it appears to be deadlocking. That said, we welcome any requests for additional information we can provide to help determine the cause and implement a solution.
>
> David
>
> On Fri, Dec 11, 2020 at 8:10 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> On 11/12/2020 00:12, David Orman wrote:
>>> Hi Janek,
>>>
>>> We realize this; we referenced that issue in our initial email. We do want the metrics exposed by Ceph internally, and would prefer to work towards a fix upstream. We appreciate the suggestion for a workaround, however!
>>>
>>> Again, we're happy to provide whatever information we can that would be of assistance. If there's some debug setting that is preferred, we are happy to enable it, as this is currently a test cluster for us to work through issues such as this one.
>>
>> Have you tried disabling Prometheus just to see if this also fixes the issue for you?
>>
>> Wido
>>
>>> David
>>>
>>> On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>>> Do you have the prometheus module enabled? Turn that off; it's causing issues. I replaced it with another ceph exporter from GitHub and almost forgot about it.
>>>>
>>>> Here's the relevant issue report: https://tracker.ceph.com/issues/39264#change-179946
>>>>
>>>> On 10/12/2020 16:43, Welby McRoberts wrote:
>>>>> Hi Folks
>>>>>
>>>>> We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs at 24 per node), the mgrs are, after an unspecific period of time, dropping out of the cluster. The logs only show the following:
>>>>>
>>>>> debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
>>>>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
>>>>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)
>>>>>
>>>>> The _check_auth_rotating message repeats approximately every second.
>>>>> The instances are all syncing their time with NTP and have no issues on that front. A restart of the mgr fixes the issue.
>>>>>
>>>>> It appears that this may be related to https://tracker.ceph.com/issues/39264. The suggestion there seems to be to disable the Prometheus metrics; however, this obviously isn't realistic for a production environment where metrics are critical for operations.
>>>>>
>>>>> Please let us know what additional information we can provide to assist in resolving this critical issue.
>>>>>
>>>>> Cheers
>>>>> Welby
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
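For anyone landing on this thread with the same symptoms, here is a rough sketch of the commands being discussed above: Wido/Janek's suggestion to test with the prometheus module off, Welby's mgr-restart workaround, and David's gdb inspection of the hung process. These are standard Ceph CLI and gdb invocations, not anything specific from the posts; <mgr-name> is a placeholder for your active mgr, and in a containerized deployment gdb has to be run inside (or alongside) the mgr container, as David describes.

  # Check whether the prometheus mgr module is on, and temporarily disable
  # it to test whether the mgr lockups stop (re-enable it afterwards):
  ceph mgr module ls
  ceph mgr module disable prometheus
  ceph mgr module enable prometheus

  # Recover from a hung active mgr by failing over to a standby:
  ceph mgr fail <mgr-name>

  # Capture thread backtraces from an unresponsive ceph-mgr process
  # (the debuginfo packages make the output far more useful):
  gdb --batch -p "$(pidof ceph-mgr)" -ex 'thread apply all bt' > mgr-backtraces.txt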