Hi Janek,

We realize this; we referenced that issue in our initial email. We do want the
metrics exposed by Ceph internally, and would prefer to work towards a fix
upstream. We appreciate the suggestion for a workaround, however!

Again, we're happy to provide whatever information we can that would be of
assistance. If there's some debug setting that is preferred, we are happy to
implement it, as this is currently a test cluster for us to work through
issues such as this one.

David

On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:

> Do you have the prometheus module enabled? Turn that off, it's causing
> issues. I replaced it with another ceph exporter from Github and almost
> forgot about it.
>
> Here's the relevant issue report:
> https://tracker.ceph.com/issues/39264#change-179946
>
> On 10/12/2020 16:43, Welby McRoberts wrote:
> > Hi Folks
> >
> > We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs
> > with 24 per node) the mgrs are, after a non-specific period of time,
> > dropping out of the cluster. The logs only show the following:
> >
> > debug 2020-12-10T02:02:50.409+0000 7f1005840700 0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
> > debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
> > debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)
> >
> > The _check_auth_rotating message repeats approximately every second. The
> > instances are all syncing their time with NTP and have no issues on that
> > front. A restart of the mgr fixes the issue.
> >
> > It appears that this may be related to
> > https://tracker.ceph.com/issues/39264.
> > The suggestion seems to be to disable prometheus metrics, however, this
> > obviously isn't realistic for a production environment where metrics are
> > critical for operations.
> >
> > Please let us know what additional information we can provide to assist in
> > resolving this critical issue.
> >
> > Cheers
> > Welby
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx