We have a PR open to fix this, and we've validated that it resolves the issue in our larger clusters. We could use some help getting it moved forward, since it seems to impact a number of users: https://github.com/ceph/ceph/pull/38677

On Fri, Dec 11, 2020 at 9:10 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> No; since the responses we've seen on the mailing lists and in the bug report(s) indicated that disabling it fixed the situation, we didn't proceed down that path (it seemed highly probable it would resolve things). If it's of additional value, we can disable the module temporarily to see whether the problem no longer presents itself, but our intent would not be to leave the module disabled permanently; we'd rather work towards a resolution of the issue at hand.
>
> Let us know if disabling this module would assist in troubleshooting, and we're happy to do so.
>
> FWIW, we've also built a container with all of the debuginfo packages and gdb set up to inspect the unresponsive ceph-mgr process, but our understanding of Ceph's internal workings is not deep enough to determine why it appears to be deadlocking. That said, we welcome any requests for additional information we can provide to help determine the cause and implement a solution.
>
> David
>
> On Fri, Dec 11, 2020 at 8:10 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> On 11/12/2020 00:12, David Orman wrote:
>>> Hi Janek,
>>>
>>> We realize this; we referenced that issue in our initial email. We do want the metrics exposed by Ceph internally, and would prefer to work towards a fix upstream. We appreciate the suggestion for a workaround, however!
>>>
>>> Again, we're happy to provide whatever information we can that would be of assistance. If there's some debug setting that is preferred, we are happy to enable it, as this is currently a test cluster for us to work through issues such as this one.
>>
>> Have you tried disabling Prometheus just to see if this also fixes the issue for you?
>>
>> Wido
>>
>>> David
>>>
>>> On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>>> Do you have the prometheus module enabled? Turn that off; it's causing issues. I replaced it with another ceph exporter from GitHub and almost forgot about it.
>>>>
>>>> Here's the relevant issue report: https://tracker.ceph.com/issues/39264#change-179946
>>>>
>>>> On 10/12/2020 16:43, Welby McRoberts wrote:
>>>>> Hi Folks
>>>>>
>>>>> We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs at 24 per node), the mgrs are, after an unspecific period of time, dropping out of the cluster. The logs only show the following:
>>>>>
>>>>> debug 2020-12-10T02:02:50.409+0000 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
>>>>> debug 2020-12-10T03:20:59.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:20:59.226159+0000)
>>>>> debug 2020-12-10T03:21:00.223+0000 7f10624eb700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-12-10T02:21:00.226310+0000)
>>>>>
>>>>> The _check_auth_rotating message repeats approximately every second.
>>>>> The instances are all syncing their time with NTP and have no issues on that front. A restart of the mgr fixes the issue.
>>>>>
>>>>> It appears that this may be related to https://tracker.ceph.com/issues/39264. The suggestion there seems to be to disable the Prometheus metrics; however, this obviously isn't realistic for a production environment where metrics are critical for operations.
>>>>>
>>>>> Please let us know what additional information we can provide to assist in resolving this critical issue.
>>>>>
>>>>> Cheers
>>>>> Welby
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
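For anyone landing on this thread with the same symptoms, here is a rough sketch of the commands being discussed above: Wido/Janek's suggestion to test with the prometheus module off, Welby's mgr-restart workaround, and David's gdb inspection of the hung process. These are standard Ceph CLI and gdb invocations, not anything specific from the posts; <mgr-name> is a placeholder for your active mgr, and in a containerized deployment gdb has to be run inside (or alongside) the mgr container, as David describes.

  # Check whether the prometheus mgr module is on, and temporarily disable
  # it to test whether the mgr lockups stop (re-enable it afterwards):
  ceph mgr module ls
  ceph mgr module disable prometheus
  ceph mgr module enable prometheus

  # Recover from a hung active mgr by failing over to a standby:
  ceph mgr fail <mgr-name>

  # Capture thread backtraces from an unresponsive ceph-mgr process
  # (the debuginfo packages make the output far more useful):
  gdb --batch -p "$(pidof ceph-mgr)" -ex 'thread apply all bt' > mgr-backtraces.txt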