I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264

Unfortunately, the issue hasn't gotten much (or any) attention yet. So let's
get this fixed: the prometheus module is unusable in its current state. (A
minimal probe sketch for measuring the endpoint hangs is appended below the
quoted thread.)

On 23/03/2020 17:50, Janek Bevendorff wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> I am running the very latest version of Nautilus. I will try setting up
>> an external exporter today and see if that fixes anything. Our cluster
>> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
>> take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>
>> On 21/03/2020 02:33, Paul Choi wrote:
>>> Hi Janek,
>>>
>>> What version of Ceph are you using?
>>> We also have a much smaller cluster running Nautilus, with no MDS. No
>>> Prometheus issues there.
>>> I won't speculate further than this, but perhaps Nautilus doesn't have
>>> the same issue as Mimic?
>>>
>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>> I think this is related to my previous post to this list about MGRs
>>> failing regularly and being overall quite slow to respond. The problem
>>> has existed before, but the new version has made it way worse. My MGRs
>>> keep dying every few hours and need to be restarted. The Prometheus
>>> plugin works, but it's pretty slow and so is the dashboard.
>>> Unfortunately, nobody seems to have a solution for this, and I wonder
>>> why not more people are complaining about this problem.
>>>
>>>
>>> On 20/03/2020 19:30, Paul Choi wrote:
>>> > If I "curl http://localhost:9283/metrics" and wait sufficiently long,
>>> > I get this - it says "No MON connection". But the mons are healthy
>>> > and the cluster is functioning fine.
>>> > That said, the mons' rocksdb sizes are fairly big because there's
>>> > lots of rebalancing going on. The Prometheus endpoint hanging seems
>>> > to happen regardless of the mon size anyhow.
>>> >
>>> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>> >
>>> > # fg
>>> > curl -H "Connection: close" http://localhost:9283/metrics
>>> > <!DOCTYPE html PUBLIC
>>> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>> > <html>
>>> > <head>
>>> >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>> >     <title>503 Service Unavailable</title>
>>> >     <style type="text/css">
>>> >     #powered_by {
>>> >         margin-top: 20px;
>>> >         border-top: 2px solid black;
>>> >         font-style: italic;
>>> >     }
>>> >
>>> >     #traceback {
>>> >         color: red;
>>> >     }
>>> >     </style>
>>> > </head>
>>> > <body>
>>> >     <h2>503 Service Unavailable</h2>
>>> >     <p>No MON connection</p>
>>> >     <pre id="traceback">Traceback (most recent call last):
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>> >     response.body = self.handler()
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>> >     self.body = self.oldhandler(*args, **kwargs)
>>> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>> >     return self.callable(*self.args, **self.kwargs)
>>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>> >     return self._metrics(instance)
>>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>> >     raise cherrypy.HTTPError(503, 'No MON connection')
>>> > HTTPError: (503, 'No MON connection')
>>> > </pre>
>>> >     <div id="powered_by">
>>> >     <span>
>>> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>> >     </span>
>>> >     </div>
>>> > </body>
>>> > </html>
>>> >
>>> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>> >
>>> >> Hello,
>>> >>
>>> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to
>>> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
>>> >> under 10s, but now it often hangs. Restarting the mgr processes helps
>>> >> temporarily, but within minutes it gets stuck again.
>>> >>
>>> >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
>>> >> and needs to be kill -9'ed.
>>> >>
>>> >> Is there anything I can do to address this issue, or at least get
>>> >> better visibility into the issue?
>>> >>
>>> >> We only have a few plugins enabled:
>>> >> $ ceph mgr module ls
>>> >> {
>>> >>     "enabled_modules": [
>>> >>         "balancer",
>>> >>         "prometheus",
>>> >>         "zabbix"
>>> >>     ],
>>> >>
>>> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and
>>> >> it's a busy one with lots of rebalancing.
>>> >> (I don't know if a busy cluster would seriously affect the mgr's
>>> >> performance, but just throwing it out there.)
>>> >>
>>> >>   services:
>>> >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>> >>     rgw: 4 daemons active
>>> >>
>>> >> Thanks in advance for your help,
>>> >>
>>> >> -Paul Choi
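
For anyone who wants to put numbers on these hangs, here is a minimal probe
sketch (plain Python 3 standard library, nothing Ceph-specific; the URL,
TIMEOUT and INTERVAL values are assumptions to adjust for your setup). It
times each scrape of the mgr's /metrics endpoint and flags the 503 "No MON
connection" responses shown in the traceback above:

    #!/usr/bin/env python3
    # Sketch only, not part of Ceph: repeatedly fetch the mgr prometheus
    # endpoint and report how long each scrape takes.
    # URL, TIMEOUT and INTERVAL are assumptions; adjust for your environment.
    import time
    import urllib.error
    import urllib.request

    URL = "http://localhost:9283/metrics"  # default ceph-mgr prometheus port
    TIMEOUT = 30   # seconds before a scrape counts as hung
    INTERVAL = 60  # seconds between probes

    while True:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
                body = resp.read()
                print(f"OK: {len(body)} bytes in {time.monotonic() - start:.1f}s")
        except urllib.error.HTTPError as exc:
            # The 503 "No MON connection" error page lands here.
            print(f"HTTP {exc.code} ({exc.reason}) after {time.monotonic() - start:.1f}s")
        except OSError as exc:
            # Timeouts and connection resets, i.e. the actual "hang" case.
            print(f"FAILED after {time.monotonic() - start:.1f}s: {exc}")
        time.sleep(INTERVAL)

Running it against the active mgr before and after disabling the prometheus
module should make it easy to attach concrete timings to the tracker issue
linked above.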