I am running the very latest version of Nautilus. I will try setting up
an external exporter today and see if that fixes anything. Our cluster
is somewhat large-ish with 1248 OSDs, so I expect stat collection to
take "some" time, but it definitely shouldn't crash the MGRs all the
time.

On 21/03/2020 02:33, Paul Choi wrote:
> Hi Janek,
>
> What version of Ceph are you using?
> We also have a much smaller cluster running Nautilus, with no MDS. No
> Prometheus issues there.
> I won't speculate further than this, but perhaps Nautilus doesn't have
> the same issue as Mimic?
>
> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> I think this is related to my previous post to this list about MGRs
> failing regularly and being overall quite slow to respond. The problem
> existed before, but the new version has made it much worse. My MGRs
> keep dying every few hours and need to be restarted. The Prometheus
> plugin works, but it's pretty slow and so is the dashboard.
> Unfortunately, nobody seems to have a solution for this, and I wonder
> why more people aren't complaining about this problem.
>
>
> On 20/03/2020 19:30, Paul Choi wrote:
> > If I "curl http://localhost:9283/metrics" and wait long enough, I get
> > this - it says "No MON connection". But the mons are healthy and the
> > cluster is functioning fine.
> > That said, the mons' rocksdb sizes are fairly big because there's
> > lots of rebalancing going on. The Prometheus endpoint hanging seems
> > to happen regardless of the mon store size anyhow.
> >
> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >
> > # fg
> > curl -H "Connection: close" http://localhost:9283/metrics
> > <!DOCTYPE html PUBLIC
> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > <html>
> > <head>
> >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
> >     <title>503 Service Unavailable</title>
> >     <style type="text/css">
> >     #powered_by {
> >         margin-top: 20px;
> >         border-top: 2px solid black;
> >         font-style: italic;
> >     }
> >
> >     #traceback {
> >         color: red;
> >     }
> >     </style>
> > </head>
> > <body>
> >     <h2>503 Service Unavailable</h2>
> >     <p>No MON connection</p>
> >     <pre id="traceback">Traceback (most recent call last):
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
> >     response.body = self.handler()
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
> >     self.body = self.oldhandler(*args, **kwargs)
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
> >     return self.callable(*self.args, **self.kwargs)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >     return self._metrics(instance)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >     raise cherrypy.HTTPError(503, 'No MON connection')
> > HTTPError: (503, 'No MON connection')
> > </pre>
> >     <div id="powered_by">
> >       <span>
> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >       </span>
> >     </div>
> > </body>
> > </html>
> >
> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
> >
> >> Hello,
> >>
> >> We are running Mimic 13.2.8 on our cluster, and since upgrading to
> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
> >> in under 10s, but now it often hangs. Restarting the mgr processes
> >> helps temporarily, but within minutes it gets stuck again.
> >>
> >> The active mgr doesn't exit when doing "systemctl stop
> >> ceph-mgr.target" and needs to be kill -9'ed.
> >>
> >> Is there anything I can do to address this issue, or at least get
> >> better visibility into it?
> >>
> >> We only have a few plugins enabled:
> >> $ ceph mgr module ls
> >> {
> >>     "enabled_modules": [
> >>         "balancer",
> >>         "prometheus",
> >>         "zabbix"
> >>     ],
> >>
> >> 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs)
> >> and a busy one with lots of rebalancing. (I don't know if a busy
> >> cluster would seriously affect the mgr's performance, but just
> >> throwing it out there.)
> >>
> >>   services:
> >>     mon: 5 daemons, quorum
> >>          woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >>     rgw: 4 daemons active
> >>
> >> Thanks in advance for your help,
> >>
> >> -Paul Choi
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
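
For reference, a rough sketch (not from the original thread) of the kind
of manual checks discussed above when the mgr Prometheus endpoint stops
responding. Only standard Ceph and shell commands are used; the active
mgr name (woodenbox2) and exporter port (9283) are taken from the output
quoted earlier and will differ on other clusters:

    # Probe the exporter with a hard timeout instead of letting curl hang.
    timeout 30 curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics

    # Fail over to a standby mgr rather than kill -9'ing the active one.
    ceph mgr fail woodenbox2

    # Temporarily raise mgr log verbosity to see where the module gets
    # stuck, then drop the override again once the hang has been captured.
    ceph config set mgr debug_mgr 10
    ceph config rm mgr debug_mgr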