I won't speculate further about the MDS's stability, but I do wonder
about the same thing. There is one file served by the MDS that would
cause the ceph-fuse client to hang. It was a file that many people in
the company relied on for data updates, so it was very noticeable. The
only fix was to fail over the MDS. Since the free disk space dropped, I
haven't heard anyone complain... <shrug>
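
In case it helps anyone else, the failover itself is only a couple of
commands. This is a minimal sketch, assuming a single file system and
using the daemon name from the status output quoted further down, so
adjust it to your own setup:

  ceph fs status                              # identify the active MDS and its rank
  ceph mds fail 0                             # fail rank 0 so a standby takes over
  # or restart the active daemon directly on its host:
  sudo systemctl restart ceph-mds@woodenbox6

Either way a standby (or the standby-replay daemon) picks up the rank
and clients reconnect to it.
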
On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:

> If there is actually a connection, then it's no wonder our MDS kept
> crashing. Our Ceph has 9.2 PiB of available space at the moment.
>
> On 26/03/2020 17:32, Paul Choi wrote:
>
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool dropped
> substantially below 1 PB.
> I wonder if there's some metric that exceeds the maximum value of an
> int, double, etc.?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
>> I haven't seen any MGR hangs so far since I disabled the prometheus
>> module. It seems like the module is not only slow, but kills the
>> whole MGR when the cluster is sufficiently large, so these two
>> issues are most likely connected. The issue has become much, much
>> worse with 14.2.8.
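
For anyone who wants to try the same workaround, toggling the module is
roughly the following. A minimal sketch, assuming a standard
Mimic/Nautilus deployment; the scrape_interval option is an assumption
based on recent releases, so check that your version actually has it:

  ceph mgr module disable prometheus       # stop the built-in exporter
  ceph mgr module ls                       # it should no longer appear under "enabled_modules"
  # if you keep it enabled, a longer scrape interval may reduce the load:
  ceph config set mgr mgr/prometheus/scrape_interval 60
  ceph mgr module enable prometheus        # re-enable later if desired

While the module is disabled, the endpoint on port 9283 goes away
entirely, so any Prometheus server scraping it will show that target as
down.
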
>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> > I am running the very latest version of Nautilus. I will try
>> > setting up an external exporter today and see if that fixes
>> > anything. Our cluster is somewhat large, with 1248 OSDs, so I
>> > expect stat collection to take "some" time, but it definitely
>> > shouldn't crash the MGRs all the time.
>> >
>> > On 21/03/2020 02:33, Paul Choi wrote:
>> >> Hi Janek,
>> >>
>> >> What version of Ceph are you using?
>> >> We also have a much smaller cluster running Nautilus, with no MDS.
>> >> No Prometheus issues there.
>> >> I won't speculate further than this, but perhaps Nautilus doesn't
>> >> have the same issue as Mimic?
>> >>
>> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>> >> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>> >>
>> >> I think this is related to my previous post to this list about
>> >> MGRs failing regularly and being overall quite slow to respond.
>> >> The problem has existed before, but the new version has made it
>> >> way worse. My MGRs keep dying every few hours and need to be
>> >> restarted. The Prometheus plugin works, but it's pretty slow and
>> >> so is the dashboard. Unfortunately, nobody seems to have a
>> >> solution for this, and I wonder why more people aren't complaining
>> >> about this problem.
>> >>
>> >> On 20/03/2020 19:30, Paul Choi wrote:
>> >> > If I "curl http://localhost:9283/metrics" and wait long enough,
>> >> > I get the response below, which says "No MON connection". But
>> >> > the mons are healthy and the cluster is functioning fine.
>> >> > That said, the mons' rocksdb sizes are fairly big because
>> >> > there's lots of rebalancing going on. The Prometheus endpoint
>> >> > hanging seems to happen regardless of the mon size anyhow.
>> >> >
>> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>> >> >
>> >> > # fg
>> >> > curl -H "Connection: close" http://localhost:9283/metrics
>> >> > <!DOCTYPE html PUBLIC
>> >> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> >> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> >> > <html>
>> >> > <head>
>> >> >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>> >> >     <title>503 Service Unavailable</title>
>> >> >     <style type="text/css">
>> >> >     #powered_by {
>> >> >         margin-top: 20px;
>> >> >         border-top: 2px solid black;
>> >> >         font-style: italic;
>> >> >     }
>> >> >
>> >> >     #traceback {
>> >> >         color: red;
>> >> >     }
>> >> >     </style>
>> >> > </head>
>> >> > <body>
>> >> >     <h2>503 Service Unavailable</h2>
>> >> >     <p>No MON connection</p>
>> >> >     <pre id="traceback">Traceback (most recent call last):
>> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>> >> >     response.body = self.handler()
>> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>> >> >     self.body = self.oldhandler(*args, **kwargs)
>> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>> >> >     return self.callable(*self.args, **self.kwargs)
>> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>> >> >     return self._metrics(instance)
>> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>> >> >     raise cherrypy.HTTPError(503, 'No MON connection')
>> >> > HTTPError: (503, 'No MON connection')
>> >> > </pre>
>> >> >     <div id="powered_by">
>> >> >       <span>
>> >> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>> >> >       </span>
>> >> >     </div>
>> >> > </body>
>> >> > </html>
>> >> >
>> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> We are running Mimic 13.2.8 on our cluster, and since upgrading
>> >> >> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
>> >> >> respond in under 10s, but now it often hangs. Restarting the mgr
>> >> >> processes helps temporarily, but within minutes it gets stuck
>> >> >> again.
>> >> >>
>> >> >> The active mgr doesn't exit when doing "systemctl stop
>> >> >> ceph-mgr.target" and needs to be kill -9'ed.
>> >> >>
>> >> >> Is there anything I can do to address this issue, or at least
>> >> >> get better visibility into it?
>> >> >>
>> >> >> We only have a few plugins enabled:
>> >> >> $ ceph mgr module ls
>> >> >> {
>> >> >>     "enabled_modules": [
>> >> >>         "balancer",
>> >> >>         "prometheus",
>> >> >>         "zabbix"
>> >> >>     ],
>> >> >>
>> >> >> 3 mgr processes, but it's a pretty large cluster (nearly 4000
>> >> >> OSDs) and a busy one with lots of rebalancing. (I don't know if
>> >> >> a busy cluster would seriously affect the mgr's performance, but
>> >> >> just throwing it out there.)
>> >> >>
>> >> >>   services:
>> >> >>     mon: 5 daemons, quorum
>> >> >>          woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>> >> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>> >> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>> >> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>> >> >>     rgw: 4 daemons active
>> >> >>
>> >> >> Thanks in advance for your help,
>> >> >>
>> >> >> -Paul Choi
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
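
P.S. On the "needs to be kill -9'ed" point above: it usually helps to
fail the active mgr over to a standby before killing the wedged
process, so that metrics and the dashboard come back while you clean
up. A minimal sketch, using the daemon names from the status output
above and assuming the usual ceph-mgr@<name> systemd units:

  ceph mgr fail woodenbox2                             # hand the active role to a standby
  ceph -s | grep mgr                                   # confirm a standby took over
  sudo systemctl kill -s SIGKILL ceph-mgr@woodenbox2   # force-kill the wedged daemon on its host
  sudo systemctl start ceph-mgr@woodenbox2             # bring it back as a standby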