I've reported stability problems with ceph-mgr with the prometheus plugin enabled on every version we have run in production, which has been several releases of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size, with about 9 PB raw and around 35% utilization.

On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>
> On 26/03/2020 18:10, Paul Choi wrote:
>> I won't speculate more about the MDS's stability, but I do wonder about the same thing.
>> There was one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>
>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>
>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2 PiB of available space at the moment.
>>>
>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool dropped substantially below 1 PB.
>>>> I wonder if there's some metric that exceeds the maximum value of some int, double, etc.?
>>>>
>>>> -Paul
>>>>
>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
>>>>>
>>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>>
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why not more people are complaining about this problem.
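
In case it helps anyone landing on this thread: the workaround on our side boils down to something like the commands below. The Zabbix server name is a placeholder, and the zabbix module needs the zabbix_sender binary installed on the mgr hosts.

    # stop the prometheus exporter inside ceph-mgr
    ceph mgr module disable prometheus

    # hand monitoring over to the zabbix module instead
    ceph mgr module enable zabbix
    ceph zabbix config-set zbx_host zabbix.example.com
    ceph zabbix send    # push one batch of data right away to verify the pipeline

Since turning the module off our MGRs have been stable, so it is at least a quick way to confirm whether the prometheus module is what is killing them.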
>>>>>>>>
>>>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>>>> If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get the response below - it says "No MON connection". But the mons are healthy and the cluster is functioning fine.
>>>>>>>>> That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>>>
>>>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>
>>>>>>>>> # fg
>>>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>>>>     "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>>>>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>>>> <html>
>>>>>>>>> <head>
>>>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>>>     <style type="text/css">
>>>>>>>>>     #powered_by {
>>>>>>>>>         margin-top: 20px;
>>>>>>>>>         border-top: 2px solid black;
>>>>>>>>>         font-style: italic;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     #traceback {
>>>>>>>>>         color: red;
>>>>>>>>>     }
>>>>>>>>>     </style>
>>>>>>>>> </head>
>>>>>>>>> <body>
>>>>>>>>>     <h2>503 Service Unavailable</h2>
>>>>>>>>>     <p>No MON connection</p>
>>>>>>>>>     <pre id="traceback">Traceback (most recent call last):
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>>>     response.body = self.handler()
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>>>     return self._metrics(instance)
>>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>>>> </pre>
>>>>>>>>>     <div id="powered_by">
>>>>>>>>>     <span>
>>>>>>>>>         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>>>     </span>
>>>>>>>>>     </div>
>>>>>>>>> </body>
>>>>>>>>> </html>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10 s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
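
A small aside for anyone else poking at this: the curl itself will happily hang along with the module, so it is worth putting a timeout on the probe and double-checking which mgr is actually active, since only the active mgr runs the module. Something along these lines (9283 is just the module's default port):

    # probe the exporter, but give up after 30 seconds instead of hanging
    curl --max-time 30 -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics

    # confirm which mgr is currently active and therefore serving the endpoint
    ceph mgr dump | grep active_name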
>>>>>>>>>>
>>>>>>>>>> The active mgr doesn't exit when doing "systemctl stop ceph-mgr.target" and needs to be kill -9'ed.
>>>>>>>>>>
>>>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>>>
>>>>>>>>>> We only have a few plugins enabled:
>>>>>>>>>>
>>>>>>>>>> $ ceph mgr module ls
>>>>>>>>>> {
>>>>>>>>>>     "enabled_modules": [
>>>>>>>>>>         "balancer",
>>>>>>>>>>         "prometheus",
>>>>>>>>>>         "zabbix"
>>>>>>>>>>     ],
>>>>>>>>>>
>>>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know if a busy cluster would seriously affect the mgr's performance, but just throwing it out there.)
>>>>>>>>>>
>>>>>>>>>>   services:
>>>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>>>     rgw: 4 daemons active
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for your help,
>>>>>>>>>>
>>>>>>>>>> -Paul Choi
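
To Paul's question about what can be done: apart from disabling the module, you can at least avoid the kill -9 dance by asking the cluster to fail the active mgr over to a standby before touching the stuck process. With the hostnames from your status output that would be roughly the following; the systemd unit name assumes the stock packaging (ceph-mgr@<hostname>):

    # demote the stuck active mgr so woodenbox0 or woodenbox1 takes over
    ceph mgr fail woodenbox2

    # then restart the wedged daemon on that host at your leisure
    systemctl restart ceph-mgr@woodenbox2

If the old process still refuses to die you may end up killing it anyway, but at least a standby is already active and monitoring stays up in the meantime.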