> I'm actually very curious how well this is performing for you, as I've definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11 PiB capacity, of which about 15% is used at the moment (web-scale corpora and such). We have deployed 5 MONs and 5 MGRs (both on the same hosts) and it works totally fine overall. We have some MDS performance issues here and there, but that's not too bad anymore after a few upstream patches. And then we have this annoying Prometheus MGR problem, which reliably kills our MGRs after a few hours.

>
>> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero@xxxxxxxxx> wrote:
>>
>> I've reported stability problems with ceph-mgr with the prometheus plugin enabled on all versions we ran in production, which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size with about 9 PB raw and around 35% utilization.
>>
>> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>> Sorry, I meant MGR of course. The MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>>>
>>> On 26/03/2020 18:10, Paul Choi wrote:
>>>> I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>>>
>>>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>>>
>>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2 PiB of available space at the moment.
>>>>
>>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1 PB. I wonder if there's some metric that exceeds the maximum size of some int, double, etc.?
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
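For reference, the workaround described above amounts to toggling the mgr module from the CLI. This is only a minimal sketch, assuming the standard Mimic/Nautilus tooling:

    $ ceph mgr module disable prometheus   # stop the mgr from serving /metrics
    $ ceph mgr module ls                   # "prometheus" should now be gone from "enabled_modules"
    $ ceph mgr module enable prometheus    # re-enable later, once an external exporter or a fixed release is in place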
>>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>>
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow, and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why more people aren't complaining about this problem.
>>>>>>>
>>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>>> If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get the output below, which says "No MON connection". But the mons are healthy and the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>>
>>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>
>>>>>>>> # fg
>>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>>> <html>
>>>>>>>> <head>
>>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>>     <style type="text/css">
>>>>>>>>     #powered_by {
>>>>>>>>         margin-top: 20px;
>>>>>>>>         border-top: 2px solid black;
>>>>>>>>         font-style: italic;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     #traceback {
>>>>>>>>         color: red;
>>>>>>>>     }
>>>>>>>>     </style>
>>>>>>>> </head>
>>>>>>>>     <body>
>>>>>>>>         <h2>503 Service Unavailable</h2>
>>>>>>>>         <p>No MON connection</p>
>>>>>>>>         <pre id="traceback">Traceback (most recent call last):
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>>     response.body = self.handler()
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>>     return self._metrics(instance)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>>> </pre>
>>>>>>>>     <div id="powered_by">
>>>>>>>>     <span>
>>>>>>>>         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>>     </span>
>>>>>>>>     </div>
>>>>>>>>     </body>
>>>>>>>> </html>
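One way to get a bit more visibility into when the endpoint wedges is to poll it with a hard timeout and record only the HTTP status code and response time. This is a rough sketch, not an official tool; the 30-second timeout and 60-second interval are arbitrary choices:

    while true; do
        printf '%s ' "$(date +%FT%T)"       # timestamp for each probe
        curl -sS -m 30 -o /dev/null \
             -w '%{http_code} %{time_total}s\n' http://localhost:9283/metrics
        sleep 60
    done

A probe that prints 000 after the full 30 s indicates the endpoint hung rather than returning the 503 shown above.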
>>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10 s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
>>>>>>>>>
>>>>>>>>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and needs to be kill -9'ed.
>>>>>>>>>
>>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>>
>>>>>>>>> We only have a few plugins enabled:
>>>>>>>>>
>>>>>>>>> $ ceph mgr module ls
>>>>>>>>> {
>>>>>>>>>     "enabled_modules": [
>>>>>>>>>         "balancer",
>>>>>>>>>         "prometheus",
>>>>>>>>>         "zabbix"
>>>>>>>>>     ],
>>>>>>>>>
>>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know if a busy cluster would seriously affect the mgr's performance, but just throwing it out there.)
>>>>>>>>>
>>>>>>>>>   services:
>>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>>     rgw: 4 daemons active
>>>>>>>>>
>>>>>>>>> Thanks in advance for your help,
>>>>>>>>>
>>>>>>>>> -Paul Choi
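The restart workaround Paul describes (systemctl stop followed by a manual kill) roughly looks like the following. A sketch only; woodenbox2 is the active mgr from the status output above, so substitute your own daemon name:

    $ systemctl stop ceph-mgr.target    # ask the active mgr to shut down
    $ pgrep -a ceph-mgr                 # check whether the process actually exited
    $ pkill -9 ceph-mgr                 # if it is still hanging, force-kill it
    $ systemctl start ceph-mgr.target   # bring the daemon back up

Alternatively, `ceph mgr fail woodenbox2` hands the active role over to a standby without touching systemd, which can be less disruptive while debugging.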