I’m actually very curious how well this is performing for you, as I’ve definitely not seen a deployment this large. How do you use it?

> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero@xxxxxxxxx> wrote:
>
> I've reported stability problems with ceph-mgr with the prometheus plugin enabled on all versions we ran in production, which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size with about 9PB raw and around 35% utilization.
>
> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>>
>> On 26/03/2020 18:10, Paul Choi wrote:
>>> I won't speculate more about the MDS's stability, but I do wonder about the same thing.
>>> There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>>
>>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>>
>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>>>
>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB.
>>>> I wonder if there's some metric that exceeds the maximum size of some int, double, etc.?
>>>>
>>>> -Paul
>>>>
>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
>>>>
>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>
>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>> Hi Janek,
>>>>>>
>>>>>> What version of Ceph are you using?
>>>>>> We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there.
>>>>>> I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>
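Two of the replies above work around the problem by turning the in-mgr exporter off entirely. For anyone who wants to do the same, here is a minimal sketch, assuming the stock Mimic/Nautilus "ceph mgr module" CLI; the scrape_interval option may not exist on every release, so check "ceph config help mgr/prometheus/scrape_interval" first.

    # confirm the module is enabled, then switch it off
    $ ceph mgr module ls | head
    $ ceph mgr module disable prometheus

    # if you keep the module, a longer scrape/cache interval reduces load on the mgr
    $ ceph config set mgr mgr/prometheus/scrape_interval 60

    # re-enable once the mgr is stable again
    $ ceph mgr module enable prometheus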
>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why more people aren't complaining about the problem.
>>>>>>
>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>> If I "curl http://localhost:9283/metrics" and wait long enough, I get the output below, which says "No MON connection". But the mons are healthy and the cluster is functioning fine.
>>>>>>> That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>
>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>
>>>>>>> # fg
>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>> <html>
>>>>>>> <head>
>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>     <style type="text/css">
>>>>>>>     #powered_by {
>>>>>>>         margin-top: 20px;
>>>>>>>         border-top: 2px solid black;
>>>>>>>         font-style: italic;
>>>>>>>     }
>>>>>>>
>>>>>>>     #traceback {
>>>>>>>         color: red;
>>>>>>>     }
>>>>>>>     </style>
>>>>>>> </head>
>>>>>>> <body>
>>>>>>>     <h2>503 Service Unavailable</h2>
>>>>>>>     <p>No MON connection</p>
>>>>>>>     <pre id="traceback">Traceback (most recent call last):
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>     response.body = self.handler()
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>     return self._metrics(instance)
>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>> </pre>
>>>>>>>     <div id="powered_by">
>>>>>>>     <span>
>>>>>>>     Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>     </span>
>>>>>>>     </div>
>>>>>>> </body>
>>>>>>> </html>
>>>>>>>
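A side note on the curl test quoted above: giving curl a hard timeout keeps the probe itself from blocking, and failing the active mgr over to a standby usually brings the endpoint back for a while. A rough sketch, assuming the default exporter port 9283 and the woodenbox2 active mgr named later in this thread:

    # probe the exporter with a hard timeout instead of letting curl hang
    $ curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics

    # find the active mgr and force a failover to a standby if the endpoint is wedged
    $ ceph mgr dump | grep active_name
    $ ceph mgr fail woodenbox2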
>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
>>>>>>>>
>>>>>>>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and needs to be kill -9'ed.
>>>>>>>>
>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>
>>>>>>>> We only have a few plugins enabled:
>>>>>>>>
>>>>>>>> $ ceph mgr module ls
>>>>>>>> {
>>>>>>>>     "enabled_modules": [
>>>>>>>>         "balancer",
>>>>>>>>         "prometheus",
>>>>>>>>         "zabbix"
>>>>>>>>     ],
>>>>>>>>
>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know whether a busy cluster would seriously affect the mgr's performance, but I'm just throwing it out there.)
>>>>>>>>
>>>>>>>>   services:
>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>     rgw: 4 daemons active
>>>>>>>>
>>>>>>>> Thanks in advance for your help,
>>>>>>>>
>>>>>>>> -Paul Choi
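On the stuck-shutdown behaviour described above: the forced restart usually looks roughly like the sketch below. It assumes systemd-managed daemons (ceph-mgr.target) and is illustrative only; adjust unit and process names to your deployment.

    # ask systemd to stop the mgr daemons, then fall back to SIGKILL
    # if the active mgr is still running after a grace period
    $ systemctl stop ceph-mgr.target
    $ sleep 30
    $ pgrep -a ceph-mgr && pkill -9 -x ceph-mgr
    $ systemctl start ceph-mgr.target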