I've reported stability problems with ceph-mgr with the prometheus plugin enabled on every version we have run in production, which has been several releases of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size, with about 9 PB raw and around 35% utilization.

On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>
> On 26/03/2020 18:10, Paul Choi wrote:
>> I won't speculate more about the MDS's stability, but I do wonder about the same thing.
>> There was one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>
>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>
>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2 PiB of available space at the moment.
>>>
>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool dropped substantially below 1 PB.
>>>> I wonder if there's some metric that exceeds the maximum value of some int, double, etc.?
>>>>
>>>> -Paul
>>>>
>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
>>>>>
>>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>>
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why not more people are complaining about this problem.
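
In case it helps anyone landing on this thread: the workaround on our side boils down to something like the commands below. The Zabbix server name is a placeholder, and the zabbix module needs the zabbix_sender binary installed on the mgr hosts.

    # stop the prometheus exporter inside ceph-mgr
    ceph mgr module disable prometheus

    # hand monitoring over to the zabbix module instead
    ceph mgr module enable zabbix
    ceph zabbix config-set zbx_host zabbix.example.com
    ceph zabbix send    # push one batch of data right away to verify the pipeline

Since turning the module off our MGRs have been stable, so it is at least a quick way to confirm whether the prometheus module is what is killing them.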
>>>>>>>>
>>>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>>>> If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get the response below - it says "No MON connection". But the mons are healthy and the cluster is functioning fine.
>>>>>>>>> That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>>>
>>>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>>
>>>>>>>>> # fg
>>>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>>>>     "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>>>>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>>>> <html>
>>>>>>>>> <head>
>>>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>>>     <style type="text/css">
>>>>>>>>>     #powered_by {
>>>>>>>>>         margin-top: 20px;
>>>>>>>>>         border-top: 2px solid black;
>>>>>>>>>         font-style: italic;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     #traceback {
>>>>>>>>>         color: red;
>>>>>>>>>     }
>>>>>>>>>     </style>
>>>>>>>>> </head>
>>>>>>>>> <body>
>>>>>>>>>     <h2>503 Service Unavailable</h2>
>>>>>>>>>     <p>No MON connection</p>
>>>>>>>>>     <pre id="traceback">Traceback (most recent call last):
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>>>     response.body = self.handler()
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>>>     return self._metrics(instance)
>>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>>>> </pre>
>>>>>>>>>     <div id="powered_by">
>>>>>>>>>     <span>
>>>>>>>>>         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>>>     </span>
>>>>>>>>>     </div>
>>>>>>>>> </body>
>>>>>>>>> </html>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10 s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
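
A small aside for anyone else poking at this: the curl itself will happily hang along with the module, so it is worth putting a timeout on the probe and double-checking which mgr is actually active, since only the active mgr runs the module. Something along these lines (9283 is just the module's default port):

    # probe the exporter, but give up after 30 seconds instead of hanging
    curl --max-time 30 -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics

    # confirm which mgr is currently active and therefore serving the endpoint
    ceph mgr dump | grep active_name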
>>>>>>>>>>
>>>>>>>>>> The active mgr doesn't exit when doing "systemctl stop ceph-mgr.target" and needs to be kill -9'ed.
>>>>>>>>>>
>>>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>>>
>>>>>>>>>> We only have a few plugins enabled:
>>>>>>>>>>
>>>>>>>>>> $ ceph mgr module ls
>>>>>>>>>> {
>>>>>>>>>>     "enabled_modules": [
>>>>>>>>>>         "balancer",
>>>>>>>>>>         "prometheus",
>>>>>>>>>>         "zabbix"
>>>>>>>>>>     ],
>>>>>>>>>>
>>>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know if a busy cluster would seriously affect the mgr's performance, but just throwing it out there.)
>>>>>>>>>>
>>>>>>>>>>   services:
>>>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>>>     rgw: 4 daemons active
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for your help,
>>>>>>>>>>
>>>>>>>>>> -Paul Choi
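
To Paul's question about what can be done: apart from disabling the module, you can at least avoid the kill -9 dance by asking the cluster to fail the active mgr over to a standby before touching the stuck process. With the hostnames from your status output that would be roughly the following; the systemd unit name assumes the stock packaging (ceph-mgr@<hostname>):

    # demote the stuck active mgr so woodenbox0 or woodenbox1 takes over
    ceph mgr fail woodenbox2

    # then restart the wedged daemon on that host at your leisure
    systemctl restart ceph-mgr@woodenbox2

If the old process still refuses to die you may end up killing it anyway, but at least a standby is already active and monitoring stays up in the meantime.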