Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

Paul Choi <pchoi@xxxxxxx> · Fri, 20 Mar 2020 15:33:46 -1000

Hi Janek,

What version of Ceph are you using?
We also have a much smaller cluster running Nautilus, with no MDS. No
Prometheus issues there.
I won't speculate further than this but perhaps Nautilus doesn't have the
same issue as Mimic?
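
To make the comparison between the two clusters less anecdotal, something
like the rough Python 3 sketch below could time the endpoint instead of
eyeballing curl. It is only a sketch: the URL is the one from my curl command
further down, the 10s threshold comes from how fast the plugin used to
respond for us, and the names probe(), classify() and slow_after are made up
for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request

# Assumption: same endpoint as the curl command quoted in this thread.
METRICS_URL = "http://localhost:9283/metrics"


def classify(status, elapsed, slow_after=10.0):
    """Label one probe result. slow_after (seconds) is an illustrative threshold."""
    if status is None:
        return "timeout-or-unreachable"
    if status != 200:
        return "error-%d" % status
    return "slow" if elapsed > slow_after else "ok"


def probe(url=METRICS_URL, timeout=30.0):
    """Fetch the endpoint once; return (HTTP status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # read the whole body so a stalled response counts
            status = resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as the 503 "No MON connection" end up here.
        status = err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, socket timeout, or a broken response.
        status = None
    return status, time.monotonic() - start
```

Calling probe() from cron or a loop and logging classify(status, elapsed)
with a timestamp would at least show how soon after a mgr restart the
endpoint starts hanging again.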

On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <
janek.bevendorff@xxxxxxxxxxxxx> wrote:

> I think this is related to my previous post to this list about MGRs
> failing regularly and being overall quite slow to respond. The problem
> has existed before, but the new version has made it way worse. My MGRs
> keep dying every few hours and need to be restarted. The Prometheus
> plugin works, but it's pretty slow and so is the dashboard.
> Unfortunately, nobody seems to have a solution for this and I wonder why
> not more people are complaining about this problem.
>
>
> On 20/03/2020 19:30, Paul Choi wrote:
> > If I "curl http://localhost:9283/metrics" and wait long enough, I get
> > this - says "No MON connection". But the mons are healthy and the
> > cluster is functioning fine.
> > That said, the mons' rocksdb sizes are fairly big because there's lots of
> > rebalancing going on. The Prometheus endpoint hanging seems to happen
> > regardless of the mon size anyhow.
> >
> >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >
> > # fg
> > curl -H "Connection: close" http://localhost:9283/metrics
> > <!DOCTYPE html PUBLIC
> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > <html>
> > <head>
> >     <meta http-equiv="Content-Type" content="text/html;
> > charset=utf-8"></meta>
> >     <title>503 Service Unavailable</title>
> >     <style type="text/css">
> >     #powered_by {
> >         margin-top: 20px;
> >         border-top: 2px solid black;
> >         font-style: italic;
> >     }
> >
> >     #traceback {
> >         color: red;
> >     }
> >     </style>
> > </head>
> >     <body>
> >         <h2>503 Service Unavailable</h2>
> >         <p>No MON connection</p>
> >         <pre id="traceback">Traceback (most recent call last):
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
> >     response.body = self.handler()
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
> >     self.body = self.oldhandler(*args, **kwargs)
> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
> >     return self.callable(*self.args, **self.kwargs)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >     return self._metrics(instance)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >     raise cherrypy.HTTPError(503, 'No MON connection')
> > HTTPError: (503, 'No MON connection')
> > </pre>
> >     <div id="powered_by">
> >       <span>
> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >       </span>
> >     </div>
> >     </body>
> > </html>
> >
> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
> >
> >> Hello,
> >>
> >> We are running Mimic 13.2.8 with our cluster, and since upgrading to
> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in
> >> under 10s but now it often hangs. Restarting the mgr processes helps
> >> temporarily but within minutes it gets stuck again.
> >>
> >> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
> >> and needs to be kill -9'ed.
> >>
> >> Is there anything I can do to address this issue, or at least get better
> >> visibility into the issue?
> >>
> >> We only have a few plugins enabled:
> >> $ ceph mgr module ls
> >> {
> >>     "enabled_modules": [
> >>         "balancer",
> >>         "prometheus",
> >>         "zabbix"
> >>     ],
> >>
> >> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and
> >> it's a busy one with lots of rebalancing. (I don't know if a busy
> >> cluster would seriously affect the mgr's performance, but just throwing
> >> it out there)
> >>
> >>   services:
> >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >>     mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >>     rgw: 4 daemons active
> >>
> >> Thanks in advance for your help,
> >>
> >> -Paul Choi
> >>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>