Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264

Unfortunately, the issue hasn't received much (or any) attention yet. So
let's get this fixed; the prometheus module is unusable in its current
state.
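
In the meantime, the only workaround I know of is to disable the module and
fail the active mgr over to a standby so it recovers. A rough sketch
(substitute your active mgr's name for the placeholder):

    ceph mgr module disable prometheus
    ceph mgr fail <active-mgr-name>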


On 23/03/2020 17:50, Janek Bevendorff wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> I am running the very latest version of Nautilus. I will try setting up
>> an external exporter today and see if that fixes anything. Our cluster
>> is fairly large with 1248 OSDs, so I expect stat collection to take
>> "some" time, but it definitely shouldn't crash the MGRs all the time.
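>>
>> To put a number on "some" time, I plan to simply time the endpoint before
>> and after the change (a quick check against the module's default port):
>>
>>     time curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics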
>>
>> On 21/03/2020 02:33, Paul Choi wrote:
>>> Hi Janek,
>>>
>>> What version of Ceph are you using?
>>> We also have a much smaller cluster running Nautilus, with no MDS. No
>>> Prometheus issues there.
>>> I won't speculate further than this but perhaps Nautilus doesn't have
>>> the same issue as Mimic?
>>>
>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>>     I think this is related to my previous post to this list about MGRs
>>>     failing regularly and being quite slow to respond overall. The problem
>>>     existed before, but the new version has made it way worse. My MGRs
>>>     keep dying every few hours and need to be restarted. The Prometheus
>>>     plugin works, but it's pretty slow and so is the dashboard.
>>>     Unfortunately, nobody seems to have a solution for this, and I wonder
>>>     why more people aren't complaining about the problem.
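>>>
>>>     To at least see how often the active mgr dies, a simple check of the
>>>     mgr map works for me (jq is only there for readability):
>>>
>>>         ceph mgr dump | jq '.active_name, .available'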
>>>
>>>
>>>     On 20/03/2020 19:30, Paul Choi wrote:
>>>     > If I "curl http://localhost:9283/metrics" and wait long enough, I get
>>>     > this; it says "No MON connection", but the mons are healthy and the
>>>     > cluster is functioning fine.
>>>     > That said, the mons' rocksdb sizes are fairly big because there's a
>>>     > lot of rebalancing going on. The Prometheus endpoint hanging seems to
>>>     > happen regardless of the mon store size anyhow.
>>>     >
>>>     >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>     >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>     >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>     >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>     >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>     >
>>>     > # fg
>>>     > curl -H "Connection: close" http://localhost:9283/metrics
>>>     > <!DOCTYPE html PUBLIC
>>>     > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>     > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";>
>>>     > <html>
>>>     > <head>
>>>     >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>     >     <title>503 Service Unavailable</title>
>>>     >     <style type="text/css">
>>>     >     #powered_by {
>>>     >         margin-top: 20px;
>>>     >         border-top: 2px solid black;
>>>     >         font-style: italic;
>>>     >     }
>>>     >
>>>     >     #traceback {
>>>     >         color: red;
>>>     >     }
>>>     >     </style>
>>>     > </head>
>>>     >     <body>
>>>     >         <h2>503 Service Unavailable</h2>
>>>     >         <p>No MON connection</p>
>>>     >         <pre id="traceback">Traceback (most recent call last):
>>>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>     >     response.body = self.handler()
>>>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>     >     self.body = self.oldhandler(*args, **kwargs)
>>>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>     >     return self.callable(*self.args, **self.kwargs)
>>>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>     >     return self._metrics(instance)
>>>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>     >     raise cherrypy.HTTPError(503, 'No MON connection')
>>>     > HTTPError: (503, 'No MON connection')
>>>     > </pre>
>>>     >     <div id="powered_by">
>>>     >       <span>
>>>     >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>     >       </span>
>>>     >     </div>
>>>     >     </body>
>>>     > </html>
>>>     >
>>>     > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>     >
>>>     >> Hello,
>>>     >>
>>>     >> We are running Mimic 13.2.8 on our cluster, and since upgrading to
>>>     >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond
>>>     >> in under 10s, but now it often hangs. Restarting the mgr processes
>>>     >> helps temporarily, but within minutes it gets stuck again.
>>>     >>
>>>     >> The active mgr doesn't exit on "systemctl stop ceph-mgr.target" and
>>>     >> needs to be kill -9'ed.
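>>>     >>
>>>     >> When it gets stuck like that, what I end up doing is roughly this
>>>     >> (a sketch, assuming the usual ceph-mgr@<hostname> unit name):
>>>     >>
>>>     >>     systemctl kill --signal=SIGKILL ceph-mgr@$(hostname -s).service
>>>     >>     systemctl start ceph-mgr@$(hostname -s).service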
>>>     >>
>>>     >> Is there anything I can do to address this, or at least get better
>>>     >> visibility into the issue?
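>>>     >>
>>>     >> So far the only extra visibility I've found is cranking up the mgr
>>>     >> debug level and tailing its log. Treat this as a guess; I haven't
>>>     >> confirmed that the config-db route works on 13.2.8:
>>>     >>
>>>     >>     ceph config set mgr debug_mgr 10/10
>>>     >>     tail -f /var/log/ceph/ceph-mgr.$(hostname -s).log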
>>>     >>
>>>     >> We only have a few plugins enabled:
>>>     >> $ ceph mgr module ls
>>>     >> {
>>>     >>     "enabled_modules": [
>>>     >>         "balancer",
>>>     >>         "prometheus",
>>>     >>         "zabbix"
>>>     >>     ],
>>>     >>
>>>     >> There are 3 mgr processes, but it's a pretty large cluster (nearly
>>>     >> 4000 OSDs) and a busy one with lots of rebalancing. (I don't know
>>>     >> whether a busy cluster would seriously affect the mgr's performance,
>>>     >> but I'm just throwing it out there.)
>>>     >>
>>>     >>   services:
>>>     >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>     >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>     >>     mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
>>>     >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>     >>     rgw: 4 daemons active
>>>     >>
>>>     >> Thanks in advance for your help,
>>>     >>
>>>     >> -Paul Choi
>>>     >>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



