If there is actually a connection, then it's no wonder our MDS kept
crashing. Our Ceph cluster has 9.2 PiB of available space at the moment.
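For what it's worth, some back-of-the-envelope numbers on the overflow theory (my own arithmetic, not verified against the mgr or module code): 9.2 PiB is roughly 1.04 x 10^16 bytes, which is above 2^53 (about 9.0 x 10^15), the largest integer an IEEE-754 double can represent exactly, but still far below the signed 64-bit limit of about 9.2 x 10^18. A pool with a bit under 1 PB (10^15 bytes) free, by contrast, still fits exactly in a double. So a hard overflow doesn't obviously line up with the ~1 PB threshold Paul describes below, but byte counts in the multi-PiB range do cross the point where double-precision values silently lose exactness.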
On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool went
> substantially below 1 PB.
> I wonder if there's some metric that exceeds the maximum size of some
> int, double, etc.?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are
> most likely connected. The issue has become much, much worse with
> 14.2.8.
>
> > On 23/03/2020 09:00, Janek Bevendorff wrote:
> > I am running the very latest version of Nautilus. I will try setting
> > up an external exporter today and see if that fixes anything. Our
> > cluster is somewhat large-ish with 1248 OSDs, so I expect stat
> > collection to take "some" time, but it definitely shouldn't crash
> > the MGRs all the time.
> >
> > On 21/03/2020 02:33, Paul Choi wrote:
> >> Hi Janek,
> >>
> >> What version of Ceph are you using?
> >> We also have a much smaller cluster running Nautilus, with no MDS.
> >> No Prometheus issues there.
> >> I won't speculate further than this, but perhaps Nautilus doesn't
> >> have the same issue as Mimic?
> >>
> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >>
> >> I think this is related to my previous post to this list about MGRs
> >> failing regularly and being overall quite slow to respond. The
> >> problem has existed before, but the new version has made it way
> >> worse. My MGRs keep dying every few hours and need to be restarted.
> >> The Prometheus plugin works, but it's pretty slow and so is the
> >> dashboard. Unfortunately, nobody seems to have a solution for this,
> >> and I wonder why more people aren't complaining about this problem.
> >>
> >> On 20/03/2020 19:30, Paul Choi wrote:
> >> > If I "curl http://localhost:9283/metrics" and wait sufficiently
> >> > long, I get this - it says "No MON connection". But the mons are
> >> > healthy and the cluster is functioning fine.
> >> > That said, the mons' RocksDB sizes are fairly big because there's
> >> > lots of rebalancing going on. The Prometheus endpoint hanging
> >> > seems to happen regardless of the mon size anyhow.
> >> >
> >> > mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >> > mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >> >
> >> > # fg
> >> > curl -H "Connection: close" http://localhost:9283/metrics
> >> > <!DOCTYPE html PUBLIC
> >> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >> > <html>
> >> > <head>
> >> >     <meta http-equiv="Content-Type" content="text/html;
> >> >     charset=utf-8"></meta>
> >> >     <title>503 Service Unavailable</title>
> >> >     <style type="text/css">
> >> >     #powered_by {
> >> >         margin-top: 20px;
> >> >         border-top: 2px solid black;
> >> >         font-style: italic;
> >> >     }
> >> >     #traceback {
> >> >         color: red;
> >> >     }
> >> >     </style>
> >> > </head>
> >> > <body>
> >> >     <h2>503 Service Unavailable</h2>
> >> >     <p>No MON connection</p>
> >> >     <pre id="traceback">Traceback (most recent call last):
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
> >> >     response.body = self.handler()
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
> >> >     self.body = self.oldhandler(*args, **kwargs)
> >> >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
> >> >     return self.callable(*self.args, **self.kwargs)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> >> >     return self._metrics(instance)
> >> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> >> >     raise cherrypy.HTTPError(503, 'No MON connection')
> >> > HTTPError: (503, 'No MON connection')
> >> > </pre>
> >> >     <div id="powered_by">
> >> >       <span>
> >> >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
> >> >       </span>
> >> >     </div>
> >> > </body>
> >> > </html>
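To make the workarounds discussed above concrete, here is a rough sketch using stock Ceph and curl commands. The daemon name is the active mgr from Paul's status output, and mgr/prometheus/scrape_interval is documented for recent Nautilus releases but may not exist on Mimic, so treat that last line as something to verify rather than a given:

# Probe the exporter with a hard timeout instead of letting curl hang forever
curl --max-time 30 -sS http://localhost:9283/metrics | head

# Hand the active role to a standby mgr instead of kill -9'ing the daemon
ceph mgr fail woodenbox2

# Temporarily take the prometheus module out of the picture...
ceph mgr module disable prometheus
# ...and later re-enable it, optionally with a longer scrape interval
ceph mgr module enable prometheus
ceph config set mgr mgr/prometheus/scrape_interval 60

Failing over to a standby at least gets a responsive mgr back without the kill -9 dance, though as the thread shows it only helps temporarily.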
> >> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> We are running Mimic 13.2.8 with our cluster, and since upgrading
> >> >> to 13.2.8 the Prometheus plugin seems to hang a lot. It used to
> >> >> respond in under 10s, but now it often hangs. Restarting the mgr
> >> >> processes helps temporarily, but within minutes it gets stuck
> >> >> again.
> >> >>
> >> >> The active mgr doesn't exit when doing "systemctl stop
> >> >> ceph-mgr.target" and needs to be kill -9'ed.
> >> >>
> >> >> Is there anything I can do to address this issue, or at least get
> >> >> better visibility into it?
> >> >>
> >> >> We only have a few plugins enabled:
> >> >> $ ceph mgr module ls
> >> >> {
> >> >>     "enabled_modules": [
> >> >>         "balancer",
> >> >>         "prometheus",
> >> >>         "zabbix"
> >> >>     ],
> >> >>
> >> >> There are 3 mgr processes, but it's a pretty large cluster (nearly
> >> >> 4000 OSDs), and it's a busy one with lots of rebalancing. (I don't
> >> >> know if a busy cluster would seriously affect the mgr's
> >> >> performance, but I'm just throwing it out there.)
> >> >>
> >> >>   services:
> >> >>     mon: 5 daemons, quorum
> >> >>          woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
> >> >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
> >> >>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
> >> >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
> >> >>     rgw: 4 daemons active
> >> >>
> >> >> Thanks in advance for your help,
> >> >>
> >> >> -Paul Choi
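Regarding the "better visibility" question above, a possible starting point (my own suggestion, not something from the thread; the daemon name is the active mgr from the status output, and the commands assume Mimic or later, where "ceph config set" exists):

# Raise mgr log verbosity while reproducing the hang
ceph config set mgr debug_mgr 10
ceph config set mgr debug_ms 1

# On the host running the active mgr, dump its perf counters via the admin socket
ceph daemon mgr.woodenbox2 perf dump

# Afterwards, reset the debug levels to their defaults
ceph config rm mgr debug_mgr
ceph config rm mgr debug_ms

The perf counters and a more verbose mgr log are about the only generic windows into what the module threads are doing; whether they reveal anything useful here is, of course, not guaranteed.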