Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.
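
In case it is useful to anyone following this thread: the workaround on our
side was simply to switch the module off for now and look into scraping the
cluster with an external exporter instead (see below). Roughly, as a sketch:

    # see which mgr modules are currently enabled
    ceph mgr module ls
    # disable the built-in exporter until the hangs are understood
    ceph mgr module disable prometheus
    # re-enable it once things are stable again
    ceph mgr module enable prometheus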


On 26/03/2020 18:10, Paul Choi wrote:
> I won't speculate further about the MDS's stability, but I do wonder about
> the same thing.
> There was one file served by the MDS that would cause the ceph-fuse
> client to hang. It was a file that many people in the company relied
> on for data updates, so the problem was very noticeable. The only fix was
> to fail over the MDS.
>
> Since the free disk space dropped, I haven't heard anyone complain...
> <shrug>
>
> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
>     If there is actually a connection, then it's no wonder our MDS
>     kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>
>
>     On 26/03/2020 17:32, Paul Choi wrote:
>>     I can't quite explain what happened, but the Prometheus endpoint
>>     became stable after the free disk space for the largest pool dropped
>>     substantially below 1 PB.
>>     I wonder if there's some metric that exceeds the maximum value of
>>     some int, double, etc.?
>>
>>     -Paul
>>
>>     On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>>     <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>>         I haven't seen any MGR hangs so far since I disabled the prometheus
>>         module. It seems like the module is not only slow, but kills the
>>         whole MGR when the cluster is sufficiently large, so these two
>>         issues are most likely connected. The issue has become much, much
>>         worse with 14.2.8.
>>
>>
>>         On 23/03/2020 09:00, Janek Bevendorff wrote:
>>         > I am running the very latest version of Nautilus. I will try
>>         > setting up an external exporter today and see if that fixes
>>         > anything. Our cluster is somewhat large-ish with 1248 OSDs, so I
>>         > expect stat collection to take "some" time, but it definitely
>>         > shouldn't crash the MGRs all the time.
>>         >
>>         > On 21/03/2020 02:33, Paul Choi wrote:
>>         >> Hi Janek,
>>         >>
>>         >> What version of Ceph are you using?
>>         >> We also have a much smaller cluster running Nautilus, with no
>>         >> MDS. No Prometheus issues there.
>>         >> I won't speculate further than this, but perhaps Nautilus
>>         >> doesn't have the same issue as Mimic?
>>         >>
>>         >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>         >> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>         >>
>>         >>     I think this is related to my previous post to this list
>>         >>     about MGRs failing regularly and being overall quite slow to
>>         >>     respond. The problem has existed before, but the new version
>>         >>     has made it way worse. My MGRs keep dying every few hours and
>>         >>     need to be restarted. The Prometheus plugin works, but it's
>>         >>     pretty slow and so is the dashboard.
>>         >>     Unfortunately, nobody seems to have a solution for this, and
>>         >>     I wonder why more people are not complaining about this
>>         >>     problem.
>>         >>
>>         >>
>>         >>     On 20/03/2020 19:30, Paul Choi wrote:
>>         >>     > If I "curl http://localhost:9283/metrics"; and wait
>>         sufficiently long
>>         >>     > enough, I get this - says "No MON connection". But
>>         the mons are
>>         >>     health and
>>         >>     > the cluster is functioning fine.
>>         >>     > That said, the mons' rocksdb sizes are fairly big
>>         because
>>         >>     there's lots of
>>         >>     > rebalancing going on. The Prometheus endpoint
>>         hanging seems to
>>         >>     happen
>>         >>     > regardless of the mon size anyhow.
>>         >>     >
>>         >>     >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>         >>     >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>         >>     >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>         >>     >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>         >>     >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>         >>     >
>>         >>     > # fg
>>         >>     > curl -H "Connection: close" http://localhost:9283/metrics
>>         >>     > <!DOCTYPE html PUBLIC
>>         >>     > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>         >>     > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>         >>     > <html>
>>         >>     > <head>
>>         >>     >     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>         >>     >     <title>503 Service Unavailable</title>
>>         >>     >     <style type="text/css">
>>         >>     >     #powered_by {
>>         >>     >         margin-top: 20px;
>>         >>     >         border-top: 2px solid black;
>>         >>     >         font-style: italic;
>>         >>     >     }
>>         >>     >
>>         >>     >     #traceback {
>>         >>     >         color: red;
>>         >>     >     }
>>         >>     >     </style>
>>         >>     > </head>
>>         >>     >     <body>
>>         >>     >         <h2>503 Service Unavailable</h2>
>>         >>     >         <p>No MON connection</p>
>>         >>     >         <pre id="traceback">Traceback (most recent call last):
>>         >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>         >>     >     response.body = self.handler()
>>         >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>         >>     >     self.body = self.oldhandler(*args, **kwargs)
>>         >>     >   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>         >>     >     return self.callable(*self.args, **self.kwargs)
>>         >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>         >>     >     return self._metrics(instance)
>>         >>     >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>         >>     >     raise cherrypy.HTTPError(503, 'No MON connection')
>>         >>     > HTTPError: (503, 'No MON connection')
>>         >>     > </pre>
>>         >>     >     <div id="powered_by">
>>         >>     >       <span>
>>         >>     >         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>         >>     >       </span>
>>         >>     >     </div>
>>         >>     >     </body>
>>         >>     > </html>
>>         >>     >
>>         >>     > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>         >>     >
>>         >>     >> Hello,
>>         >>     >>
>>         >>     >> We are running Mimic 13.2.8 on our cluster, and since
>>         >>     >> upgrading to 13.2.8 the Prometheus plugin seems to hang a
>>         >>     >> lot. It used to respond in under 10 s, but now it often
>>         >>     >> hangs. Restarting the mgr processes helps temporarily, but
>>         >>     >> within minutes it gets stuck again.
>>         >>     >>
>>         >>     >> The active mgr doesn't exit when doing "systemctl stop
>>         >>     >> ceph-mgr.target" and needs to be kill -9'ed.
>>         >>     >>
>>         >>     >> Is there anything I can do to address this issue, or at
>>         >>     >> least get better visibility into it?
>>         >>     >>
>>         >>     >> We only have a few plugins enabled:
>>         >>     >> $ ceph mgr module ls
>>         >>     >> {
>>         >>     >>     "enabled_modules": [
>>         >>     >>         "balancer",
>>         >>     >>         "prometheus",
>>         >>     >>         "zabbix"
>>         >>     >>     ],
>>         >>     >>
>>         >>     >> 3 mgr processes, but it's a pretty large cluster (nearly
>>         >>     >> 4000 OSDs) and a busy one with lots of rebalancing. (I
>>         >>     >> don't know if a busy cluster would seriously affect the
>>         >>     >> mgr's performance, but just throwing it out there.)
>>         >>     >>
>>         >>     >>   services:
>>         >>     >>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>         >>     >>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>         >>     >>     mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
>>         >>     >>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>         >>     >>     rgw: 4 daemons active
>>         >>     >>
>>         >>     >> Thanks in advance for your help,
>>         >>     >>
>>         >>     >> -Paul Choi
>>         >>     >>
>>         >>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



