> I'm actually very curious how well this is performing for you, as I've definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11 PiB capacity, of which about 15% is used at the moment (web-scale corpora and such). We have deployed 5 MONs and 5 MGRs (both on the same hosts) and it works totally fine overall. We have some MDS performance issues here and there, but that's not too bad anymore after a few upstream patches. And then we have this annoying Prometheus MGR problem, which reliably kills our MGRs after a few hours.

>
>> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero@xxxxxxxxx> wrote:
>>
>> I've reported stability problems with ceph-mgr with the prometheus plugin enabled on all versions we ran in production, which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size with about 9 PB raw and around 35% utilization.
>>
>> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>> Sorry, I meant MGR of course. The MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>>>
>>> On 26/03/2020 18:10, Paul Choi wrote:
>>>> I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>>>
>>>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>>>
>>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2 PiB of available space at the moment.
>>>>
>>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1 PB. I wonder if there's some metric that exceeds the maximum size of some int, double, etc.?
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
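For reference, the workaround described above amounts to toggling the mgr module from the CLI. This is only a minimal sketch, assuming the standard Mimic/Nautilus tooling:

    $ ceph mgr module disable prometheus   # stop the mgr from serving /metrics
    $ ceph mgr module ls                   # "prometheus" should now be gone from "enabled_modules"
    $ ceph mgr module enable prometheus    # re-enable later, once an external exporter or a fixed release is in place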
>>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>>
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow, and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why more people aren't complaining about this problem.
>>>>>>>
>>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>>> If I "curl http://localhost:9283/metrics" and wait sufficiently long, I get the output below, which says "No MON connection". But the mons are healthy and the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>>
>>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>>
>>>>>>>> # fg
>>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>>> <html>
>>>>>>>> <head>
>>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>>     <style type="text/css">
>>>>>>>>     #powered_by {
>>>>>>>>         margin-top: 20px;
>>>>>>>>         border-top: 2px solid black;
>>>>>>>>         font-style: italic;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     #traceback {
>>>>>>>>         color: red;
>>>>>>>>     }
>>>>>>>>     </style>
>>>>>>>> </head>
>>>>>>>>     <body>
>>>>>>>>         <h2>503 Service Unavailable</h2>
>>>>>>>>         <p>No MON connection</p>
>>>>>>>>         <pre id="traceback">Traceback (most recent call last):
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>>     response.body = self.handler()
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>>     return self._metrics(instance)
>>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>>> </pre>
>>>>>>>>     <div id="powered_by">
>>>>>>>>     <span>
>>>>>>>>         Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>>     </span>
>>>>>>>>     </div>
>>>>>>>>     </body>
>>>>>>>> </html>
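One way to get a bit more visibility into when the endpoint wedges is to poll it with a hard timeout and record only the HTTP status code and response time. This is a rough sketch, not an official tool; the 30-second timeout and 60-second interval are arbitrary choices:

    while true; do
        printf '%s ' "$(date +%FT%T)"       # timestamp for each probe
        curl -sS -m 30 -o /dev/null \
             -w '%{http_code} %{time_total}s\n' http://localhost:9283/metrics
        sleep 60
    done

A probe that prints 000 after the full 30 s indicates the endpoint hung rather than returning the 503 shown above.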
>>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10 s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
>>>>>>>>>
>>>>>>>>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and needs to be kill -9'ed.
>>>>>>>>>
>>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>>
>>>>>>>>> We only have a few plugins enabled:
>>>>>>>>>
>>>>>>>>> $ ceph mgr module ls
>>>>>>>>> {
>>>>>>>>>     "enabled_modules": [
>>>>>>>>>         "balancer",
>>>>>>>>>         "prometheus",
>>>>>>>>>         "zabbix"
>>>>>>>>>     ],
>>>>>>>>>
>>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know if a busy cluster would seriously affect the mgr's performance, but just throwing it out there.)
>>>>>>>>>
>>>>>>>>>   services:
>>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>>     rgw: 4 daemons active
>>>>>>>>>
>>>>>>>>> Thanks in advance for your help,
>>>>>>>>>
>>>>>>>>> -Paul Choi
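The restart workaround Paul describes (systemctl stop followed by a manual kill) roughly looks like the following. A sketch only; woodenbox2 is the active mgr from the status output above, so substitute your own daemon name:

    $ systemctl stop ceph-mgr.target    # ask the active mgr to shut down
    $ pgrep -a ceph-mgr                 # check whether the process actually exited
    $ pkill -9 ceph-mgr                 # if it is still hanging, force-kill it
    $ systemctl start ceph-mgr.target   # bring the daemon back up

Alternatively, `ceph mgr fail woodenbox2` hands the active role over to a standby without touching systemd, which can be less disruptive while debugging.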