I’m actually very curious how well this is performing for you, as I’ve definitely not seen a deployment this large. How do you use it?

> On Mar 27, 2020, at 11:47 AM, shubjero <shubjero@xxxxxxxxx> wrote:
>
> I've reported stability problems with ceph-mgr with the prometheus plugin enabled on all versions we ran in production, which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size with about 9PB raw and around 35% utilization.
>
> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny.
>>
>> On 26/03/2020 18:10, Paul Choi wrote:
>>> I won't speculate more about the MDS's stability, but I do wonder about the same thing.
>>> There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so it was very noticeable. The only fix was to fail over the MDS.
>>>
>>> Since the free disk space dropped, I haven't heard anyone complain... <shrug>
>>>
>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>> If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>>>
>>> On 26/03/2020 17:32, Paul Choi wrote:
>>>> I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB.
>>>> I wonder if there's some metric that exceeds the maximum size of some int, double, etc.?
>>>>
>>>> -Paul
>>>>
>>>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>> I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8.
>>>>
>>>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>> I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time.
>>>>>
>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>> Hi Janek,
>>>>>>
>>>>>> What version of Ceph are you using?
>>>>>> We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there.
>>>>>> I won't speculate further than this, but perhaps Nautilus doesn't have the same issue as Mimic?
>>>>>>
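Two of the replies above work around the problem by turning the in-mgr exporter off entirely. For anyone who wants to do the same, here is a minimal sketch, assuming the stock Mimic/Nautilus "ceph mgr module" CLI; the scrape_interval option may not exist on every release, so check "ceph config help mgr/prometheus/scrape_interval" first.

    # confirm the module is enabled, then switch it off
    $ ceph mgr module ls | head
    $ ceph mgr module disable prometheus

    # if you keep the module, a longer scrape/cache interval reduces load on the mgr
    $ ceph config set mgr mgr/prometheus/scrape_interval 60

    # re-enable once the mgr is stable again
    $ ceph mgr module enable prometheus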
>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but it's pretty slow and so is the dashboard. Unfortunately, nobody seems to have a solution for this, and I wonder why more people aren't complaining about the problem.
>>>>>>
>>>>>> On 20/03/2020 19:30, Paul Choi wrote:
>>>>>>> If I "curl http://localhost:9283/metrics" and wait long enough, I get the output below, which says "No MON connection". But the mons are healthy and the cluster is functioning fine.
>>>>>>> That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint hanging seems to happen regardless of the mon size anyhow.
>>>>>>>
>>>>>>>     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>
>>>>>>> # fg
>>>>>>> curl -H "Connection: close" http://localhost:9283/metrics
>>>>>>> <!DOCTYPE html PUBLIC
>>>>>>> "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>>>>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>>>>>>> <html>
>>>>>>> <head>
>>>>>>>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
>>>>>>>     <title>503 Service Unavailable</title>
>>>>>>>     <style type="text/css">
>>>>>>>     #powered_by {
>>>>>>>         margin-top: 20px;
>>>>>>>         border-top: 2px solid black;
>>>>>>>         font-style: italic;
>>>>>>>     }
>>>>>>>
>>>>>>>     #traceback {
>>>>>>>         color: red;
>>>>>>>     }
>>>>>>>     </style>
>>>>>>> </head>
>>>>>>> <body>
>>>>>>>     <h2>503 Service Unavailable</h2>
>>>>>>>     <p>No MON connection</p>
>>>>>>>     <pre id="traceback">Traceback (most recent call last):
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
>>>>>>>     response.body = self.handler()
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
>>>>>>>     self.body = self.oldhandler(*args, **kwargs)
>>>>>>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
>>>>>>>     return self.callable(*self.args, **self.kwargs)
>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
>>>>>>>     return self._metrics(instance)
>>>>>>>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
>>>>>>>     raise cherrypy.HTTPError(503, 'No MON connection')
>>>>>>> HTTPError: (503, 'No MON connection')
>>>>>>> </pre>
>>>>>>>     <div id="powered_by">
>>>>>>>     <span>
>>>>>>>     Powered by <a href="http://www.cherrypy.org">CherryPy 3.5.0</a>
>>>>>>>     </span>
>>>>>>>     </div>
>>>>>>> </body>
>>>>>>> </html>
>>>>>>>
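A side note on the curl test quoted above: giving curl a hard timeout keeps the probe itself from blocking, and failing the active mgr over to a standby usually brings the endpoint back for a while. A rough sketch, assuming the default exporter port 9283 and the woodenbox2 active mgr named later in this thread:

    # probe the exporter with a hard timeout instead of letting curl hang
    $ curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' http://localhost:9283/metrics

    # find the active mgr and force a failover to a standby if the endpoint is wedged
    $ ceph mgr dump | grep active_name
    $ ceph mgr fail woodenbox2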
>>>>>>> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi <pchoi@xxxxxxx> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond in under 10s, but now it often hangs. Restarting the mgr processes helps temporarily, but within minutes it gets stuck again.
>>>>>>>>
>>>>>>>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and needs to be kill -9'ed.
>>>>>>>>
>>>>>>>> Is there anything I can do to address this issue, or at least get better visibility into it?
>>>>>>>>
>>>>>>>> We only have a few plugins enabled:
>>>>>>>>
>>>>>>>> $ ceph mgr module ls
>>>>>>>> {
>>>>>>>>     "enabled_modules": [
>>>>>>>>         "balancer",
>>>>>>>>         "prometheus",
>>>>>>>>         "zabbix"
>>>>>>>>     ],
>>>>>>>>
>>>>>>>> There are 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and a busy one with lots of rebalancing. (I don't know whether a busy cluster would seriously affect the mgr's performance, but I'm just throwing it out there.)
>>>>>>>>
>>>>>>>>   services:
>>>>>>>>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>>>>>>>>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>>>>>>>>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>>>>>>>>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>>>>>>>>     rgw: 4 daemons active
>>>>>>>>
>>>>>>>> Thanks in advance for your help,
>>>>>>>>
>>>>>>>> -Paul Choi
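On the stuck-shutdown behaviour described above: the forced restart usually looks roughly like the sketch below. It assumes systemd-managed daemons (ceph-mgr.target) and is illustrative only; adjust unit and process names to your deployment.

    # ask systemd to stop the mgr daemons, then fall back to SIGKILL
    # if the active mgr is still running after a grace period
    $ systemctl stop ceph-mgr.target
    $ sleep 30
    $ pgrep -a ceph-mgr && pkill -9 -x ceph-mgr
    $ systemctl start ceph-mgr.target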