Just a follow-up 24 hours later: the mgrs seem to be far more stable, and have had no issues or weirdness since disabling the balancer module.
Which isn't ideal, because the balancer plays an important role, but after fighting data distribution for a few weeks and getting it 'good enough', I'm taking the stability.
Just wanted to follow up with another 2¢.
Reed
Just to further piggyback,
Probably the hardest the mgr gets pushed is when the balancer is engaged. When evaluating a pool or cluster, it takes upwards of 30-120 seconds to score it, then another 30-120 seconds to execute the plan, and it never seems to engage automatically.
$ time ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}

real    0m36.490s
user    0m0.259s
sys     0m0.044s
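For anyone unfamiliar with the manual workflow being timed above, a minimal sketch (the plan name `myplan` is just an example; run against a live cluster):

```shell
# Score the current data distribution (lower scores are better)
ceph balancer eval

# Build an optimization plan, inspect it, then apply it
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer execute myplan
```

On a large cluster each of these steps can take the 30-120 seconds described above, which is where the mgr load shows up.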
I'm going to disable mine as well, and see if I can stop waking up to 'No Active MGR.'
<PastedGraphic-2.png>
You can see when I lose mgrs because RBD image stats drop to 0 until I catch it.
Thanks,
Reed
Hi Reed, Lenz, John,

I've just tried disabling the balancer; so far ceph-mgr is keeping its CPU mostly under 20%, even with both the iostat and dashboard modules back on.

# ceph balancer off

Before:

[root@ceph-s1 backup]# ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}

After:

[root@ceph-s1 backup]# ceph balancer status
{
    "active": false,
    "plans": [],
    "mode": "upmap"
}

We are using 8:2 erasure coding across 324 12TB OSDs, plus 4 NVMe OSDs for a replicated cephfs metadata pool.

Let me know if the balancer is your problem too...

best,
Jake

On 8/27/19 3:57 PM, Jake Grimmett wrote:
Yes, the problem still occurs with the dashboard disabled...
Possibly relevant, when both the dashboard and iostat plugins are disabled, I occasionally see ceph-mgr rise to 100% CPU.
As suggested by John Hearns, the output of gstack on ceph-mgr while at 100% CPU is here:
http://p.ip.fi/52sV
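In case anyone else wants to capture the same trace, a minimal sketch (assumes gdb/gstack is installed and a single ceph-mgr process is running on the host):

```shell
# Dump stack traces of all ceph-mgr threads while it is pinned at 100% CPU
gstack $(pidof ceph-mgr) > ceph-mgr-stack.txt
```

Capturing it two or three times a few seconds apart makes it easier to spot which thread is actually spinning.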
many thanks
Jake
On 8/27/19 3:09 PM, Reed Dier wrote:
I'm currently seeing this with the dashboard disabled.
My instability decreases, but isn't wholly cured, by disabling prometheus and rbd_support, which I use in tandem; the only thing I'm using the Prometheus exporter for is the per-RBD metrics.
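For anyone following along, a minimal sketch of toggling the modules mentioned above (module names as they appear in `ceph mgr module ls`):

```shell
# Disable the suspected modules...
ceph mgr module disable prometheus
ceph mgr module disable rbd_support

# ...and re-enable them once the mgr is stable again
ceph mgr module enable prometheus
ceph mgr module enable rbd_support
```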
$ ceph mgr module ls
{
    "enabled_modules": [
        "diskprediction_local",
        "influx",
        "iostat",
        "prometheus",
        "rbd_support",
        "restful",
        "telemetry"
    ],
I'm on Ubuntu 18.04, so this doesn't support any correlation with a particular OS.
Thanks,
Reed
On Aug 27, 2019, at 8:37 AM, Lenz Grimmer <lgrimmer@xxxxxxxx> wrote:
Hi Jake,
On 8/27/19 3:22 PM, Jake Grimmett wrote:
That exactly matches what I'm seeing:
when iostat is working OK, I see ~5% CPU use by ceph-mgr, and when iostat freezes, ceph-mgr CPU increases to 100%
Does this also occur if the dashboard module is disabled? Just wondering if this can be isolated to the iostat module. Thanks!
Lenz
--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge CB2 0QH, UK.