Just a follow-up 24 hours later: the mgrs seem to be far more stable, and have had no issues or weirdness since disabling the balancer module.
Which isn't ideal, because the balancer plays an important role, but after fighting data distribution for a few weeks and getting it 'good enough', I'm taking the stability.
Just wanted to follow up with another 2¢.
Reed
Just to further piggyback,
Probably the hardest the mgr gets pushed is when the balancer is engaged. When evaluating a pool or cluster, it takes upwards of 30-120 seconds to score it, then another 30-120 seconds to execute the plan, and it never seems to engage automatically.
$ time ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}

real    0m36.490s
user    0m0.259s
sys     0m0.044s
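For anyone unfamiliar with the manual workflow being timed above, a minimal sketch (the plan name `myplan` is just an example; run against a live cluster):

```shell
# Score the current data distribution (lower scores are better)
ceph balancer eval

# Build an optimization plan, inspect it, then apply it
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer execute myplan
```

On a large cluster each of these steps can take the 30-120 seconds described above, which is where the mgr load shows up.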
I'm going to disable mine as well, and see if I can stop waking up to 'No Active MGR.'
<PastedGraphic-2.png>
You can see when I lose mgrs because RBD image stats drop to 0 until I catch it.
Thanks,
Reed
Hi Reed, Lenz, John,

I've just tried disabling the balancer; so far ceph-mgr is keeping its CPU mostly under 20%, even with both the iostat and dashboard modules back on.

# ceph balancer off

Before:

[root@ceph-s1 backup]# ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}

After:

[root@ceph-s1 backup]# ceph balancer status
{
    "active": false,
    "plans": [],
    "mode": "upmap"
}

We are using 8:2 erasure coding across 324 12TB OSDs, plus 4 NVMe OSDs for a replicated cephfs metadata pool.

Let me know if the balancer is your problem too...

best,
Jake

On 8/27/19 3:57 PM, Jake Grimmett wrote:
Yes, the problem still occurs with the dashboard disabled...
Possibly relevant, when both the dashboard and iostat plugins are disabled, I occasionally see ceph-mgr rise to 100% CPU.
As suggested by John Hearns, the output of gstack on ceph-mgr while at 100% CPU is here:
http://p.ip.fi/52sV
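In case anyone else wants to capture the same trace, a minimal sketch (assumes gdb/gstack is installed and a single ceph-mgr process is running on the host):

```shell
# Dump stack traces of all ceph-mgr threads while it is pinned at 100% CPU
gstack $(pidof ceph-mgr) > ceph-mgr-stack.txt
```

Capturing it two or three times a few seconds apart makes it easier to spot which thread is actually spinning.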
many thanks
Jake
On 8/27/19 3:09 PM, Reed Dier wrote:
I'm currently seeing this with the dashboard disabled.
My instability decreases, but isn't wholly cured, by disabling prometheus and rbd_support, which I use in tandem; the only thing I'm using the Prometheus exporter for is the per-RBD metrics.
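For anyone following along, a minimal sketch of toggling the modules mentioned above (module names as they appear in `ceph mgr module ls`):

```shell
# Disable the suspected modules...
ceph mgr module disable prometheus
ceph mgr module disable rbd_support

# ...and re-enable them once the mgr is stable again
ceph mgr module enable prometheus
ceph mgr module enable rbd_support
```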
$ ceph mgr module ls
{
    "enabled_modules": [
        "diskprediction_local",
        "influx",
        "iostat",
        "prometheus",
        "rbd_support",
        "restful",
        "telemetry"
    ],
I'm on Ubuntu 18.04, so this doesn't support any correlation with a particular OS.
Thanks,
Reed
On Aug 27, 2019, at 8:37 AM, Lenz Grimmer <lgrimmer@xxxxxxxx> wrote:
Hi Jake,
On 8/27/19 3:22 PM, Jake Grimmett wrote:
That exactly matches what I'm seeing:
when iostat is working OK, I see ~5% CPU use by ceph-mgr, and when iostat freezes, ceph-mgr CPU increases to 100%
Does this also occur if the dashboard module is disabled? Just wondering if this can be isolated to the iostat module. Thanks!
Lenz
--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge CB2 0QH, UK.