On 9/2/19 5:47 PM, Jake Grimmett wrote:
Hi Konstantin,
To confirm, disabling the balancer allows the mgr to work properly.
I tried re-enabling the balancer; it briefly worked, then locked up the
mgr again.
Here it's working OK...
[root@ceph-s1 ~]# time ceph balancer optimize new
real 0m1.628s
user 0m0.583s
sys 0m0.075s
[root@ceph-s1 ~]# ceph balancer status
{
"active": false,
"plans": [
"new"
],
"mode": "upmap"
}
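(Editor's aside, not from the original mail: when the always-on balancer wedges the mgr, the same optimization can be driven manually with one-shot plan commands. These are standard `ceph balancer` subcommands; the plan name is arbitrary.)

```shell
# One-shot balancer workflow: inspect a plan before any data moves,
# without leaving the automatic balancer enabled.
ceph balancer eval            # score the current distribution
ceph balancer optimize myplan # compute a plan ("myplan" is arbitrary)
ceph balancer show myplan     # inspect the upmap commands it would run
ceph balancer eval myplan     # score the cluster as if the plan had run
ceph balancer execute myplan  # apply it once
ceph balancer rm myplan       # clean up the stored plan
```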
[root@ceph-s1 ~]# ceph balancer on
At this point, the balancer initially seems to be working, as 'ceph -s'
shows the misplaced count climbing from 0 to ...
pgs: 6829497/4977639365 objects misplaced (0.137%)
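(As a quick sanity check on that line, not part of the original mail: the percentage is just misplaced objects over total objects.)

```python
# Misplaced percentage from the `ceph -s` line above:
# misplaced objects / total objects, expressed as a percentage.
misplaced = 6_829_497
total = 4_977_639_365
pct = 100 * misplaced / total
print(f"{pct:.3f}%")  # matches the reported 0.137%
```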
However, the mgr now goes back up to 100% CPU, and stopping the
balancer is very difficult:
[root@ceph-s1 ~]# time ceph balancer off
real 5m37.641s
user 0m0.751s
sys 0m0.158s
[root@ceph-s1 ~]# time ceph balancer optimize new
real 18m19.202s
user 0m1.388s
sys 0m0.413s
Here is the other data you requested:
[root@ceph-s1 ~]# ceph config-key ls | grep balance
"config-history/10/+mgr/mgr/balancer/active",
"config-history/29/+mgr/mgr/balancer/active",
"config-history/29/-mgr/mgr/balancer/active",
"config-history/30/+mgr/mgr/balancer/active",
"config-history/30/-mgr/mgr/balancer/active",
"config-history/31/+mgr/mgr/balancer/active",
"config-history/31/-mgr/mgr/balancer/active",
"config-history/32/+mgr/mgr/balancer/active",
"config-history/32/-mgr/mgr/balancer/active",
"config-history/33/+mgr/mgr/balancer/active",
"config-history/33/-mgr/mgr/balancer/active",
"config-history/9/+mgr/mgr/balancer/mode",
"config/mgr/mgr/balancer/active",
"config/mgr/mgr/balancer/mode",
We have two main pools:
Pool #1 is 3x replicated, has 4 NVMe OSDs, and is used only for CephFS
metadata. It sits on 4 nodes (which also run the mgr, mon and mds).
Pool #2 is erasure-coded 8+2, has 324 x 12TB OSDs across 36 nodes, and
holds the CephFS data. All OSDs in pool 2 have their DB/WAL on NVMe
(6 HDDs per NVMe).
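(A back-of-envelope capacity sketch for the data pool, not from the original mail; it assumes round 12 TB drives and ignores BlueStore overhead and full ratios.)

```python
# Rough usable capacity of the EC 8+2 data pool:
# k data chunks out of k+m total chunks, so efficiency = k / (k + m).
k, m = 8, 2
osds, osd_tb = 324, 12.0
raw_tb = osds * osd_tb           # 3888 TB raw
efficiency = k / (k + m)         # 0.8 for 8+2
usable_tb = raw_tb * efficiency  # ~3110 TB before overheads
print(raw_tb, efficiency, usable_tb)
```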
'ceph df detail' is here:
http://p.ip.fi/4l4m
'ceph osd tree' is here:
http://p.ip.fi/k1x2
'ceph osd df tree' output is here:
http://p.ip.fi/g7ma
any help appreciated,
Jake, you already have good VAR (usage variance) for your OSDs.
I suggest setting `mgr/balancer/upmap_max_deviation` to '2', and
setting `mgr/balancer/sleep_interval` to '300'.
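(Editor's sketch, not from the original mail: balancer module options can be applied through the centralized config store, which is what backs the `config/mgr/mgr/balancer/*` keys shown in the `config-key ls` output above. The key name here is illustrative; verify it against your release first.)

```shell
# Apply a balancer module setting via the centralized config store
# (Nautilus-style syntax; verify the option name before applying).
ceph config set mgr mgr/balancer/sleep_interval 300
# Confirm what the mgr will actually use:
ceph config get mgr mgr/balancer/sleep_interval
```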
k
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com