Hi,
we have also been facing problems with the MGR; we had to switch off the
balancer and pg_autoscaler because the active MGR would end up using a
whole CPU, resulting in a hanging dashboard and hanging ceph commands.
There are several similar threads on the ML, e.g. [1] and [2].
I'm not aware of a solution yet, so I'll stick with the balancer disabled
for now since the current PG placement is fine.
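For reference, this is roughly what disabling both looks like (a sketch,
not necessarily the exact commands we ran; <pool> is a placeholder):

  # turn off the automatic balancer module
  ceph balancer off
  # disable the PG autoscaler for a given pool
  ceph osd pool set <pool> pg_autoscale_mode off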
Regards,
Eugen
[1] https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg56994.html
[2] https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg56890.html
Quoting Bryan Stillwell <bstillwell@xxxxxxxxxxx>:
On multiple clusters we are seeing the mgr hang frequently when the
balancer is enabled. It seems that the balancer is getting caught in
some kind of infinite loop that chews up all the CPU for the mgr,
which causes problems with other modules like prometheus (we don't
have the devicehealth module enabled yet).
I've been able to reproduce the issue with an offline balance as well,
using osdmaptool:

osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh \
    --upmap-pool default.rgw.buckets.data --upmap-max 100
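For anyone wanting to try the same thing, an osdmap can be exported from
a live cluster with something like the following (the filename is
arbitrary):

  # grab the cluster's current osdmap into a local file
  ceph osd getmap -o osd.map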
It seems to loop over the same group of ~7,000 PGs over and over again
like this without finding any new upmaps that can be added:
2019-11-19 16:39:11.131518 7f85a156f300 10 trying 24.d91
2019-11-19 16:39:11.138035 7f85a156f300 10 trying 24.2e3c
2019-11-19 16:39:11.144162 7f85a156f300 10 trying 24.176b
2019-11-19 16:39:11.149671 7f85a156f300 10 trying 24.ac6
2019-11-19 16:39:11.155115 7f85a156f300 10 trying 24.2cb2
2019-11-19 16:39:11.160508 7f85a156f300 10 trying 24.129c
2019-11-19 16:39:11.166287 7f85a156f300 10 trying 24.181f
2019-11-19 16:39:11.171737 7f85a156f300 10 trying 24.3cb1
2019-11-19 16:39:11.177260 7f85a156f300 10 24.2177 already has pg_upmap_items [368,271]
2019-11-19 16:39:11.177268 7f85a156f300 10 trying 24.2177
2019-11-19 16:39:11.182590 7f85a156f300 10 trying 24.a4
2019-11-19 16:39:11.188053 7f85a156f300 10 trying 24.2583
2019-11-19 16:39:11.193545 7f85a156f300 10 24.93e already has pg_upmap_items [80,27]
2019-11-19 16:39:11.193553 7f85a156f300 10 trying 24.93e
2019-11-19 16:39:11.198858 7f85a156f300 10 trying 24.e67
2019-11-19 16:39:11.204224 7f85a156f300 10 trying 24.16d9
2019-11-19 16:39:11.209844 7f85a156f300 10 trying 24.11dc
2019-11-19 16:39:11.215303 7f85a156f300 10 trying 24.1f3d
2019-11-19 16:39:11.221074 7f85a156f300 10 trying 24.2a57
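The "already has pg_upmap_items" entries correspond to upmap exceptions
already stored in the osdmap; on a live cluster they can be listed with
something like:

  # show the upmap exceptions currently recorded in the osdmap
  ceph osd dump | grep pg_upmap_items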
While this cluster is running Luminous (12.2.12), I've reproduced
the loop using the same osdmap on Nautilus (14.2.4). Is there
somewhere I can privately upload the osdmap for someone to
troubleshoot the problem?
Thanks,
Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx