Re: mgr hangs with upmap balancer

Thanks Eugen,

I created this bug report to track the issue, in case you want to watch it:

https://tracker.ceph.com/issues/42971

Bryan

> On Nov 22, 2019, at 6:34 AM, Eugen Block <eblock@xxxxxx> wrote:
> 
> 
> Hi,
> 
> we have also been facing problems with the MGR: we had to switch off
> the balancer and the pg_autoscaler because the active MGR would end up
> using a whole CPU core, leaving the dashboard and ceph commands
> hanging. There are several similar threads on the ML, e.g. [1] and [2].
> 
> I'm not aware of a solution yet, so I'll stick with the balancer
> disabled for now since the current PG placement is fine.
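> 
> For reference, switching both off usually looks something like the
> following on Nautilus (exact syntax may vary by release, so treat this
> as a sketch rather than the literal commands we ran):
> 
>   # stop the upmap balancer from making further changes
>   ceph balancer off
>   # disable the pg_autoscaler mgr module entirely
>   ceph mgr module disable pg_autoscaler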
> 
> Regards,
> Eugen
> 
> 
> [1] https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg56994.html
> [2] https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg56890.html
> 
> Zitat von Bryan Stillwell <bstillwell@xxxxxxxxxxx>:
> 
>> On multiple clusters we are seeing the mgr hang frequently when the
>> balancer is enabled.  The balancer appears to get caught in some kind
>> of infinite loop that chews up all of the mgr's CPU, which in turn
>> causes problems for other modules like prometheus (we don't have the
>> devicehealth module enabled yet).
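>> 
>> To confirm it's the balancer and to get the mgr responsive again,
>> something along these lines can be used (a rough sketch, not exact
>> output):
>> 
>>   # show whether the balancer is active and which mode it is in
>>   ceph balancer status
>>   # turn it off so the active mgr recovers
>>   ceph balancer off
>>   # watch the active mgr process; while the loop runs it sits near 100% CPU
>>   top -p $(pgrep -d, -f ceph-mgr)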
>> 
>> I've also been able to reproduce the issue by doing an offline
>> balance with osdmaptool:
>> 
>> osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh
>> --upmap-pool default.rgw.buckets.data --upmap-max 100
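>> 
>> For anyone who wants to try the same thing, the full offline
>> reproduction is roughly as follows (the pool name and --upmap-max are
>> of course specific to our setup):
>> 
>>   # grab the current osdmap from the cluster
>>   ceph osd getmap -o osd.map
>>   # same osdmaptool invocation as above; proposed upmaps would be
>>   # written to balance-upmaps.sh
>>   osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh \
>>       --upmap-pool default.rgw.buckets.data --upmap-max 100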
>> 
>> It seems to loop over the same group of ~7,000 PGs over and over
>> again like this, without finding any new upmaps that can be added:
>> 
>> 2019-11-19 16:39:11.131518 7f85a156f300 10  trying 24.d91
>> 2019-11-19 16:39:11.138035 7f85a156f300 10  trying 24.2e3c
>> 2019-11-19 16:39:11.144162 7f85a156f300 10  trying 24.176b
>> 2019-11-19 16:39:11.149671 7f85a156f300 10  trying 24.ac6
>> 2019-11-19 16:39:11.155115 7f85a156f300 10  trying 24.2cb2
>> 2019-11-19 16:39:11.160508 7f85a156f300 10  trying 24.129c
>> 2019-11-19 16:39:11.166287 7f85a156f300 10  trying 24.181f
>> 2019-11-19 16:39:11.171737 7f85a156f300 10  trying 24.3cb1
>> 2019-11-19 16:39:11.177260 7f85a156f300 10  24.2177 already has
>> pg_upmap_items [368,271]
>> 2019-11-19 16:39:11.177268 7f85a156f300 10  trying 24.2177
>> 2019-11-19 16:39:11.182590 7f85a156f300 10  trying 24.a4
>> 2019-11-19 16:39:11.188053 7f85a156f300 10  trying 24.2583
>> 2019-11-19 16:39:11.193545 7f85a156f300 10  24.93e already has
>> pg_upmap_items [80,27]
>> 2019-11-19 16:39:11.193553 7f85a156f300 10  trying 24.93e
>> 2019-11-19 16:39:11.198858 7f85a156f300 10  trying 24.e67
>> 2019-11-19 16:39:11.204224 7f85a156f300 10  trying 24.16d9
>> 2019-11-19 16:39:11.209844 7f85a156f300 10  trying 24.11dc
>> 2019-11-19 16:39:11.215303 7f85a156f300 10  trying 24.1f3d
>> 2019-11-19 16:39:11.221074 7f85a156f300 10  trying 24.2a57
>> 
>> 
>> While this cluster is running Luminous (12.2.12), I've reproduced
>> the loop using the same osdmap on Nautilus (14.2.4).  Is there
>> somewhere I can privately upload the osdmap for someone to
>> troubleshoot the problem?
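>> 
>> (One option might be ceph-post-file, which uploads to a drop box that
>> only the Ceph developers can read and prints a tag to reference
>> afterwards, assuming it's usable from this cluster:)
>> 
>>   # upload the osdmap privately for the developers
>>   ceph-post-file -d "mgr upmap balancer loop osdmap" osd.map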
>> 
>> Thanks,
>> Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


