Philippe,
Maybe you can try the "crush-compat" balancer mode instead of
"upmap" until the new code is released.
David
On 12/11/19 9:36 PM, Philippe D'Anjou wrote:
Hi,
I see your code balanced my ssdpool to about 146 PGs each. I can confirm
this did NOT happen on my live cluster.
The state it ended up in is:
 ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL     %USE   VAR   PGS  STATUS
  0  ssd    3.49219  1.00000   3.5 TiB  797 GiB  757 GiB  36 GiB  3.9 GiB  2.7 TiB   22.30  0.31  147  up
  1  ssd    3.49219  1.00000   3.5 TiB  803 GiB  751 GiB  49 GiB  3.7 GiB  2.7 TiB   22.47  0.31  146  up
  2  ssd    3.49219  1.00000   3.5 TiB  818 GiB  764 GiB  51 GiB  3.8 GiB  2.7 TiB   22.89  0.32  150  up
  3  ssd    3.49219  1.00000   3.5 TiB  794 GiB  757 GiB  34 GiB  3.2 GiB  2.7 TiB   22.21  0.31  146  up
  4  ssd    3.49219  1.00000   3.5 TiB  837 GiB  798 GiB  34 GiB  4.4 GiB  2.7 TiB   23.39  0.32  156  up
  6  ssd    3.49219  1.00000   3.5 TiB  790 GiB  751 GiB  35 GiB  3.6 GiB  2.7 TiB   22.09  0.31  146  up
  8  ssd    3.49219  1.00000   3.5 TiB  874 GiB  831 GiB  40 GiB  3.5 GiB  2.6 TiB   24.44  0.34  156  up
 10  ssd    3.49219  1.00000   3.5 TiB  807 GiB  761 GiB  43 GiB  3.4 GiB  2.7 TiB   22.58  0.31  146  up
  5  ssd    3.49219  1.00000   3.5 TiB  744 GiB  708 GiB  32 GiB  4.2 GiB  2.8 TiB   20.81  0.29  141  up
  7  ssd    3.49219  1.00000   3.5 TiB  732 GiB  690 GiB  39 GiB  3.2 GiB  2.8 TiB   20.48  0.28  136  up
  9  ssd    3.49219  1.00000   3.5 TiB  702 GiB  657 GiB  42 GiB  3.9 GiB  2.8 TiB   19.64  0.27  131  up
 11  ssd    3.49219  1.00000   3.5 TiB  805 GiB  781 GiB  22 GiB  2.3 GiB  2.7 TiB   22.50  0.31  138  up
101  ssd    3.49219  1.00000   3.5 TiB  835 GiB  793 GiB  38 GiB  3.7 GiB  2.7 TiB   23.36  0.32  146  up
103  ssd    3.49219  1.00000   3.5 TiB  846 GiB  803 GiB  40 GiB  3.3 GiB  2.7 TiB   23.67  0.33  150  up
105  ssd    3.49219  1.00000   3.5 TiB  800 GiB  762 GiB  36 GiB  2.5 GiB  2.7 TiB   22.38  0.31  148  up
107  ssd    3.49219  1.00000   3.5 TiB  843 GiB  790 GiB  49 GiB  3.4 GiB  2.7 TiB   23.58  0.33  147  up
100  ssd    3.49219  1.00000   3.5 TiB  804 GiB  753 GiB  48 GiB  2.6 GiB  2.7 TiB   22.47  0.31  144  up
102  ssd    3.49219  1.00000   3.5 TiB  752 GiB  737 GiB  13 GiB  2.4 GiB  2.8 TiB   21.02  0.29  141  up
104  ssd    3.49219  1.00000   3.5 TiB  805 GiB  771 GiB  31 GiB  2.8 GiB  2.7 TiB   22.50  0.31  144  up
106  ssd    3.49219  1.00000   3.5 TiB  793 GiB  724 GiB  66 GiB  2.9 GiB  2.7 TiB   22.17  0.31  143  up
108  ssd    3.49219  1.00000   3.5 TiB  816 GiB  778 GiB  36 GiB  2.7 GiB  2.7 TiB   22.83  0.32  156  up
109  ssd    3.49219  1.00000   3.5 TiB  811 GiB  763 GiB  45 GiB  2.8 GiB  2.7 TiB   22.68  0.31  146  up
110  ssd    3.49219  1.00000   3.5 TiB  863 GiB  832 GiB  28 GiB  2.5 GiB  2.6 TiB   24.13  0.33  154  up
111  ssd    3.49219  1.00000   3.5 TiB  784 GiB  737 GiB  45 GiB  2.7 GiB  2.7 TiB   21.92  0.30  146  up
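For what it's worth, this is roughly how I'm checking the spread (a quick awk sketch over "ceph osd df"; it assumes PGS is the second-to-last column and STATUS the last, so adjust if your output differs):

    ceph osd df | awk '$2 == "ssd" && $NF == "up" {
        pgs = $(NF-1)                           # PGS column
        if (min == "" || pgs < min) min = pgs
        if (pgs > max) max = pgs
        sum += pgs; n++
    } END { printf "ssd osds=%d min=%d max=%d avg=%.1f PGs\n", n, min, max, sum / n }'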
It did not try to balance any further. Someone else reported the same issue.
I am pretty sure it will also not balance out the HDDs as neatly as you
got them in your test. There is definitely an issue somewhere; so far
three people are telling the same story. I never had this issue under
Luminous, but I have been fighting it for four months on two clusters.
One was upgraded to Nautilus and the other (the one these pastes are
from) is a fresh 14.2.4 install.
Any ideas on that?
Thanks
On Thursday, 12 December 2019 at 02:09:33 EET, David Zafman
<dzafman@xxxxxxxxxx> wrote:
Philippe,
I have a master branch version of the code to test. The Nautilus
backport https://github.com/ceph/ceph/pull/31956 should be the same.
Using your OSDMap, the code in the master branch, and some additional changes
to osdmaptool, I was able to balance your cluster. The osdmaptool
changes simulate the mgr's active balancer behavior. It never took more
than 0.13991 seconds to calculate more upmaps per round, and that's
on a virtual machine used for development. It took 35 rounds with a
maximum of 10 upmaps per round for each crush rule's set of pools. With
the default 1-minute sleep inside the mgr it would take 35 minutes.
Obviously, recovery/backfill has to finish before the cluster settles
into the new configuration. It needed 397 additional upmaps and removed 8.
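For reference, one round of that kind of offline run looks roughly like this (a sketch only; check osdmaptool --help for the exact flags your build supports):

    ceph osd getmap -o om                    # grab the current OSDMap
    osdmaptool om --upmap upmaps.txt \
        --upmap-max 10 --upmap-deviation 1   # compute at most 10 upmaps for this round
    cat upmaps.txt                           # review the generated "ceph osd pg-upmap-items ..." commands

Applying the file (for example with "source upmaps.txt") and repeating the above approximates the round-by-round behavior of the mgr balancer.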
Because all pools for a given crush rule are balanced together, you can
see that this is more balanced than Rich's configuration using Luminous.
This balancer code is subject to change before the next Nautilus point
release is finalized.
Final layout:
osd.0 pgs 146
osd.1 pgs 146
osd.2 pgs 146
osd.3 pgs 146
osd.4 pgs 146
osd.5 pgs 146
osd.6 pgs 146
osd.7 pgs 146
osd.8 pgs 146
osd.9 pgs 146
osd.10 pgs 146
osd.11 pgs 146
osd.12 pgs 74
osd.13 pgs 74
osd.14 pgs 73
osd.15 pgs 74
osd.16 pgs 74
osd.17 pgs 74
osd.18 pgs 73
osd.19 pgs 74
osd.20 pgs 73
osd.21 pgs 73
osd.22 pgs 74
osd.23 pgs 73
osd.24 pgs 73
osd.25 pgs 75
osd.26 pgs 74
osd.27 pgs 74
osd.28 pgs 73
osd.29 pgs 73
osd.30 pgs 73
osd.31 pgs 73
osd.32 pgs 74
osd.33 pgs 73
osd.34 pgs 73
osd.35 pgs 74
osd.36 pgs 74
osd.37 pgs 74
osd.38 pgs 74
osd.39 pgs 74
osd.40 pgs 73
osd.41 pgs 73
osd.42 pgs 73
osd.43 pgs 73
osd.44 pgs 74
osd.45 pgs 73
osd.46 pgs 73
osd.47 pgs 73
osd.48 pgs 73
osd.49 pgs 73
osd.50 pgs 73
osd.51 pgs 73
osd.52 pgs 75
osd.53 pgs 59
osd.54 pgs 74
osd.55 pgs 74
osd.56 pgs 74
osd.57 pgs 73
osd.58 pgs 74
osd.59 pgs 74
osd.60 pgs 74
osd.61 pgs 74
osd.62 pgs 73
osd.63 pgs 74
osd.64 pgs 73
osd.65 pgs 74
osd.66 pgs 74
osd.67 pgs 74
osd.68 pgs 73
osd.69 pgs 74
osd.70 pgs 73
osd.71 pgs 73
osd.72 pgs 73
osd.73 pgs 73
osd.74 pgs 73
osd.75 pgs 73
osd.76 pgs 73
osd.77 pgs 73
osd.78 pgs 73
osd.79 pgs 73
osd.80 pgs 73
osd.81 pgs 73
osd.82 pgs 73
osd.83 pgs 73
osd.84 pgs 73
osd.85 pgs 73
osd.86 pgs 73
osd.87 pgs 73
osd.88 pgs 73
osd.89 pgs 73
osd.90 pgs 73
osd.91 pgs 73
osd.92 pgs 73
osd.93 pgs 73
osd.94 pgs 73
osd.95 pgs 73
osd.96 pgs 73
osd.97 pgs 73
osd.98 pgs 73
osd.99 pgs 73
osd.100 pgs 146
osd.101 pgs 146
osd.102 pgs 146
osd.103 pgs 146
osd.104 pgs 146
osd.105 pgs 146
osd.106 pgs 146
osd.107 pgs 146
osd.108 pgs 146
osd.109 pgs 146
osd.110 pgs 146
osd.111 pgs 146
osd.112 pgs 73
osd.113 pgs 73
osd.114 pgs 73
osd.115 pgs 73
osd.116 pgs 73
osd.117 pgs 73
osd.118 pgs 73
osd.119 pgs 73
osd.120 pgs 73
osd.121 pgs 73
osd.122 pgs 73
osd.123 pgs 73
osd.124 pgs 73
osd.125 pgs 73
osd.126 pgs 73
osd.127 pgs 74
osd.128 pgs 73
osd.129 pgs 73
osd.130 pgs 73
osd.131 pgs 73
osd.132 pgs 73
osd.133 pgs 73
osd.134 pgs 73
osd.135 pgs 73
David
On 12/10/19 9:59 PM, Philippe D'Anjou wrote:
> Given I was told it's an issue of too few PGs, I am raising them and testing
> this, although my SSDs, which have about 150 PGs each, are also not well
> distributed.
> I attached my OSDMap; I'd appreciate it if you could run your test on it
> like you did with the other guy, so I know whether this will ever
> distribute equally or not.
>
> If you're too busy, I understand that too; just ignore this.
>
> Thanks in either case. I have just been dealing with this for months
> now and it is getting frustrating.
>
> Best regards
>
> On Tuesday, 10 December 2019 at 03:53:17 EET, David Zafman
> <dzafman@xxxxxxxxxx> wrote:
>
>
>
> Please file a tracker with the symptom and examples. Please attach your
> OSDMap (ceph osd getmap > osdmap.bin).
>
> Note that https://github.com/ceph/ceph/pull/31956 has the Nautilus
> version of improved upmap code. It also changes osdmaptool to match the
> mgr behavior, so that one can observe the behavior of the upmap balancer
> offline.
>
> Thanks
>
> David
>
> On 12/8/19 11:04 AM, Philippe D'Anjou wrote:
> > It's only getting worse after raising PGs now.
> >
> > Anything between:
> > 96  hdd  9.09470  1.00000  9.1 TiB  4.9 TiB  4.9 TiB  97 KiB  13 GiB  4.2 TiB   53.62  0.76  54  up
> >
> > and
> >
> > 89  hdd  9.09470  1.00000  9.1 TiB  8.1 TiB  8.1 TiB  88 KiB  21 GiB  1001 GiB  89.25  1.27  87  up
> >
> > How is that possible? I don't know how much more proof I need to
> > present that there's a bug.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx