Please read this [1], all about Balancer with upmap mode.

On a medium sized cluster with device-classes, I am experiencing a problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL   %USE  VAR  PGS
 2 ssd   0.43700 1.00000  447GiB 254GiB 193GiB  56.77 1.28  50
 3 ssd   0.43700 1.00000  447GiB 208GiB 240GiB  46.41 1.04  58
 4 ssd   0.43700 1.00000  447GiB 266GiB 181GiB  59.44 1.34  55
30 ssd   0.43660 1.00000  447GiB 222GiB 225GiB  49.68 1.12  49
 6 ssd   0.43700 1.00000  447GiB 238GiB 209GiB  53.28 1.20  59
 7 ssd   0.43700 1.00000  447GiB 228GiB 220GiB  50.88 1.14  56
 8 ssd   0.43700 1.00000  447GiB 269GiB 178GiB  60.16 1.35  57
31 ssd   0.43660 1.00000  447GiB 231GiB 217GiB  51.58 1.16  56
34 ssd   0.43660 1.00000  447GiB 186GiB 261GiB  41.65 0.94  49
36 ssd   0.87329 1.00000  894GiB 364GiB 530GiB  40.68 0.92  91
37 ssd   0.87329 1.00000  894GiB 321GiB 573GiB  35.95 0.81  78
42 ssd   0.87329 1.00000  894GiB 375GiB 519GiB  41.91 0.94  92
43 ssd   0.87329 1.00000  894GiB 438GiB 456GiB  49.00 1.10  92
13 ssd   0.43700 1.00000  447GiB 249GiB 198GiB  55.78 1.25  72
14 ssd   0.43700 1.00000  447GiB 290GiB 158GiB  64.76 1.46  71
15 ssd   0.43700 1.00000  447GiB 368GiB 78.6GiB 82.41 1.85  78 <----
16 ssd   0.43700 1.00000  447GiB 253GiB 194GiB  56.66 1.27  70
19 ssd   0.43700 1.00000  447GiB 269GiB 178GiB  60.21 1.35  70
20 ssd   0.43700 1.00000  447GiB 312GiB 135GiB  69.81 1.57  77
21 ssd   0.43700 1.00000  447GiB 312GiB 135GiB  69.77 1.57  77
22 ssd   0.43700 1.00000  447GiB 269GiB 178GiB  60.10 1.35  67
38 ssd   0.43660 1.00000  447GiB 153GiB 295GiB  34.11 0.77  46
39 ssd   0.43660 1.00000  447GiB 127GiB 320GiB  28.37 0.64  38
40 ssd   0.87329 1.00000  894GiB 386GiB 508GiB  43.17 0.97  97
41 ssd   0.87329 1.00000  894GiB 375GiB 520GiB  41.88 0.94 113

This leads to just 1.2TB free space (some GBs away from NEAR_FULL pool). Currently, the balancer plugin is off because it immediately crashed the MGR in the past (on 12.2.5). Since then I upgraded to 12.2.8 but did not re-enable the balancer.
[I am unable to find the bugtracker ID]

Would the balancer plugin correct this situation? What happens if all MGRs die like they did on 12.2.5 because of the plugin? Will the balancer take data from the most-unbalanced OSDs first? Otherwise an OSD may fill up beyond FULL, which would cause the whole pool to freeze (because the smallest OSD is taken into account for free space calculation). This would be the worst case, as over 100 VMs would freeze, causing a lot of trouble. This is also the reason I did not try to enable the balancer again.

It's stable from 12.2.8 with upmap mode.
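
For anyone who wants to try upmap without handing control to the automatic balancer right away, here is a minimal sketch of a cautious, manual sequence on Luminous. The helper names (`run_ceph`, `fullest_ssd_osds`, `plan_upmap_balance`), the `CEPH` override variable and the plan name `myplan` are made up for illustration; the `ceph` subcommands themselves are the standard balancer ones. The `awk` column positions assume the `ceph osd df` layout shown above.

```shell
#!/bin/sh
# Hypothetical wrapper: set CEPH=echo for a dry run that only prints the commands.
run_ceph() {
    "${CEPH:-ceph}" "$@"
}

# Reads `ceph osd df` rows on stdin and prints the N fullest ssd OSDs.
# Assumes column 2 is CLASS and column 8 is %USE, as in the listing above.
fullest_ssd_osds() {
    awk '$2 == "ssd" { print $8, "osd." $1 }' | sort -rn | head -n "${1:-5}"
}

# Builds and scores a balancer plan but does NOT execute it.
plan_upmap_balance() {
    plan="${1:-myplan}"
    # Informational: shows which releases the connected clients report.
    run_ceph features
    # upmap needs Luminous-capable clients; this is refused if older
    # clients are connected, which is a useful safety check.
    run_ceph osd set-require-min-compat-client luminous || return 1
    run_ceph mgr module enable balancer || return 1
    run_ceph balancer mode upmap || return 1
    # Score of the current distribution (lower is better).
    run_ceph balancer eval || return 1
    # Compute a plan, print the upmap changes it would make, and score it.
    run_ceph balancer optimize "$plan" || return 1
    run_ceph balancer show "$plan" || return 1
    run_ceph balancer eval "$plan"
}
```

Usage could look like `ceph osd df | fullest_ssd_osds 3` to see the worst OSDs, then `plan_upmap_balance myplan` to inspect a plan. If the plan looks sane, `ceph balancer execute myplan` applies it once; `ceph balancer on` is only needed for continuous automatic balancing.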
k

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/032002.html