Hi! I did what you wrote, but my MGRs started to crash again:

root@adminnode:~# ceph -s
  cluster:
    id:     086d9f80-6249-4594-92d0-e31b6aaaaa9c
    health: HEALTH_WARN
            no active mgr
            105498/6277782 objects misplaced (1.680%)

  services:
    mon: 3 daemons, quorum mon01,mon02,mon03
    mgr: no daemons active
    osd: 44 osds: 43 up, 43 in

  data:
    pools:   4 pools, 1616 pgs
    objects: 1.88M objects, 7.07TiB
    usage:   13.2TiB used, 16.7TiB / 29.9TiB avail
    pgs:     105498/6277782 objects misplaced (1.680%)
             1606 active+clean
             8    active+remapped+backfill_wait
             2    active+remapped+backfilling

  io:
    client:   5.51MiB/s rd, 3.38MiB/s wr, 33op/s rd, 317op/s wr
    recovery: 60.3MiB/s, 15objects/s

MGR log on mon01:

   -13> 2019-01-04 14:05:04.432186 7fec56a93700  4 mgr ms_dispatch active mgrdigest v1
   -12> 2019-01-04 14:05:04.432194 7fec56a93700  4 mgr ms_dispatch mgrdigest v1
   -11> 2019-01-04 14:05:04.822041 7fec434e1700  4 mgr[balancer] Optimize plan auto_2019-01-04_14:05:04
   -10> 2019-01-04 14:05:04.822170 7fec434e1700  4 mgr get_config get_configkey: mgr/balancer/mode
    -9> 2019-01-04 14:05:04.822231 7fec434e1700  4 mgr get_config get_configkey: mgr/balancer/max_misplaced
    -8> 2019-01-04 14:05:04.822268 7fec434e1700  4 ceph_config_get max_misplaced not found
    -7> 2019-01-04 14:05:04.822444 7fec434e1700  4 mgr[balancer] Mode upmap, max misplaced 0.050000
    -6> 2019-01-04 14:05:04.822849 7fec434e1700  4 mgr[balancer] do_upmap
    -5> 2019-01-04 14:05:04.822923 7fec434e1700  4 mgr get_config get_configkey: mgr/balancer/upmap_max_iterations
    -4> 2019-01-04 14:05:04.822964 7fec434e1700  4 ceph_config_get upmap_max_iterations not found
    -3> 2019-01-04 14:05:04.823013 7fec434e1700  4 mgr get_config get_configkey: mgr/balancer/upmap_max_deviation
    -2> 2019-01-04 14:05:04.823048 7fec434e1700  4 ceph_config_get upmap_max_deviation not found
    -1> 2019-01-04 14:05:04.823265 7fec434e1700  4 mgr[balancer] pools ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec']
     0> 2019-01-04 14:05:04.836124 7fec434e1700 -1 /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04 14:05:04.832885
/build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558c3c0bb572]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
 3: (()+0x2f3020) [0x558c3bf5d020]
 4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
 5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
 7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 10: (()+0x13e370) [0x7fec5e8be370]
 11: (PyObject_Call()+0x43) [0x7fec5e891273]
 12: (()+0x1853ac) [0x7fec5e9053ac]
 13: (PyObject_Call()+0x43) [0x7fec5e891273]
 14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
 15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
 16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
 17: (()+0x76ba) [0x7fec5d74c6ba]
 18: (clone()+0x6d) [0x7fec5c7b841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log
--- end dump of recent events ---

2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) **
 in thread 7fec434e1700 thread_name:balancer

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (()+0x4105b4) [0x558c3c07a5b4]
 2: (()+0x11390) [0x7fec5d756390]
 3: (gsignal()+0x38) [0x7fec5c6e6428]
 4: (abort()+0x16a) [0x7fec5c6e802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558c3c0bb6fe]
 6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
 7: (()+0x2f3020) [0x558c3bf5d020]
 8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
 11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 14: (()+0x13e370) [0x7fec5e8be370]
 15: (PyObject_Call()+0x43) [0x7fec5e891273]
 16: (()+0x1853ac) [0x7fec5e9053ac]
 17: (PyObject_Call()+0x43) [0x7fec5e891273]
 18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
 19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
 21: (()+0x76ba) [0x7fec5d74c6ba]
 22: (clone()+0x6d) [0x7fec5c7b841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[The subsequent "dump of recent events" repeats the backtrace and logging levels above verbatim and is omitted here.]

Kevin
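For anyone hitting the same loop: one way to get an active mgr back is to keep the balancer module from running its next optimization pass, since the trace above shows the abort happening in the balancer thread. A minimal sketch using the standard Luminous CLI; the systemd target name assumes a stock packaged install:

# The balancer module is what calls OSDMap::calc_pg_upmaps, so take it out
# of the mgr until the assert is understood. "mgr module disable" is handled
# by the monitors, so it works even while no mgr is active.
ceph mgr module disable balancer

# Restart the crashed manager daemons (run on each mgr host).
systemctl restart ceph-mgr.target

# Confirm an active mgr is back and inspect what the balancer had stored.
ceph -s
ceph config-key dump | grep mgr/balancer

# Later, once a mgr is stable, the module can be re-enabled with automatic
# balancing switched off explicitly while experimenting:
ceph mgr module enable balancer
ceph balancer off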
On Wed, 2 Jan 2019 at 17:35, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>
> On a medium-sized cluster with device classes, I am experiencing a
> problem with the SSD pool:
>
> root@adminnode:~# ceph osd df | grep ssd
> ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL   %USE  VAR  PGS
>  2   ssd 0.43700  1.00000 447GiB 254GiB  193GiB 56.77 1.28  50
>  3   ssd 0.43700  1.00000 447GiB 208GiB  240GiB 46.41 1.04  58
>  4   ssd 0.43700  1.00000 447GiB 266GiB  181GiB 59.44 1.34  55
> 30   ssd 0.43660  1.00000 447GiB 222GiB  225GiB 49.68 1.12  49
>  6   ssd 0.43700  1.00000 447GiB 238GiB  209GiB 53.28 1.20  59
>  7   ssd 0.43700  1.00000 447GiB 228GiB  220GiB 50.88 1.14  56
>  8   ssd 0.43700  1.00000 447GiB 269GiB  178GiB 60.16 1.35  57
> 31   ssd 0.43660  1.00000 447GiB 231GiB  217GiB 51.58 1.16  56
> 34   ssd 0.43660  1.00000 447GiB 186GiB  261GiB 41.65 0.94  49
> 36   ssd 0.87329  1.00000 894GiB 364GiB  530GiB 40.68 0.92  91
> 37   ssd 0.87329  1.00000 894GiB 321GiB  573GiB 35.95 0.81  78
> 42   ssd 0.87329  1.00000 894GiB 375GiB  519GiB 41.91 0.94  92
> 43   ssd 0.87329  1.00000 894GiB 438GiB  456GiB 49.00 1.10  92
> 13   ssd 0.43700  1.00000 447GiB 249GiB  198GiB 55.78 1.25  72
> 14   ssd 0.43700  1.00000 447GiB 290GiB  158GiB 64.76 1.46  71
> 15   ssd 0.43700  1.00000 447GiB 368GiB 78.6GiB 82.41 1.85  78  <----
> 16   ssd 0.43700  1.00000 447GiB 253GiB  194GiB 56.66 1.27  70
> 19   ssd 0.43700  1.00000 447GiB 269GiB  178GiB 60.21 1.35  70
> 20   ssd 0.43700  1.00000 447GiB 312GiB  135GiB 69.81 1.57  77
> 21   ssd 0.43700  1.00000 447GiB 312GiB  135GiB 69.77 1.57  77
> 22   ssd 0.43700  1.00000 447GiB 269GiB  178GiB 60.10 1.35  67
> 38   ssd 0.43660  1.00000 447GiB 153GiB  295GiB 34.11 0.77  46
> 39   ssd 0.43660  1.00000 447GiB 127GiB  320GiB 28.37 0.64  38
> 40   ssd 0.87329  1.00000 894GiB 386GiB  508GiB 43.17 0.97  97
> 41   ssd 0.87329  1.00000 894GiB 375GiB  520GiB 41.88 0.94 113
>
> This leaves just 1.2TB of free space (only a few GB away from the pool's
> NEAR_FULL threshold).
> Currently, the balancer plugin is off because it immediately crashed
> the MGR in the past (on 12.2.5).
> Since then I upgraded to 12.2.8 but did not re-enable the balancer.
> [I am unable to find the bug tracker ID.]
>
> Would the balancer plugin correct this situation?
> What happens if all MGRs die like they did on 12.2.5 because of the plugin?
> Will the balancer take data from the most-unbalanced OSDs first?
> Otherwise an OSD may fill up beyond FULL, which would cause the whole
> pool to freeze (because the smallest OSD is taken into account for the
> free-space calculation).
> This would be the worst case, as over 100 VMs would freeze, causing a
> lot of trouble. This is also the reason I did not try to enable the
> balancer again.
>
> Please read this [1], all about the balancer with upmap mode.
>
> It is stable from 12.2.8 onwards with upmap mode.
>
> k
>
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/032002.html
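Regarding re-enabling upmap mode afterwards: the same optimization the balancer runs can be tried offline first against a copy of the osdmap, so a map that trips the assert aborts a command-line tool instead of every mgr. A rough sketch, assuming an osdmaptool build that has the --upmap options described in the Luminous upmap documentation; the pool name below is just one of the pools from the mgr log above, substitute your own:

# upmap entries require all clients to speak Luminous or newer.
ceph features
ceph osd set-require-min-compat-client luminous

# Grab the current osdmap and run the upmap optimizer offline. This exercises
# the same OSDMap::calc_pg_upmaps path the balancer crashed in, so a
# problematic map shows up here rather than taking down ceph-mgr.
ceph osd getmap -o om
osdmaptool om --upmap upmaps.sh --upmap-pool rbd_vms_ssd_01

# Only if that completes cleanly, point the balancer at upmap mode again.
ceph balancer mode upmap
ceph balancer on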