Hello, I'd like to ask a few rebalancing-related questions. On one of my clusters, I got a nearfull warning for one of the OSDs. Apart from that, the cluster health was perfectly OK, all PGs active+clean. I therefore ran reweight-by-utilization, which changed the weights a bit and left about 30% of the data misplaced. Recovery then started, but it didn't bring the cluster back to a clean state: some PGs ended up remapped and, even worse, some were left undersized. Even setting the weights back to their pre-rebalance values didn't help.

I'd like to ask more experienced users:

1) When a cluster has evenly distributed OSDs and weights, is it normal for one OSD to suddenly end up much fuller than the others?
2) Why does reweighting lead to undersized PGs? Isn't this a bug that creates an unnecessary risk of data loss?
3) Why does changing weights by only a small amount lead to such big data transfers? I changed the weight of only one OSD (out of 15), and only by a small value, and it left about 30% of the placement groups misplaced. Is this OK?
4) After some experiments, I also got a few PGs stuck in stale+active+clean or creating state. How do I get rid of those?
5) Last but not least, how can I help my cluster get back to a clean state?
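Trying to understand question 3 myself, I played with a toy model of weighted placement. This is only a rough straw2-like draw, not real CRUSH (no hierarchy, single replica, SHA-256 instead of rjenkins, and the OSD/PG counts just mirror my cluster), so treat it as a sketch under those assumptions:

```python
import hashlib
import math

def straw2_choose(pg_id, weights):
    """Pick one OSD per PG: each OSD draws a 'straw' scaled by its
    weight and the longest straw wins (a straw2-like draw, not real
    CRUSH -- flat bucket, no replicas, made-up hash input)."""
    best_osd, best_straw = None, float("-inf")
    for osd, w in weights.items():
        h = int(hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest(), 16)
        u = (h % 65536 + 1) / 65536.0      # pseudo-random uniform in (0, 1]
        straw = math.log(u) / w            # bigger weight -> longer straw
        if straw > best_straw:
            best_osd, best_straw = osd, straw
    return best_osd

weights = {i: 1.0 for i in range(15)}      # 15 equally weighted OSDs
before = [straw2_choose(pg, weights) for pg in range(1856)]

weights[0] = 0.9                           # nudge one OSD's weight down
after = [straw2_choose(pg, weights) for pg in range(1856)]

moved = [pg for pg in range(1856) if before[pg] != after[pg]]
print(f"{len(moved)} of 1856 placements changed")
print("all movers came off the reweighted OSD:",
      all(before[pg] == 0 for pg in moved))
```

In this toy model, lowering one weight only shrinks that OSD's own straws, so the only PGs that move are some of those mapped to it; nothing shuffles between the other 14 OSDs. If real straw2 behaves similarly, the ~30% misplacement I saw would come from reweight-by-utilization touching many override weights at once (as my df tree below suggests) rather than from the size of any single change, but I'd appreciate confirmation from someone who knows CRUSH better.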
Here's df tree:

[root@remrprv1c ceph]# ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-8 13.54486        - 13534G  8018G  5515G 59.24 1.63 root ssd
-2  4.51495        -  4511G  2869G  1641G 63.61 1.75     host remrprv1a-ssd
 0  0.85999  0.38860   859G   594G   264G 69.22 1.90         osd.0
 1  0.85999  0.33694   859G   557G   301G 64.89 1.78         osd.1
 2  0.92999  0.44678   929G   617G   312G 66.43 1.82         osd.2
 3  0.92999  0.32753   929G   580G   348G 62.46 1.71         osd.3
 4  0.93500  0.31308   934G   519G   414G 55.60 1.53         osd.4
-3  4.51495        -  4511G  2595G  1915G 57.54 1.58     host remrprv1b-ssd
 5  0.85999  0.31793   859G   456G   402G 53.16 1.46         osd.5
 6  0.85999  0.40715   859G   502G   356G 58.47 1.60         osd.6
 7  0.92999  0.38741   929G   500G   428G 53.87 1.48         osd.7
 8  0.92999  0.38803   929G   607G   322G 65.30 1.79         osd.8
 9  0.93500  0.36951   934G   529G   405G 56.64 1.55         osd.9
-4  4.51495        -  4511G  2552G  1958G 56.59 1.55     host remrprv1c-ssd
10  0.85999  0.34116   859G   456G   402G 53.11 1.46         osd.10
11  0.85999  0.38770   859G   488G   370G 56.88 1.56         osd.11
12  0.92999  0.41499   929G   556G   372G 59.90 1.64         osd.12
13  0.92999  0.35764   929G   534G   394G 57.53 1.58         osd.13
14  0.93500  0.38669   934G   516G   417G 55.29 1.52         osd.14
-1 21.59995        - 22004G  4929G 17074G 22.40 0.61 root sata
-7  7.19998        -  7334G  1644G  5690G 22.42 0.62     host remrprv1c-sata
19  3.59999  1.00000  3667G   819G  2848G 22.33 0.61         osd.19
20  3.59999  1.00000  3667G   825G  2841G 22.51 0.62         osd.20
-6  7.19998        -  7334G  1642G  5691G 22.40 0.61     host remrprv1b-sata
17  3.59999  1.00000  3667G   806G  2860G 21.99 0.60         osd.17
18  3.59999  1.00000  3667G   836G  2831G 22.80 0.63         osd.18
-5  7.19998        -  7334G  1642G  5692G 22.39 0.61     host remrprv1a-sata
15  3.59999  1.00000  3667G   853G  2813G 23.28 0.64         osd.15
16  3.59999  1.00000  3667G   788G  2879G 21.49 0.59         osd.16
              TOTAL 35538G 12948G 22590G 36.43
MIN/MAX VAR: 0.59/1.90  STDDEV: 19.22

Here's ceph -s:

[root@remrprv1c ceph]# ceph -s
    cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
     health HEALTH_WARN
            3 pgs degraded
            2 pgs stale
            3 pgs stuck degraded
            1 pgs stuck inactive
            2 pgs stuck stale
            242 pgs stuck unclean
            3 pgs stuck undersized
            3 pgs undersized
            recovery 75/3374541 objects degraded (0.002%)
            recovery 186194/3374541 objects misplaced (5.518%)
            mds0: Behind on trimming (155/30)
     monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
            election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
     mdsmap e347: 1/1/1 up {0=remrprv1a=up:active}, 2 up:standby
     osdmap e4423: 21 osds: 21 up, 21 in; 238 remapped pgs
      pgmap v18686541: 1856 pgs, 7 pools, 4224 GB data, 1103 kobjects
            12948 GB used, 22590 GB / 35538 GB avail
            75/3374541 objects degraded (0.002%)
            186194/3374541 objects misplaced (5.518%)
                1612 active+clean
                 238 active+remapped
                   3 active+undersized+degraded
                   2 stale+active+clean
                   1 creating
  client io 14830 B/s rd, 269 kB/s wr, 94 op/s

I'd be very grateful for any help with those.

with best regards
nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com