Hi, colleagues!

I have a small 4-node Ceph cluster (0.94.2); all pools have size 3 and min_size 1. Last night one host failed, and the cluster was unable to rebalance, complaining about a lot of undersized PGs:

root@slpeah002:[~]:# ceph -s
    cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
     health HEALTH_WARN
            1486 pgs degraded
            1486 pgs stuck degraded
            2257 pgs stuck unclean
            1486 pgs stuck undersized
            1486 pgs undersized
            recovery 80429/555185 objects degraded (14.487%)
            recovery 40079/555185 objects misplaced (7.219%)
            4/20 in osds are down
            1 mons down, quorum 1,2 slpeah002,slpeah007
     monmap e7: 3 mons at {slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
            election epoch 710, quorum 1,2 slpeah002,slpeah007
     osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
      pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
            3366 GB used, 93471 GB / 96838 GB avail
            80429/555185 objects degraded (14.487%)
            40079/555185 objects misplaced (7.219%)
                1903 active+clean
                1486 active+undersized+degraded
                 771 active+remapped
  client io 0 B/s rd, 246 kB/s wr, 67 op/s

root@slpeah002:[~]:# ceph osd tree
ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 94.63998 root default
 -9 32.75999     host slpeah007
 72  5.45999         osd.72          up  1.00000          1.00000
 73  5.45999         osd.73          up  1.00000          1.00000
 74  5.45999         osd.74          up  1.00000          1.00000
 75  5.45999         osd.75          up  1.00000          1.00000
 76  5.45999         osd.76          up  1.00000          1.00000
 77  5.45999         osd.77          up  1.00000          1.00000
-10 32.75999     host slpeah008
 78  5.45999         osd.78          up  1.00000          1.00000
 79  5.45999         osd.79          up  1.00000          1.00000
 80  5.45999         osd.80          up  1.00000          1.00000
 81  5.45999         osd.81          up  1.00000          1.00000
 82  5.45999         osd.82          up  1.00000          1.00000
 83  5.45999         osd.83          up  1.00000          1.00000
 -3 14.56000     host slpeah001
  1  3.64000         osd.1         down  1.00000          1.00000
 33  3.64000         osd.33        down  1.00000          1.00000
 34  3.64000         osd.34        down  1.00000          1.00000
 35  3.64000         osd.35        down  1.00000          1.00000
 -2 14.56000     host slpeah002
  0  3.64000         osd.0           up  1.00000          1.00000
 36  3.64000         osd.36          up  1.00000          1.00000
 37  3.64000         osd.37          up  1.00000          1.00000
 38  3.64000         osd.38          up  1.00000          1.00000
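In case it helps, this is roughly how I have been poking at the stuck PGs so far (a sketch from memory; the pg id in the last command is just a made-up example, not one of my real ones):

# list stuck undersized pgs and the osds they currently map to
ceph health detail | grep undersized | head
ceph pg dump_stuck undersized

# look at the up/acting set of one affected pg (pg id here is hypothetical)
ceph pg map 7.1a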
Crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 device39
device 40 device40
device 41 device41
device 42 device42
device 43 device43
device 44 device44
device 45 device45
device 46 device46
device 47 device47
device 48 device48
device 49 device49
device 50 device50
device 51 device51
device 52 device52
device 53 device53
device 54 device54
device 55 device55
device 56 device56
device 57 device57
device 58 device58
device 59 device59
device 60 device60
device 61 device61
device 62 device62
device 63 device63
device 64 device64
device 65 device65
device 66 device66
device 67 device67
device 68 device68
device 69 device69
device 70 device70
device 71 device71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78
device 79 osd.79
device 80 osd.80
device 81 osd.81
device 82 osd.82
device 83 osd.83

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host slpeah007 {
	id -9		# do not change unnecessarily
	# weight 32.760
	alg straw
	hash 0	# rjenkins1
	item osd.72 weight 5.460
	item osd.73 weight 5.460
	item osd.74 weight 5.460
	item osd.75 weight 5.460
	item osd.76 weight 5.460
	item osd.77 weight 5.460
}
host slpeah008 {
	id -10		# do not change unnecessarily
	# weight 32.760
	alg straw
	hash 0	# rjenkins1
	item osd.78 weight 5.460
	item osd.79 weight 5.460
	item osd.80 weight 5.460
	item osd.81 weight 5.460
	item osd.82 weight 5.460
	item osd.83 weight 5.460
}
host slpeah001 {
	id -3		# do not change unnecessarily
	# weight 14.560
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 3.640
	item osd.33 weight 3.640
	item osd.34 weight 3.640
	item osd.35 weight 3.640
}
host slpeah002 {
	id -2		# do not change unnecessarily
	# weight 14.560
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 3.640
	item osd.36 weight 3.640
	item osd.37 weight 3.640
	item osd.38 weight 3.640
}
root default {
	id -1		# do not change unnecessarily
	# weight 94.640
	alg straw
	hash 0	# rjenkins1
	item slpeah007 weight 32.760
	item slpeah008 weight 32.760
	item slpeah001 weight 14.560
	item slpeah002 weight 14.560
}

# rules
rule default {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map

This is odd: the pools have size 3 and three hosts are still alive, so why are undersized PGs being reported? It makes me feel like CRUSH is not working properly. There is not much data in the cluster at the moment, about 3 TB, and as you can see from the osd tree, each host has at least 14 TB of disk space on its OSDs. So I'm a bit stuck now... How can I find the source of the trouble?

Thanks in advance!
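P.S. If anyone wants to replay the placement offline, I believe the map can be pulled and fed to crushtool roughly like this (a sketch, flags from memory; the paths are just examples):

# grab and decompile the current crushmap
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

# check whether rule 0 can place 3 replicas at all
crushtool -i /tmp/crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings

My thinking: if crushtool reports no bad mappings, the map itself can place three replicas across the remaining hosts, and the undersized PGs must be coming from somewhere else.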