Hi Greg,

I have a custom CRUSH map, which I attached below. My goal is to have two racks, with each rack acting as a failure domain. That means for the rbd pool, which I use with a replication level of two, I want one replica in one rack and the other replica in the other rack, so that I could lose a whole rack and still have all data available.

At the moment I have just shut down one host in one of the racks. I would expect the now-missing objects to get replicated from the other rack onto the remaining hosts in the first rack (the one where I shut down a host). But with my CRUSH map that doesn't happen, so I think my CRUSH map is not right.

-martin

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host store1 {
	id -5		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
}
host store3 {
	id -7		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
}
host store4 {
	id -8		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.12 weight 1.000
	item osd.13 weight 1.000
	item osd.14 weight 1.000
	item osd.15 weight 1.000
}
host store5 {
	id -9		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.16 weight 1.000
	item osd.17 weight 1.000
	item osd.18 weight 1.000
	item osd.19 weight 1.000
}
host store6 {
	id -10		# do not change unnecessarily
	#
weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 1.000
}
host store2 {
	id -6		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
}
rack rack1 {
	id -3		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store1 weight 4.000
	item store2 weight 4.000
	item store3 weight 4.000
}
rack rack2 {
	id -4		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store4 weight 4.000
	item store5 weight 4.000
	item store6 weight 4.000
}
root default {
	id -1		# do not change unnecessarily
	# weight 24.000
	alg straw
	hash 0	# rjenkins1
	item rack1 weight 12.000
	item rack2 weight 12.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}

# end crush map


On 28.03.2013 18:44, Gregory Farnum wrote:
> Looks like you either have a custom config, or have specified
> somewhere that OSDs shouldn't be marked out (i.e., setting the 'noout'
> flag). There can also be a bit of flux if your OSDs are reporting an
> unusual number of failures, but you'd have seen failure reports if
> that were going on.
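[Editorial note on the noout point Greg raises: the transition he describes is governed, to my understanding, by the monitor option `mon osd down out interval`, which sets how long an OSD may stay "down" before being marked "out" and recovery begins. A sketch of the relevant ceph.conf fragment — the option name and its 300-second default are taken from the Ceph configuration reference; the value shown is just the default, not a recommendation:]

```ini
[mon]
	; seconds a "down" OSD may remain before being marked "out"
	; and data reshuffling starts (documented default: 300, i.e. 5 minutes)
	mon osd down out interval = 300
```

[Setting the `noout` flag (`ceph osd set noout`) suppresses this transition entirely, which would match the symptoms below; `ceph osd unset noout` restores normal behaviour, and `ceph osd out <id>` marks a specific OSD out immediately without waiting for the timeout.]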
> -Greg
>
> On Thu, Mar 28, 2013 at 10:35 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>> Hi Greg,
>>
>> /etc/init.d/ceph stop osd.1
>> === osd.1 ===
>> Stopping Ceph osd.1 on store1...kill 13413...done
>> root@store1:~# date -R
>> Thu, 28 Mar 2013 18:22:05 +0100
>> root@store1:~# ceph -s
>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>> 39/904 degraded (4.314%); recovering 15E o/s, 15EB/s; 1/24 in osds are down
>>    monmap e1: 3 mons at
>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>> election epoch 6, quorum 0,1,2 a,b,c
>>    osdmap e28: 24 osds: 23 up, 24 in
>>    pgmap v449: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>> MB data, 3800 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%);
>> recovering 15E o/s, 15EB/s
>>    mdsmap e1: 0/0/1 up
>>
>>
>> 10 mins later, still the same
>>
>> root@store1:~# date -R
>> Thu, 28 Mar 2013 18:32:24 +0100
>> root@store1:~# ceph -s
>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>> 39/904 degraded (4.314%); 1/24 in osds are down
>>    monmap e1: 3 mons at
>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>> election epoch 6, quorum 0,1,2 a,b,c
>>    osdmap e28: 24 osds: 23 up, 24 in
>>    pgmap v454: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>> MB data, 3780 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
>>    mdsmap e1: 0/0/1 up
>>
>> root@store1:~#
>>
>>
>> -martin
>>
>> On 28.03.2013 16:38, Gregory Farnum wrote:
>>> This is the perfectly normal distinction between "down" and "out". The
>>> OSD has been marked down but there's a timeout period (default: 5
>>> minutes) before it's marked "out" and the data gets reshuffled (to
>>> avoid starting replication on a simple reboot, for instance).

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
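[Editorial note on the map at the top of the thread: the rbd rule's `step chooseleaf firstn 0 type rack` selects one leaf (an OSD under some host) in each distinct rack, so two replicas never share a rack — and when one host fails, CRUSH can retry onto the surviving hosts of the same rack. A toy sketch of that behaviour follows; this is NOT the real CRUSH algorithm (straw-bucket hashing is replaced by a simple md5-based pick, and `place`/`_pick` are made-up helpers), only the host/OSD layout is copied from the map above:]

```python
# Toy model of "step chooseleaf firstn 0 type rack" with size=2:
# pick one OSD from one host per rack. Not real CRUSH -- the straw
# bucket hash is replaced by a crude deterministic md5 pick.
import hashlib

RACKS = {
    "rack1": {"store1": [0, 1, 2, 3], "store2": [4, 5, 6, 7],
              "store3": [8, 9, 10, 11]},
    "rack2": {"store4": [12, 13, 14, 15], "store5": [16, 17, 18, 19],
              "store6": [20, 21, 22, 23]},
}

def _pick(seq, key):
    """Deterministically pick one element of seq based on a hash of key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return seq[h % len(seq)]

def place(pg, down_hosts=()):
    """Return one OSD id per rack for placement group `pg`.

    Hosts listed in down_hosts are skipped, mimicking CRUSH retrying
    within the same rack when a chosen host is unavailable.
    """
    acting = []
    for rack in sorted(RACKS):
        alive = {h: osds for h, osds in RACKS[rack].items()
                 if h not in down_hosts}
        host = _pick(sorted(alive), pg + "/" + rack)
        acting.append(_pick(alive[host], pg + "/" + host))
    return acting
```

[With all hosts up, every placement yields one OSD from rack1 (osd.0-11) and one from rack2 (osd.12-23); with store1 down, rack1's replica falls on store2 or store3 rather than the mapping going unfillable — which is why the degraded state in the thread pointed at the down/out timeout rather than the map itself.]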