Re: Cluster Map Problems

Gregory Farnum <greg@xxxxxxxxxxx> · Thu, 28 Mar 2013 11:01:08 -0700



Your crush map looks fine to me. I'm saying that your ceph -s output
showed the OSD still hadn't been marked out. No data will be migrated
until it's marked out.
After ten minutes it should have been marked out, but that's based on
a number of factors you have some control over. If you just want a
quick check of your crush map you can mark it out manually, too.
-Greg
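
A minimal sketch of the check described above, assuming the `osdmap` line of `ceph -s` keeps the format quoted further down in this thread (`osdmap e28: 24 osds: 23 up, 24 in`). The function name `down_but_in` and the regular expression are illustrative and not part of Ceph; other releases may print this line differently.

```python
import re

# Matches the osdmap summary line as it appears in the quoted `ceph -s` output.
# Assumption: the format is "<total> osds: <up> up, <in> in".
OSDMAP_RE = re.compile(r"(\d+) osds: (\d+) up, (\d+) in")


def down_but_in(ceph_status: str) -> int:
    """Estimate how many OSDs are down but still counted as "in".

    Computed as (in - up). This is exact only when every "up" OSD is
    also "in", which holds for the output quoted in this thread.
    """
    match = OSDMAP_RE.search(ceph_status)
    if match is None:
        raise ValueError("no osdmap line found in status output")
    _total, up, in_ = (int(group) for group in match.groups())
    return max(in_ - up, 0)


status = "   osdmap e28: 24 osds: 23 up, 24 in"
print(down_but_in(status))  # 1
```

A non-zero result means an OSD is down but has not been marked out, so no data will move yet. For the manual route mentioned above, `ceph osd out 1` should mark osd.1 out by hand.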

On Thu, Mar 28, 2013 at 10:54 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Greg,
>
> I have a custom crush map, which I attached below.
> My goal is to have two racks, each of which should be a failure domain.
> That means for the rbd pool, which I use with a replication level of
> two, I want one replica in one rack and the other replica in the other
> rack, so that I could lose a whole rack and all data would still be available.
>
> For now I have just shut down one host in one of the racks. I would
> expect the now-missing objects to be re-replicated from the other rack
> to the remaining hosts in the first rack.
>
> But with my crushmap that doesn't work, so I think my crushmap is
> not right.
>
> -martin
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host store1 {
>         id -5           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.000
>         item osd.1 weight 1.000
>         item osd.2 weight 1.000
>         item osd.3 weight 1.000
> }
> host store3 {
>         id -7           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.10 weight 1.000
>         item osd.11 weight 1.000
>         item osd.8 weight 1.000
>         item osd.9 weight 1.000
> }
> host store4 {
>         id -8           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.12 weight 1.000
>         item osd.13 weight 1.000
>         item osd.14 weight 1.000
>         item osd.15 weight 1.000
> }
> host store5 {
>         id -9           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.16 weight 1.000
>         item osd.17 weight 1.000
>         item osd.18 weight 1.000
>         item osd.19 weight 1.000
> }
> host store6 {
>         id -10          # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.20 weight 1.000
>         item osd.21 weight 1.000
>         item osd.22 weight 1.000
>         item osd.23 weight 1.000
> }
> host store2 {
>         id -6           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.4 weight 1.000
>         item osd.5 weight 1.000
>         item osd.6 weight 1.000
>         item osd.7 weight 1.000
> }
> rack rack1 {
>         id -3           # do not change unnecessarily
>         # weight 12.000
>         alg straw
>         hash 0  # rjenkins1
>         item store1 weight 4.000
>         item store2 weight 4.000
>         item store3 weight 4.000
> }
> rack rack2 {
>         id -4           # do not change unnecessarily
>         # weight 12.000
>         alg straw
>         hash 0  # rjenkins1
>         item store4 weight 4.000
>         item store5 weight 4.000
>         item store6 weight 4.000
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 24.000
>         alg straw
>         hash 0  # rjenkins1
>         item rack1 weight 12.000
>         item rack2 weight 12.000
> }
>
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
>
> # end crush map
>
>
> On 28.03.2013 18:44, Gregory Farnum wrote:
>> Looks like you either have a custom config, or have specified
>> somewhere that OSDs shouldn't be marked out (i.e., by setting the 'noout'
>> flag). There can also be a bit of flux if your OSDs are reporting an
>> unusual number of failures, but you'd have seen failure reports if
>> that were going on.
>> -Greg
>>
>> On Thu, Mar 28, 2013 at 10:35 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>> Hi Greg,
>>>
>>>  /etc/init.d/ceph stop osd.1
>>> === osd.1 ===
>>> Stopping Ceph osd.1 on store1...kill 13413...done
>>> root@store1:~# date -R
>>> Thu, 28 Mar 2013 18:22:05 +0100
>>> root@store1:~# ceph -s
>>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>>> 39/904 degraded (4.314%);  recovering 15E o/s, 15EB/s; 1/24 in osds are down
>>>    monmap e1: 3 mons at
>>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>>> election epoch 6, quorum 0,1,2 a,b,c
>>>    osdmap e28: 24 osds: 23 up, 24 in
>>>     pgmap v449: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>>> MB data, 3800 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%);
>>>  recovering 15E o/s, 15EB/s
>>>    mdsmap e1: 0/0/1 up
>>>
>>>
>>> 10 mins later, still the same
>>>
>>> root@store1:~# date -R
>>> Thu, 28 Mar 2013 18:32:24 +0100
>>> root@store1:~# ceph -s
>>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>>> 39/904 degraded (4.314%); 1/24 in osds are down
>>>    monmap e1: 3 mons at
>>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>>> election epoch 6, quorum 0,1,2 a,b,c
>>>    osdmap e28: 24 osds: 23 up, 24 in
>>>     pgmap v454: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>>> MB data, 3780 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
>>>    mdsmap e1: 0/0/1 up
>>>
>>> root@store1:~#
>>>
>>>
>>> -martin
>>>
>>> On 28.03.2013 16:38, Gregory Farnum wrote:
>>>> This is the perfectly normal distinction between "down" and "out". The
>>>> OSD has been marked down but there's a timeout period (default: 5
>>>> minutes) before it's marked "out" and the data gets reshuffled (to
>>>> avoid starting replication on a simple reboot, for instance).
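
For reference, the two `date -R` timestamps in the session above can be compared against that timeout. This is only a small sketch: `DOWN_OUT_TIMEOUT` is a hypothetical name that encodes the 5-minute default mentioned here, and the real interval depends on the cluster's configuration.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Assumption: the 5-minute default quoted above; the actual value is configurable.
DOWN_OUT_TIMEOUT = timedelta(minutes=5)


def elapsed(first: str, second: str) -> timedelta:
    """Time between two RFC 2822 timestamps, as printed by `date -R`."""
    return parsedate_to_datetime(second) - parsedate_to_datetime(first)


gap = elapsed("Thu, 28 Mar 2013 18:22:05 +0100",
              "Thu, 28 Mar 2013 18:32:24 +0100")
print(gap.total_seconds())     # 619.0
print(gap > DOWN_OUT_TIMEOUT)  # True
```

By this measure the OSD had been down for longer than the default timeout while the osdmap still reported 24 OSDs in, which fits the suggestion elsewhere in the thread that some setting was keeping it from being marked out.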