Hi Dan,

so I changed the crushmap to:

rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

Then I looked at one pg:

2.33d   1   0   0   0   4194304   0   0   active+clean   2013-03-28 12:12:09.937610   6'1   21'37   [1,18]   [1,18]   6'1   2013-03-28 11:46:40.949643   6'1   2013-03-28 11:46:40.949643

So everything is right: the pg is mapped to osd.1 and osd.18, which are in two different racks.

Then I shut down osd.1 and the cluster became degraded.

ceph -s
   health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery 39/904 degraded (4.314%); 1/24 in osds are down
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 6, quorum 0,1,2 a,b,c
   osdmap e24: 24 osds: 23 up, 24 in
    pgmap v426: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800 MB data, 3789 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
   mdsmap e1: 0/0/1 up

If I now look at the pg:

2.33d   1   0   1   0   4194304   0   0   active+degraded   2013-03-28 12:16:11.049936   6'1   23'39   [18]   [18]   6'1   2013-03-28 11:46:40.949643   6'1   2013-03-28 11:46:40.949643

It seems that the CRUSH algorithm doesn't find a new place for the replica.

ceph pg 2.33d query
{ "state": "active+degraded",
  "epoch": 24,
  "up": [18],
  "acting": [18],
  "info": { "pgid": "2.33d",
      "last_update": "6'1",
      "last_complete": "6'1",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 24,
          "last_epoch_clean": 24,
          "last_epoch_split": 0,
          "same_up_since": 23,
          "same_interval_since": 23,
          "same_primary_since": 23,
          "last_scrub": "6'1",
          "last_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_deep_scrub": "6'1",
          "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643"},
      "stats": { "version": "6'1",
          "reported": "23'39",
          "state": "active+degraded",
          "last_fresh": "2013-03-28 12:16:11.059607",
          "last_change": "2013-03-28 12:16:11.049936",
          "last_active": "2013-03-28 12:16:11.059607",
          "last_clean": "2013-03-28 11:44:59.181618",
          "last_unstale": "2013-03-28 12:16:11.059607",
          "mapping_epoch": 21,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 1,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "6'1",
          "last_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_deep_scrub": "6'1",
          "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 4194304,
              "num_objects": 1,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 1,
              "num_write_kb": 4096,
              "num_scrub_errors": 0,
              "num_objects_recovered": 2,
              "num_bytes_recovered": 8388608,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [18],
          "acting": [18]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 24},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-03-28 12:16:11.049925",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-03-28 12:16:10.029226"}]}
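As a cross-check, the rule can also be exercised offline with crushtool's test mode, which shows where each pg would be placed without touching the cluster. This is only a rough sketch: it assumes the compiled map was saved as crushmap.bin (a hypothetical filename), that ruleset 2 is the rbd rule, and that the installed crushtool supports these test options.

# show the OSD pair every input x would map to for a 2-replica pool
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-mappings

# simulate osd.1 being out by giving it weight 0 in the test run
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --weight 1 0 --show-mappings

If the second run still produces single-OSD mappings, the rule itself cannot find a second location; if it picks another host in the same rack, the problem is more likely that the down OSD has not been marked out yet.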
"scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-03-28 12:16:10.029226"}]} -martin On 28.03.2013 08:57, Dan van der Ster wrote: > Shouldn't it just be: > > step take default > step chooseleaf firstn 0 type rack > step emit > > Like he has for data and metadata? > > -- > Dan > > On Thu, Mar 28, 2013 at 2:51 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >> Hi John, >> >> I still think this part in the crushmap is wrong. >> >> step take default >> step choose firstn 0 type rack >> step chooseleaf firstn 0 type host >> step emit >> >> I first take from the defaut -> that's okay, >> Now I take two from the rack -> that's still ok >> But now, I will take 2 host in each rack, -> that would be 4 locations, >> but I have a replication level of 2. >> >> Or don't I understand the placement right? >> >> -martin >> >> On 28.03.2013 02:25, John Wilkins wrote: >>> So the OSD you shutdown is down and in. How long does it stay in the >>> degraded state? In the docs here, >>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/ , we >>> discuss the notion that a down OSD is not technically out of the >>> cluster for awhile. I believe the default value is 300 seconds, which >>> is about 5 minutes. From what I can see from your "ceph osd tree" >>> command, all your OSDs are running. You can change the time it takes >>> to mark a down OSD out. That's " mon osd down out interval", discussed >>> in this section: >>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#degraded >>> >>> On Wed, Mar 27, 2013 at 5:56 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >>>> Hi, >>>> >>>> that's the config http://pastebin.com/2JzABSYt >>>> ceph osd dump http://pastebin.com/GSCGKL1k >>>> ceph osd tree http://pastebin.com/VSgPFRYv >>>> >>>> As far as I can tell they are not mapped right. >>>> >>>> sdmap e133 pool 'rbd' (2) object '2.31a' -> pg 2.f3caaf00 (2.300) -> up >>>> [13,23] acting [13,23] >>>> >>>> -martin >>>> >>>> On 28.03.2013 01:09, John Wilkins wrote: >>>>> We need a bit more information. If you can do: "ceph osd dump", "ceph >>>>> osd tree", and paste your ceph conf, we might get a bit further. The >>>>> CRUSH hierarchy looks okay. I can't see the replica size from this >>>>> though. >>>>> >>>>> Have you followed this procedure to see if your object is getting >>>>> remapped? http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#finding-an-object-location >>>>> >>>>> On Thu, Mar 21, 2013 at 12:02 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >>>>>> Hi, >>>>>> >>>>>> I want to change my crushmap to reflect my setup, I have two racks with >>>>>> each 3 hosts. I want to use for the rbd pool a replication size of 2. >>>>>> The failure domain should be the rack, so each replica should be in each >>>>>> rack. That works so far. >>>>>> But if I shutdown a host the clusters stays degraded, but I want that >>>>>> the now missing replicas get replicated to the two remaining hosts in >>>>>> this rack. >>>>>> >>>>>> Here is crushmap. >>>>>> http://pastebin.com/UaB6LfKs >>>>>> >>>>>> Any idea what I did wrong? 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com