Re: Cluster Map Problems

This is the perfectly normal distinction between "down" and "out". The
OSD has been marked down but there's a timeout period (default: 5
minutes) before it's marked "out" and the data gets reshuffled (to
avoid starting replication on a simple reboot, for instance).
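
If you don't want to wait for the timeout while testing, you can mark the
OSD out by hand, or suppress the behaviour entirely during planned
maintenance; roughly:

    # mark osd.1 out right away so CRUSH re-places its PGs
    ceph osd out 1

    # or keep any OSD from being marked out while you work on it
    ceph osd set noout
    ceph osd unset noout
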
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Mar 28, 2013 at 4:28 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Dan,
>
> so I changed the crushmap to:
>
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
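>
> (For completeness, the map was edited and injected the usual way, roughly:
>
>     ceph osd getcrushmap -o crushmap.bin
>     crushtool -d crushmap.bin -o crushmap.txt
>     # edit the rbd rule, then recompile and inject it
>     crushtool -c crushmap.txt -o crushmap.new
>     ceph osd setcrushmap -i crushmap.new
>
> the filenames are just placeholders.)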
>
> then I looked at one pg:
>
> 2.33d   1       0       0       0       4194304 0       0       active+clean    2013-03-28 12:12:09.937610
> 6'1     21'37   [1,18]  [1,18]  6'1     2013-03-28 11:46:40.949643      6'1     2013-03-28
> 11:46:40.949643
>
> so everything is right: the pg is mapped to osd.1 and osd.18, which are
> in two different racks.
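>
> (A quick cross-check for a single pg is something like:
>
>     ceph pg map 2.33d
>
> which prints the pg's up and acting set.)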
>
> Then I shut down osd.1 and the cluster became degraded.
>
> ceph -s
>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
> 39/904 degraded (4.314%); 1/24 in osds are down
>    monmap e1: 3 mons at
> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
> election epoch 6, quorum 0,1,2 a,b,c
>    osdmap e24: 24 osds: 23 up, 24 in
>     pgmap v426: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
> MB data, 3789 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
>    mdsmap e1: 0/0/1 up
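>
> (The affected pgs can be listed with, e.g.:
>
>     ceph health detail
>     ceph pg dump_stuck unclean
>
> I picked 2.33d again.)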
>
>
> If I now look at the pg:
>
> 2.33d   1       0       1       0       4194304 0       0       active+degraded 2013-03-28
> 12:16:11.049936 6'1     23'39   [18]    [18]    6'1     2013-03-28 11:46:40.949643      6'1
> 2013-03-28 11:46:40.949643
>
> It seems that the CRUSH algorithm doesn't find a new place for the replica.
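>
> (The rule can also be checked offline against the compiled map, roughly:
>
>     crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --weight 1 0 --show-mappings
>
> i.e. re-run the mapping with osd.1 weighted to 0.)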
>
> ceph pg 2.33d query
>
> { "state": "active+degraded",
>   "epoch": 24,
>   "up": [
>         18],
>   "acting": [
>         18],
>   "info": { "pgid": "2.33d",
>       "last_update": "6'1",
>       "last_complete": "6'1",
>       "log_tail": "0'0",
>       "last_backfill": "MAX",
>       "purged_snaps": "[]",
>       "history": { "epoch_created": 1,
>           "last_epoch_started": 24,
>           "last_epoch_clean": 24,
>           "last_epoch_split": 0,
>           "same_up_since": 23,
>           "same_interval_since": 23,
>           "same_primary_since": 23,
>           "last_scrub": "6'1",
>           "last_scrub_stamp": "2013-03-28 11:46:40.949643",
>           "last_deep_scrub": "6'1",
>           "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
>           "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643"},
>       "stats": { "version": "6'1",
>           "reported": "23'39",
>           "state": "active+degraded",
>           "last_fresh": "2013-03-28 12:16:11.059607",
>           "last_change": "2013-03-28 12:16:11.049936",
>           "last_active": "2013-03-28 12:16:11.059607",
>           "last_clean": "2013-03-28 11:44:59.181618",
>           "last_unstale": "2013-03-28 12:16:11.059607",
>           "mapping_epoch": 21,
>           "log_start": "0'0",
>           "ondisk_log_start": "0'0",
>           "created": 1,
>           "last_epoch_clean": 1,
>           "parent": "0.0",
>           "parent_split_bits": 0,
>           "last_scrub": "6'1",
>           "last_scrub_stamp": "2013-03-28 11:46:40.949643",
>           "last_deep_scrub": "6'1",
>           "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
>           "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643",
>           "log_size": 0,
>           "ondisk_log_size": 0,
>           "stats_invalid": "0",
>           "stat_sum": { "num_bytes": 4194304,
>               "num_objects": 1,
>               "num_object_clones": 0,
>               "num_object_copies": 0,
>               "num_objects_missing_on_primary": 0,
>               "num_objects_degraded": 0,
>               "num_objects_unfound": 0,
>               "num_read": 0,
>               "num_read_kb": 0,
>               "num_write": 1,
>               "num_write_kb": 4096,
>               "num_scrub_errors": 0,
>               "num_objects_recovered": 2,
>               "num_bytes_recovered": 8388608,
>               "num_keys_recovered": 0},
>           "stat_cat_sum": {},
>           "up": [
>                 18],
>           "acting": [
>                 18]},
>       "empty": 0,
>       "dne": 0,
>       "incomplete": 0,
>       "last_epoch_started": 24},
>   "recovery_state": [
>         { "name": "Started\/Primary\/Active",
>           "enter_time": "2013-03-28 12:16:11.049925",
>           "might_have_unfound": [],
>           "recovery_progress": { "backfill_target": -1,
>               "waiting_on_backfill": 0,
>               "backfill_pos": "0\/\/0\/\/-1",
>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "backfills_in_flight": [],
>               "pull_from_peer": [],
>               "pushing": []},
>           "scrub": { "scrubber.epoch_start": "0",
>               "scrubber.active": 0,
>               "scrubber.block_writes": 0,
>               "scrubber.finalizing": 0,
>               "scrubber.waiting_on": 0,
>               "scrubber.waiting_on_whom": []}},
>         { "name": "Started",
>           "enter_time": "2013-03-28 12:16:10.029226"}]}
>
> -martin
>
>
> On 28.03.2013 08:57, Dan van der Ster wrote:
>> Shouldn't it just be:
>>
>>         step take default
>>         step chooseleaf firstn 0 type rack
>>         step emit
>>
>> Like he has for data and metadata?
>>
>> --
>> Dan
>>
>> On Thu, Mar 28, 2013 at 2:51 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>> Hi John,
>>>
>>> I still think this part of the crushmap is wrong.
>>>
>>>         step take default
>>>         step choose firstn 0 type rack
>>>         step chooseleaf firstn 0 type host
>>>         step emit
>>>
>>> First I take from the default root -> that's okay.
>>> Then I choose two racks -> that's still okay.
>>> But then I choose two hosts in each rack -> that would be four
>>> locations, but I have a replication level of 2.
>>>
>>> Or do I misunderstand the placement?
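>>>
>>> (I suppose both variants could be compared offline with something like:
>>>
>>>     crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-bad-mappings
>>>
>>> which lists inputs that map to fewer OSDs than requested.)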
>>>
>>> -martin
>>>
>>> On 28.03.2013 02:25, John Wilkins wrote:
>>>> So the OSD you shut down is down and in. How long does it stay in the
>>>> degraded state? In the docs here,
>>>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/ , we
>>>> discuss the notion that a down OSD is not technically out of the
>>>> cluster for a while. I believe the default value is 300 seconds, i.e.
>>>> 5 minutes. From what I can see from your "ceph osd tree" command, all
>>>> your OSDs are running. You can change the time it takes to mark a down
>>>> OSD out. That's "mon osd down out interval", discussed in this section:
>>>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#degraded
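>>>>
>>>> (In ceph.conf that would look something like:
>>>>
>>>>     [mon]
>>>>         mon osd down out interval = 600
>>>>
>>>> the value is in seconds.)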
>>>>
>>>> On Wed, Mar 27, 2013 at 5:56 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> that's the config http://pastebin.com/2JzABSYt
>>>>> ceph osd dump http://pastebin.com/GSCGKL1k
>>>>> ceph osd tree http://pastebin.com/VSgPFRYv
>>>>>
>>>>> As far as I can tell they are not mapped right.
>>>>>
>>>>> osdmap e133 pool 'rbd' (2) object '2.31a' -> pg 2.f3caaf00 (2.300) -> up
>>>>> [13,23] acting [13,23]
>>>>>
>>>>> -martin
>>>>>
>>>>> On 28.03.2013 01:09, John Wilkins wrote:
>>>>>> We need a bit more information. If you can do: "ceph osd dump", "ceph
>>>>>> osd tree", and paste your ceph conf, we might get a bit further. The
>>>>>> CRUSH hierarchy looks okay. I can't see the replica size from this
>>>>>> though.
>>>>>>
>>>>>> Have you followed this procedure to see if your object is getting
>>>>>> remapped? http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#finding-an-object-location
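>>>>>>
>>>>>> (Roughly, with a throw-away object and file name:
>>>>>>
>>>>>>     rados -p rbd put test-object /tmp/some-file
>>>>>>     ceph osd map rbd test-object
>>>>>>
>>>>>> the second command prints the pg and its up/acting OSD set.)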
>>>>>>
>>>>>> On Thu, Mar 21, 2013 at 12:02 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to change my crushmap to reflect my setup: I have two racks with
>>>>>>> three hosts each. I want to use a replication size of 2 for the rbd pool.
>>>>>>> The failure domain should be the rack, so there should be one replica in
>>>>>>> each rack. That works so far.
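>>>>>>>
>>>>>>> (The pool size itself is set and checked with something like:
>>>>>>>
>>>>>>>     ceph osd pool set rbd size 2
>>>>>>>     ceph osd pool get rbd size
>>>>>>>
>>>>>>> just for reference.)
>>>>>>>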
>>>>>>> But if I shut down a host, the cluster stays degraded; I want the
>>>>>>> now-missing replicas to be replicated to the two remaining hosts in
>>>>>>> that rack.
>>>>>>>
>>>>>>> Here is crushmap.
>>>>>>> http://pastebin.com/UaB6LfKs
>>>>>>>
>>>>>>> Any idea what I did wrong?
>>>>>>>
>>>>>>> -martin
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



