Hi Dan,

so I changed the crushmap to:

rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

Then I looked at one pg:

2.33d   1   0   0   0   4194304   0   0   active+clean   2013-03-28 12:12:09.937610   6'1   21'37   [1,18]   [1,18]   6'1   2013-03-28 11:46:40.949643   6'1   2013-03-28 11:46:40.949643

So everything is right: the pg is mapped to osd.1 and osd.18, which are in two different racks.

Then I shut down osd.1 and the cluster became degraded.

ceph -s
   health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery 39/904 degraded (4.314%); 1/24 in osds are down
   monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 6, quorum 0,1,2 a,b,c
   osdmap e24: 24 osds: 23 up, 24 in
    pgmap v426: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800 MB data, 3789 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
   mdsmap e1: 0/0/1 up

If I now look at the pg:

2.33d   1   0   1   0   4194304   0   0   active+degraded   2013-03-28 12:16:11.049936   6'1   23'39   [18]   [18]   6'1   2013-03-28 11:46:40.949643   6'1   2013-03-28 11:46:40.949643

It seems that the CRUSH algorithm doesn't find a new place for the replica.

ceph pg 2.33d query
{ "state": "active+degraded",
  "epoch": 24,
  "up": [18],
  "acting": [18],
  "info": { "pgid": "2.33d",
      "last_update": "6'1",
      "last_complete": "6'1",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 24,
          "last_epoch_clean": 24,
          "last_epoch_split": 0,
          "same_up_since": 23,
          "same_interval_since": 23,
          "same_primary_since": 23,
          "last_scrub": "6'1",
          "last_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_deep_scrub": "6'1",
          "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643"},
      "stats": { "version": "6'1",
          "reported": "23'39",
          "state": "active+degraded",
          "last_fresh": "2013-03-28 12:16:11.059607",
          "last_change": "2013-03-28 12:16:11.049936",
          "last_active": "2013-03-28 12:16:11.059607",
          "last_clean": "2013-03-28 11:44:59.181618",
          "last_unstale": "2013-03-28 12:16:11.059607",
          "mapping_epoch": 21,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 1,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "6'1",
          "last_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_deep_scrub": "6'1",
          "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
          "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 4194304,
              "num_objects": 1,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 1,
              "num_write_kb": 4096,
              "num_scrub_errors": 0,
              "num_objects_recovered": 2,
              "num_bytes_recovered": 8388608,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [18],
          "acting": [18]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 24},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-03-28 12:16:11.049925",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-03-28 12:16:10.029226"}]}
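As a cross-check, the rule can also be exercised offline with crushtool's test mode, which shows where each pg would be placed without touching the cluster. This is only a rough sketch: it assumes the compiled map was saved as crushmap.bin (a hypothetical filename), that ruleset 2 is the rbd rule, and that the installed crushtool supports these test options.

# show the OSD pair every input x would map to for a 2-replica pool
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-mappings

# simulate osd.1 being out by giving it weight 0 in the test run
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --weight 1 0 --show-mappings

If the second run still produces single-OSD mappings, the rule itself cannot find a second location; if it picks another host in the same rack, the problem is more likely that the down OSD has not been marked out yet.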
"scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-03-28 12:16:10.029226"}]} -martin On 28.03.2013 08:57, Dan van der Ster wrote: > Shouldn't it just be: > > step take default > step chooseleaf firstn 0 type rack > step emit > > Like he has for data and metadata? > > -- > Dan > > On Thu, Mar 28, 2013 at 2:51 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >> Hi John, >> >> I still think this part in the crushmap is wrong. >> >> step take default >> step choose firstn 0 type rack >> step chooseleaf firstn 0 type host >> step emit >> >> I first take from the defaut -> that's okay, >> Now I take two from the rack -> that's still ok >> But now, I will take 2 host in each rack, -> that would be 4 locations, >> but I have a replication level of 2. >> >> Or don't I understand the placement right? >> >> -martin >> >> On 28.03.2013 02:25, John Wilkins wrote: >>> So the OSD you shutdown is down and in. How long does it stay in the >>> degraded state? In the docs here, >>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/ , we >>> discuss the notion that a down OSD is not technically out of the >>> cluster for awhile. I believe the default value is 300 seconds, which >>> is about 5 minutes. From what I can see from your "ceph osd tree" >>> command, all your OSDs are running. You can change the time it takes >>> to mark a down OSD out. That's " mon osd down out interval", discussed >>> in this section: >>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#degraded >>> >>> On Wed, Mar 27, 2013 at 5:56 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >>>> Hi, >>>> >>>> that's the config http://pastebin.com/2JzABSYt >>>> ceph osd dump http://pastebin.com/GSCGKL1k >>>> ceph osd tree http://pastebin.com/VSgPFRYv >>>> >>>> As far as I can tell they are not mapped right. >>>> >>>> sdmap e133 pool 'rbd' (2) object '2.31a' -> pg 2.f3caaf00 (2.300) -> up >>>> [13,23] acting [13,23] >>>> >>>> -martin >>>> >>>> On 28.03.2013 01:09, John Wilkins wrote: >>>>> We need a bit more information. If you can do: "ceph osd dump", "ceph >>>>> osd tree", and paste your ceph conf, we might get a bit further. The >>>>> CRUSH hierarchy looks okay. I can't see the replica size from this >>>>> though. >>>>> >>>>> Have you followed this procedure to see if your object is getting >>>>> remapped? http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#finding-an-object-location >>>>> >>>>> On Thu, Mar 21, 2013 at 12:02 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote: >>>>>> Hi, >>>>>> >>>>>> I want to change my crushmap to reflect my setup, I have two racks with >>>>>> each 3 hosts. I want to use for the rbd pool a replication size of 2. >>>>>> The failure domain should be the rack, so each replica should be in each >>>>>> rack. That works so far. >>>>>> But if I shutdown a host the clusters stays degraded, but I want that >>>>>> the now missing replicas get replicated to the two remaining hosts in >>>>>> this rack. >>>>>> >>>>>> Here is crushmap. >>>>>> http://pastebin.com/UaB6LfKs >>>>>> >>>>>> Any idea what I did wrong? 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com