This is the perfectly normal distinction between "down" and "out". The
OSD has been marked down, but there's a timeout period (default: 5
minutes) before it's marked "out" and the data gets reshuffled (to
avoid starting replication on a simple reboot, for instance).
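
If you just want to watch the remapping happen while testing, you don't
have to wait out the timeout; you can mark the stopped OSD out by hand
and check the PG again. Something along these lines should do it (osd.1
and pg 2.33d are taken from your output below):

    ceph osd out 1          # mark osd.1 out immediately instead of waiting
    ceph -w                 # watch recovery kick in
    ceph pg 2.33d query     # "up"/"acting" should grow back to two OSDs

The timeout itself is "mon osd down out interval" (in seconds), so if
the default doesn't suit your setup you can change it in ceph.conf, for
example:

    [mon]
        mon osd down out interval = 300    # 5 minutes, the default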

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Mar 28, 2013 at 4:28 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Dan,
>
> so I changed the crushmap to:
>
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
>
> Then I look at one pg:
>
> 2.33d 1 0 0 0 4194304 0 0 active+clean 2013-03-28 12:12:09.937610 6'1 21'37 [1,18] [1,18] 6'1 2013-03-28 11:46:40.949643 6'1 2013-03-28 11:46:40.949643
>
> So everything is right: the pg is mapped to osd.1 and osd.18, which are
> in two different racks.
>
> Then I shut down osd.1 and the cluster becomes degraded.
>
> ceph -s
>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery 39/904 degraded (4.314%); 1/24 in osds are down
>    monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 6, quorum 0,1,2 a,b,c
>    osdmap e24: 24 osds: 23 up, 24 in
>    pgmap v426: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800 MB data, 3789 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
>    mdsmap e1: 0/0/1 up
>
> If I now look at the pg:
>
> 2.33d 1 0 1 0 4194304 0 0 active+degraded 2013-03-28 12:16:11.049936 6'1 23'39 [18] [18] 6'1 2013-03-28 11:46:40.949643 6'1 2013-03-28 11:46:40.949643
>
> It seems that the CRUSH algorithm doesn't find a new place for the replica.
>
> ceph pg 2.33d query
>
> { "state": "active+degraded",
>   "epoch": 24,
>   "up": [
>         18],
>   "acting": [
>         18],
>   "info": { "pgid": "2.33d",
>     "last_update": "6'1",
>     "last_complete": "6'1",
>     "log_tail": "0'0",
>     "last_backfill": "MAX",
>     "purged_snaps": "[]",
>     "history": { "epoch_created": 1,
>       "last_epoch_started": 24,
>       "last_epoch_clean": 24,
>       "last_epoch_split": 0,
>       "same_up_since": 23,
>       "same_interval_since": 23,
>       "same_primary_since": 23,
>       "last_scrub": "6'1",
>       "last_scrub_stamp": "2013-03-28 11:46:40.949643",
>       "last_deep_scrub": "6'1",
>       "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
>       "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643"},
>     "stats": { "version": "6'1",
>       "reported": "23'39",
>       "state": "active+degraded",
>       "last_fresh": "2013-03-28 12:16:11.059607",
>       "last_change": "2013-03-28 12:16:11.049936",
>       "last_active": "2013-03-28 12:16:11.059607",
>       "last_clean": "2013-03-28 11:44:59.181618",
>       "last_unstale": "2013-03-28 12:16:11.059607",
>       "mapping_epoch": 21,
>       "log_start": "0'0",
>       "ondisk_log_start": "0'0",
>       "created": 1,
>       "last_epoch_clean": 1,
>       "parent": "0.0",
>       "parent_split_bits": 0,
>       "last_scrub": "6'1",
>       "last_scrub_stamp": "2013-03-28 11:46:40.949643",
>       "last_deep_scrub": "6'1",
>       "last_deep_scrub_stamp": "2013-03-28 11:46:40.949643",
>       "last_clean_scrub_stamp": "2013-03-28 11:46:40.949643",
>       "log_size": 0,
>       "ondisk_log_size": 0,
>       "stats_invalid": "0",
>       "stat_sum": { "num_bytes": 4194304,
>         "num_objects": 1,
>         "num_object_clones": 0,
>         "num_object_copies": 0,
>         "num_objects_missing_on_primary": 0,
>         "num_objects_degraded": 0,
>         "num_objects_unfound": 0,
>         "num_read": 0,
>         "num_read_kb": 0,
>         "num_write": 1,
>         "num_write_kb": 4096,
>         "num_scrub_errors": 0,
>         "num_objects_recovered": 2,
>         "num_bytes_recovered": 8388608,
>         "num_keys_recovered": 0},
>       "stat_cat_sum": {},
>       "up": [
>         18],
>       "acting": [
>         18]},
>     "empty": 0,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 24},
>   "recovery_state": [
>     { "name": "Started\/Primary\/Active",
>       "enter_time": "2013-03-28 12:16:11.049925",
>       "might_have_unfound": [],
>       "recovery_progress": { "backfill_target": -1,
>         "waiting_on_backfill": 0,
>         "backfill_pos": "0\/\/0\/\/-1",
>         "backfill_info": { "begin": "0\/\/0\/\/-1",
>           "end": "0\/\/0\/\/-1",
>           "objects": []},
>         "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>           "end": "0\/\/0\/\/-1",
>           "objects": []},
>         "backfills_in_flight": [],
>         "pull_from_peer": [],
>         "pushing": []},
>       "scrub": { "scrubber.epoch_start": "0",
>         "scrubber.active": 0,
>         "scrubber.block_writes": 0,
>         "scrubber.finalizing": 0,
>         "scrubber.waiting_on": 0,
>         "scrubber.waiting_on_whom": []}},
>     { "name": "Started",
>       "enter_time": "2013-03-28 12:16:10.029226"}]}
>
> -martin
>
> On 28.03.2013 08:57, Dan van der Ster wrote:
>> Shouldn't it just be:
>>
>> step take default
>> step chooseleaf firstn 0 type rack
>> step emit
>>
>> Like he has for data and metadata?
>>
>> --
>> Dan
>>
>> On Thu, Mar 28, 2013 at 2:51 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>> Hi John,
>>>
>>> I still think this part of the crushmap is wrong:
>>>
>>> step take default
>>> step choose firstn 0 type rack
>>> step chooseleaf firstn 0 type host
>>> step emit
>>>
>>> First I take from the default -> that's okay.
>>> Now I take two racks -> that's still okay.
>>> But now I will take 2 hosts in each rack -> that would be 4 locations,
>>> but I have a replication level of 2.
>>>
>>> Or don't I understand the placement right?
>>>
>>> -martin
>>>
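
For what it's worth: with "choose firstn 0 type rack" followed by
"chooseleaf firstn 0 type host", CRUSH does build a longer candidate
list (here 2 racks x 2 hosts = 4 OSDs), but as far as I can tell only
the first entries, up to the pool's replica size, are actually used, so
with size 2 both copies can easily land in the first rack. The easiest
way to convince yourself what a rule really does is to test the
compiled crushmap offline before injecting it. A rough sketch (the file
names are just examples; rule 2 and --num-rep 2 match the rbd pool
above):

    ceph osd getcrushmap -o crush.bin     # grab the current compiled map
    crushtool -d crush.bin -o crush.txt   # decompile, then edit the rbd rule
    crushtool -c crush.txt -o crush.new   # recompile
    crushtool -i crush.new --test --rule 2 --num-rep 2 --show-mappings
    ceph osd setcrushmap -i crush.new     # inject once the mappings look right

The --test run prints the OSDs each sample input maps to, so you can
see directly whether a rule puts both replicas in one rack or one in
each.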

>>> On 28.03.2013 02:25, John Wilkins wrote:
>>>> So the OSD you shut down is down and in. How long does it stay in the
>>>> degraded state? In the docs here,
>>>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/ , we
>>>> discuss the notion that a down OSD is not technically out of the
>>>> cluster for a while. I believe the default value is 300 seconds, which
>>>> is 5 minutes. From what I can see from your "ceph osd tree" output,
>>>> all your OSDs are running. You can change the time it takes to mark a
>>>> down OSD out; that's "mon osd down out interval", discussed in this
>>>> section:
>>>> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#degraded
>>>>
>>>> On Wed, Mar 27, 2013 at 5:56 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> That's the config: http://pastebin.com/2JzABSYt
>>>>> ceph osd dump: http://pastebin.com/GSCGKL1k
>>>>> ceph osd tree: http://pastebin.com/VSgPFRYv
>>>>>
>>>>> As far as I can tell they are not mapped right.
>>>>>
>>>>> osdmap e133 pool 'rbd' (2) object '2.31a' -> pg 2.f3caaf00 (2.300) -> up [13,23] acting [13,23]
>>>>>
>>>>> -martin
>>>>>
>>>>> On 28.03.2013 01:09, John Wilkins wrote:
>>>>>> We need a bit more information. If you can do "ceph osd dump" and
>>>>>> "ceph osd tree", and paste your ceph.conf, we might get a bit further.
>>>>>> The CRUSH hierarchy looks okay, but I can't see the replica size from
>>>>>> this.
>>>>>>
>>>>>> Have you followed this procedure to see if your object is getting
>>>>>> remapped? http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#finding-an-object-location
>>>>>>
>>>>>> On Thu, Mar 21, 2013 at 12:02 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to change my crushmap to reflect my setup: I have two racks
>>>>>>> with 3 hosts each, and I want to use a replication size of 2 for the
>>>>>>> rbd pool.
>>>>>>> The failure domain should be the rack, so each replica should be in a
>>>>>>> different rack. That works so far.
>>>>>>> But if I shut down a host, the cluster stays degraded; I want the now
>>>>>>> missing replicas to be replicated to the two remaining hosts in that
>>>>>>> rack.
>>>>>>>
>>>>>>> Here is the crushmap:
>>>>>>> http://pastebin.com/UaB6LfKs
>>>>>>>
>>>>>>> Any idea what I did wrong?
>>>>>>>
>>>>>>> -martin
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com