Good morning,

some days ago we created a new pool with 512 pgs and originally 5 osds. We use the device class "ssd" and a crush rule that maps all data for the pool "ssd" to the osds of the ssd device class. While the pool was being created, one of the ssds failed, and we are left with 4 osds:

[10:00:22] server2.place6:/var/log/ceph# ceph osd tree
ID CLASS     WEIGHT    TYPE NAME        STATUS REWEIGHT PRI-AFF
-1           135.12505 root default
-7            51.36911     host server2
15 hdd-big     9.09511         osd.15       up  1.00000 1.00000
20 hdd-big     9.09511         osd.20       up  1.00000 1.00000
21 hdd-big     9.09511         osd.21       up  1.00000 1.00000
 7 hdd-small   4.54776         osd.7        up  1.00000 1.00000
 8 hdd-small   4.54776         osd.8        up  1.00000 1.00000
10 hdd-small   4.54776         osd.10       up  1.00000 1.00000
26 hdd-small   4.54776         osd.26       up  1.00000 1.00000
14 notinuse    5.45741         osd.14       up  1.00000 1.00000
12 ssd         0.21767         osd.12       up  1.00000 1.00000
24 ssd         0.21767         osd.24       up  1.00000 1.00000
-5            42.50967     host server3
 9 hdd-big     9.09511         osd.9        up  1.00000 1.00000
16 hdd-big     9.09511         osd.16       up  1.00000 1.00000
19 hdd-big     9.09511         osd.19       up  1.00000 1.00000
 3 hdd-small   4.54776         osd.3        up  1.00000 1.00000
 5 hdd-small   4.54776         osd.5        up  1.00000 1.00000
 6 hdd-small   4.54776         osd.6        up  1.00000 1.00000
11 notinuse    0.45424         osd.11       up  1.00000 1.00000
13 notinuse    0.90907         osd.13       up  1.00000 1.00000
25 ssd         0.21776         osd.25       up  1.00000 1.00000
-2            41.24626     host server4
 2 hdd-big     9.09511         osd.2        up  1.00000 1.00000
17 hdd-big     9.09511         osd.17       up  1.00000 1.00000
18 hdd-big     9.09511         osd.18       up  1.00000 1.00000
 0 hdd-small   4.54776         osd.0        up  1.00000 1.00000
 1 hdd-small   4.54776         osd.1        up  1.00000 1.00000
22 hdd-small   4.54776         osd.22       up  1.00000 1.00000
 4 notinuse    0.09999         osd.4        up  1.00000 1.00000
23 ssd         0.21767         osd.23       up  1.00000 1.00000
[10:04:27] server2.place6:/var/log/ceph#

We first had about 160 pgs stuck in creating+activating.
After restarting all osds in the ssd class one by one, the state shifted to 100 activating and 60 creating+activating:

[10:00:18] server2.place6:/var/log/ceph# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_ERR
            1803200/13770981 objects misplaced (13.094%)
            Reduced data availability: 175 pgs inactive
            Degraded data redundancy: 857547/13770981 objects degraded (6.227%), 197 pgs degraded, 123 pgs undersized
            39 slow requests are blocked > 32 sec
            40 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum black1,black2,black3
    mgr: black3(active), standbys: black2, black1
    osd: 27 osds: 27 up, 27 in; 156 remapped pgs

  data:
    pools:   2 pools, 1024 pgs
    objects: 4482k objects, 17725 GB
    usage:   55542 GB used, 83188 GB / 135 TB avail
    pgs:     17.090% pgs not active
             857547/13770981 objects degraded (6.227%)
             1803200/13770981 objects misplaced (13.094%)
             640 active+clean
             105 active+undersized+degraded+remapped+backfill_wait
             100 activating
             60  creating+activating
             50  active+recovery_wait+degraded
             21  active+remapped+backfill_wait
             16  active+recovery_wait+undersized+degraded+remapped
             15  activating+degraded
             9   active+recovery_wait+degraded+remapped
             3   active+recovery_wait+remapped
             3   active+recovery_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
    recovery: 1694 kB/s, 0 objects/s

I looked into the archives, but did not find anything directly related to our situation. We are using ceph 12.2.4.
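
As a rough sanity check on the numbers above, the following minimal sketch estimates how many pg replicas each ssd osd has to hold on average. The replica count of 3 and the helper name pg_replicas_per_osd are assumptions made for illustration only, and the result is an average that ignores how the crush rule actually distributes pgs across hosts. Any per-osd pg limit (for example mon_max_pg_per_osd) should be read from the cluster's own configuration rather than taken from this sketch.

```python
def pg_replicas_per_osd(pg_num: int, replica_size: int, osd_count: int) -> float:
    """Average number of pg replicas each osd must host (hypothetical helper)."""
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    return pg_num * replica_size / osd_count


# 512 pgs and the osd counts come from the post; replica_size=3 is an assumption.
planned = pg_replicas_per_osd(pg_num=512, replica_size=3, osd_count=5)
actual = pg_replicas_per_osd(pg_num=512, replica_size=3, osd_count=4)

print(f"planned (5 osds): {planned:.1f} pg replicas per osd")
print(f"actual  (4 osds): {actual:.1f} pg replicas per osd")
```

Comparing the second figure with the configured per-osd limit may help to rule a pg-count problem in or out.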
An excerpt from our ceph health detail looks like this:

HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981 objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
PG_AVAILABILITY Reduced data availability: 175 pgs inactive
    pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
    pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
    pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
    pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
    pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
    pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
    pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]

where pool 7 is the ssd pool.
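
To get an overview of such an excerpt, the stuck pg lines can be tallied by state and by osd. The following is a minimal sketch that assumes the line format shown above; the regular expression and the function name summarize_stuck_pgs are illustrative and not part of any ceph tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   pg 7.118 is stuck inactive for 183000.110669, current state
#   creating+activating, last acting [12,23,25]
PG_LINE = re.compile(
    r"pg (?P<pgid>\S+) is stuck inactive for (?P<secs>[\d.]+), "
    r"current state (?P<state>[\w+]+), last acting \[(?P<acting>[\d,]*)\]"
)


def summarize_stuck_pgs(text):
    """Return (Counter of pg states, Counter of osd ids in acting sets)."""
    states = Counter()
    osds = Counter()
    for match in PG_LINE.finditer(text):
        states[match["state"]] += 1
        for osd in match["acting"].split(","):
            if osd:
                osds[int(osd)] += 1
    return states, osds


SAMPLE = """
pg 7.118 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
pg 7.11a is stuck inactive for 38143.679989, current state activating, last acting [25,24,23]
pg 7.121 is stuck inactive for 38143.670149, current state activating, last acting [25,23,12]
pg 7.123 is stuck inactive for 37184.100764, current state activating+degraded, last acting [25,12,23]
pg 7.125 is stuck inactive for 38143.677390, current state activating, last acting [25,24,23]
pg 7.126 is stuck inactive for 38164.127082, current state activating, last acting [24,23,25]
pg 7.127 is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
pg 7.12b is stuck inactive for 183000.110669, current state creating+activating, last acting [12,23,25]
"""

states, osds = summarize_stuck_pgs(SAMPLE)
for state, count in states.most_common():
    print(f"{count:3d} {state}")
for osd, count in sorted(osds.items()):
    print(f"osd.{osd}: in {count} acting sets")
```

For the eight lines quoted here, osd.23 and osd.25 appear in every acting set, while osd.12 and osd.24 appear in only some of them. Running the same tally over the full ceph health detail output would show whether that pattern holds for all 175 inactive pgs.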
The pg query of 7.118 looks as follows: { "state": "creating+activating", "snap_trimq": "[1~5]", "snap_trimq_len": 5, "epoch": 5016, "up": [ 12, 23, 25 ], "acting": [ 12, 23, 25 ], "actingbackfill": [ "12", "23", "25" ], "info": { "pgid": "7.118", "last_update": "0'0", "last_complete": "0'0", "log_tail": "0'0", "last_user_version": 0, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 4620, "epoch_pool_created": 4620, "last_epoch_started": 0, "last_interval_started": 0, "last_epoch_clean": 0, "last_interval_clean": 0, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 4967, "same_interval_since": 4967, "same_primary_since": 4967, "last_scrub": "0'0", "last_scrub_stamp": "2018-03-15 07:18:46.197892", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892" }, "stats": { "version": "0'0", "reported_seq": "406", "reported_epoch": "5016", "state": "creating+activating", "last_fresh": "2018-03-17 10:12:58.380048", "last_change": "2018-03-17 10:10:24.335405", "last_active": "2018-03-15 07:18:46.197892", "last_peered": "2018-03-15 07:18:46.197892", "last_clean": "2018-03-15 07:18:46.197892", "last_became_active": "0.000000", "last_became_peered": "0.000000", "last_unstale": "2018-03-17 10:12:58.380048", "last_undegraded": "2018-03-17 10:12:58.380048", "last_fullsized": "2018-03-17 10:12:58.380048", "mapping_epoch": 4967, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 4620, "last_epoch_clean": 0, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "0'0", "last_scrub_stamp": "2018-03-15 07:18:46.197892", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, 
"hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "snaptrimq_len": 5, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_whiteouts": 0, "num_read": 0, "num_read_kb": 0, "num_write": 0, "num_write_kb": 0, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [ 12, 23, 25 ], "acting": [ 12, 23, 25 ], "blocked_by": [], "up_primary": 12, "acting_primary": 12 }, "empty": 1, "dne": 0, "incomplete": 0, "last_epoch_started": 4968, "hit_set_history": { "current_last_update": "0'0", "history": [] } }, "peer_info": [ { "peer": "23", "pgid": "7.118", "last_update": "0'0", "last_complete": "0'0", "log_tail": "0'0", "last_user_version": 0, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 0, "epoch_pool_created": 0, "last_epoch_started": 0, "last_interval_started": 0, "last_epoch_clean": 0, "last_interval_clean": 0, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 0, "same_interval_since": 0, "same_primary_since": 0, "last_scrub": "0'0", "last_scrub_stamp": "0.000000", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "0.000000", "last_clean_scrub_stamp": "0.000000" }, "stats": { "version": "0'0", "reported_seq": "0", "reported_epoch": "0", "state": "unknown", "last_fresh": "0.000000", "last_change": "0.000000", "last_active": "0.000000", 
"last_peered": "0.000000", "last_clean": "0.000000", "last_became_active": "0.000000", "last_became_peered": "0.000000", "last_unstale": "0.000000", "last_undegraded": "0.000000", "last_fullsized": "0.000000", "mapping_epoch": 0, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 0, "last_epoch_clean": 0, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "0'0", "last_scrub_stamp": "0.000000", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "0.000000", "last_clean_scrub_stamp": "0.000000", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_whiteouts": 0, "num_read": 0, "num_read_kb": 0, "num_write": 0, "num_write_kb": 0, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [], "acting": [], "blocked_by": [], "up_primary": -1, "acting_primary": -1 }, "empty": 1, "dne": 1, "incomplete": 0, "last_epoch_started": 0, "hit_set_history": { "current_last_update": "0'0", "history": [] } }, { "peer": "24", "pgid": "7.118", "last_update": "0'0", "last_complete": "0'0", "log_tail": "0'0", "last_user_version": 0, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": 
[], "history": { "epoch_created": 4620, "epoch_pool_created": 4620, "last_epoch_started": 0, "last_interval_started": 0, "last_epoch_clean": 0, "last_interval_clean": 0, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 4967, "same_interval_since": 4967, "same_primary_since": 4967, "last_scrub": "0'0", "last_scrub_stamp": "2018-03-15 07:18:46.197892", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892" }, "stats": { "version": "0'0", "reported_seq": "164", "reported_epoch": "4769", "state": "creating+remapped+peering", "last_fresh": "2018-03-16 23:49:04.258780", "last_change": "2018-03-16 23:49:03.296077", "last_active": "2018-03-15 07:18:46.197892", "last_peered": "2018-03-15 07:18:46.197892", "last_clean": "2018-03-15 07:18:46.197892", "last_became_active": "0.000000", "last_became_peered": "0.000000", "last_unstale": "2018-03-16 23:49:04.258780", "last_undegraded": "2018-03-16 23:49:04.258780", "last_fullsized": "2018-03-16 23:49:04.258780", "mapping_epoch": 4967, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 4620, "last_epoch_clean": 0, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "0'0", "last_scrub_stamp": "2018-03-15 07:18:46.197892", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2018-03-15 07:18:46.197892", "last_clean_scrub_stamp": "2018-03-15 07:18:46.197892", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_whiteouts": 0, "num_read": 0, "num_read_kb": 0, "num_write": 0, 
"num_write_kb": 0, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [ 12, 23, 25 ], "acting": [ 12, 23, 25 ], "blocked_by": [], "up_primary": 12, "acting_primary": 12 }, "empty": 1, "dne": 0, "incomplete": 0, "last_epoch_started": 4769, "hit_set_history": { "current_last_update": "0'0", "history": [] } }, { "peer": "25", "pgid": "7.118", "last_update": "0'0", "last_complete": "0'0", "log_tail": "0'0", "last_user_version": 0, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 0, "epoch_pool_created": 0, "last_epoch_started": 0, "last_interval_started": 0, "last_epoch_clean": 0, "last_interval_clean": 0, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 0, "same_interval_since": 0, "same_primary_since": 0, "last_scrub": "0'0", "last_scrub_stamp": "0.000000", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "0.000000", "last_clean_scrub_stamp": "0.000000" }, "stats": { "version": "0'0", "reported_seq": "0", "reported_epoch": "0", "state": "unknown", "last_fresh": "0.000000", "last_change": "0.000000", "last_active": "0.000000", "last_peered": "0.000000", "last_clean": "0.000000", "last_became_active": "0.000000", "last_became_peered": "0.000000", "last_unstale": "0.000000", "last_undegraded": "0.000000", "last_fullsized": "0.000000", "mapping_epoch": 0, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 0, "last_epoch_clean": 0, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "0'0", "last_scrub_stamp": "0.000000", "last_deep_scrub": "0'0", 
"last_deep_scrub_stamp": "0.000000", "last_clean_scrub_stamp": "0.000000", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 0, "num_objects": 0, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_whiteouts": 0, "num_read": 0, "num_read_kb": 0, "num_write": 0, "num_write_kb": 0, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 0, "num_bytes_recovered": 0, "num_keys_recovered": 0, "num_objects_omap": 0, "num_objects_hit_set_archive": 0, "num_bytes_hit_set_archive": 0, "num_flush": 0, "num_flush_kb": 0, "num_evict": 0, "num_evict_kb": 0, "num_promote": 0, "num_flush_mode_high": 0, "num_flush_mode_low": 0, "num_evict_mode_some": 0, "num_evict_mode_full": 0, "num_objects_pinned": 0, "num_legacy_snapsets": 0 }, "up": [], "acting": [], "blocked_by": [], "up_primary": -1, "acting_primary": -1 }, "empty": 1, "dne": 1, "incomplete": 0, "last_epoch_started": 0, "hit_set_history": { "current_last_update": "0'0", "history": [] } } ], "recovery_state": [ { "name": "Started/Primary/Active", "enter_time": "2018-03-17 10:10:24.335124", "might_have_unfound": [ { "osd": "24", "status": "not queried" } ], "recovery_progress": { "backfill_targets": [], "waiting_on_backfill": [], "last_backfill_started": "MIN", "backfill_info": { "begin": "MIN", "end": "MIN", "objects": [] }, "peer_backfill_info": [], "backfills_in_flight": [], "recovering": [], "pg_backend": { "pull_from_peer": [], "pushing": [] } }, "scrub": { "scrubber.epoch_start": "0", "scrubber.active": false, "scrubber.state": "INACTIVE", "scrubber.start": "MIN", "scrubber.end": "MIN", 
"scrubber.subset_last_update": "0'0", "scrubber.deep": false, "scrubber.seed": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": [] } }, { "name": "Started", "enter_time": "2018-03-17 10:10:23.373097" } ], "agent_state": {} }

If anyone has a hint on why it is stuck in creation, it would be very much appreciated.

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch