I set nobackfill and here is the output of the query for one of the incomplete pgs:

$ ceph pg 1.10e query
{
    "state": "remapped",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 465256,
    "up": [
        52,
        23,
        20
    ],
    "acting": [
        20
    ],
    "info": {
        "pgid": "1.10e",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [],
        "history": {
            "epoch_created": 22654,
            "epoch_pool_created": 22654,
            "last_epoch_started": 447973,
            "last_interval_started": 447972,
            "last_epoch_clean": 438832,
            "last_interval_clean": 438831,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 465120,
            "same_interval_since": 465256,
            "same_primary_since": 465256,
            "last_scrub": "438490'293946",
            "last_scrub_stamp": "2018-06-12 00:10:55.825562",
            "last_deep_scrub": "427203'293886",
            "last_deep_scrub_stamp": "2018-06-07 01:46:27.403211",
            "last_clean_scrub_stamp": "2018-06-12 00:10:55.825562"
        },
        "stats": {
            "version": "0'0",
            "reported_seq": "8479",
            "reported_epoch": "465256",
            "state": "remapped+peering",
            "last_fresh": "2018-06-14 11:38:54.482624",
            "last_change": "2018-06-14 11:38:54.471206",
            "last_active": "0.000000",
            "last_peered": "0.000000",
            "last_clean": "0.000000",
            "last_became_active": "0.000000",
            "last_became_peered": "0.000000",
            "last_unstale": "2018-06-14 11:38:54.482624",
            "last_undegraded": "2018-06-14 11:38:54.482624",
            "last_fullsized": "2018-06-14 11:38:54.482624",
            "mapping_epoch": 465256,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 22654,
            "last_epoch_clean": 438832,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "438490'293946",
            "last_scrub_stamp": "2018-06-12 00:10:55.825562",
            "last_deep_scrub": "427203'293886",
            "last_deep_scrub_stamp": "2018-06-07 01:46:27.403211",
            "last_clean_scrub_stamp": "2018-06-12 00:10:55.825562",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 0,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0
            },
            "up": [
                52,
                23,
                20
            ],
            "acting": [
                20
            ],
            "blocked_by": [],
            "up_primary": 52,
            "acting_primary": 20
        },
        "empty": 1,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 0,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/WaitActingChange",
            "enter_time": "2018-06-14 11:38:54.482696",
            "comment": "waiting for pg acting set to change"
        },
        {
            "name": "Started",
            "enter_time": "2018-06-14 11:38:54.471136"
        }
    ],
    "agent_state": {}
}
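For anyone reproducing this capture, a minimal sketch of the sequence looks like the following (standard Luminous ceph CLI only; pg 1.10e is the example from this thread, and jq is just an optional formatter):

$ ceph osd set nobackfill                      # pause backfill so peering state holds still
$ ceph health detail                           # list the inactive/incomplete pgs
$ ceph pg 1.10e query                          # full peering state for one pg
$ ceph pg 1.10e query | jq '.recovery_state'   # just the stuck state machine
$ ceph osd blocked-by                          # which osds are blocking peering
$ ceph osd unset nobackfill                    # re-enable backfill when done debugging
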
On Thu, Jun 14, 2018 at 11:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 14 Jun 2018, Wyllys Ingersoll wrote:
>> Ceph Luminous 12.2.5 with filestore OSDs
>>
>> I have a cluster that had a bunch of disks removed due to failures and
>> hardware problems. At this point, after a few days of rebalancing and
>> attempting to get healthy, it still has 16 incomplete pgs that I cannot
>> seem to get fixed.
>
> Rebalancing generally won't help peering; it's often easiest to tell
> what's going on if you temporarily set nobackfill and just focus on
> getting all of the PGs peered and/or active.
>
>> I've tried moving some of the pgs to other osds using
>> ceph-objectstore-tool. I've restarted some of the osds. I've tried all
>> of the tricks I could find online for clearing these issues, but they
>> persist.
>>
>> One problem appears to be that a lot of the osds are stuck or blocked
>> waiting for osds that no longer exist in the crush map. 'ceph osd
>> blocked-by' shows many osds that are not in the cluster anymore. Is
>> there any way to force the osds that are stuck waiting for non-existent
>> osds to move on and drop them from their list? Even restarting them
>> does not fix the issue. Is it a bug that osds are blocking on
>> non-existent osds?
>>
>> OBJECT_MISPLACED 610200/41085645 objects misplaced (1.485%)
>> PG_AVAILABILITY Reduced data availability: 16 pgs inactive, 3 pgs peering, 13 pgs incomplete
>
> The incomplete or peering PGs are the ones to focus on. Can you attach
> the result of a 'ceph tell <pgid> query'?
>
> sage
>
>>     pg 1.10e is stuck peering since forever, current state peering, last acting [52,23,20]
>>     pg 1.12a is incomplete, acting [27,63,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.20b is incomplete, acting [84,59,18] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.24f is incomplete, acting [13,23,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.25c is incomplete, acting [23,52,60] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2bd is incomplete, acting [59,53,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2e4 is incomplete, acting [67,22,6] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2fd is stuck peering since forever, current state peering, last acting [79,53,58]
>>     pg 1.390 is incomplete, acting [81,18,2] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.482 is incomplete, acting [1,53,90] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.504 is incomplete, acting [59,96,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.688 is incomplete, acting [36,53,49] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.6dd is incomplete, acting [47,56,12] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.703 is incomplete, acting [47,2,51] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.7a2 is stuck peering since forever, current state peering, last acting [18,82,3]
>>     pg 1.7b4 is incomplete, acting [92,49,96] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>> PG_DEGRADED Degraded data redundancy: 13439/41085645 objects degraded (0.033%), 2 pgs degraded, 2 pgs undersized
>>     pg 1.74 is stuck undersized for 620.126459, current state active+undersized+degraded+remapped+backfill_wait, last acting [17,6]
>>     pg 1.527 is stuck undersized for 712.173611, current state active+undersized+degraded+remapped+backfill_wait, last acting [63,86]
>> REQUEST_SLOW 2 slow requests are blocked > 32 sec
>>     2 ops are blocked > 2097.15 sec
>>     osd.18 has blocked requests > 2097.15 sec
>> REQUEST_STUCK 63 stuck requests are blocked > 4096 sec
>>     22 ops are blocked > 134218 sec
>>     2 ops are blocked > 67108.9 sec
>>     28 ops are blocked > 8388.61 sec
>>     11 ops are blocked > 4194.3 sec
>>     osds 23,92 have stuck requests > 4194.3 sec
>>     osds 59,81 have stuck requests > 8388.61 sec
>>     osd.13 has stuck requests > 67108.9 sec
>>     osds 1,36,47,67,84 have stuck requests > 134218 sec
>>
>> Any help would be much appreciated.
>>
>> Wyllys Ingersoll
>> Keeper Technology, LLC
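
A postscript for readers who land here with the same "blocked by non-existent osds" symptom: the usual Luminous escalation path looks roughly like the sketch below. It is only an outline, not a recommendation from this thread: the osd ids and paths are illustrative placeholders, 'ceph osd lost' is irreversible and can acknowledge data loss, and ceph-objectstore-tool must only be run against stopped OSD daemons (filestore setups with a separate journal also need --journal-path).

# Declare a removed osd permanently gone so peering stops waiting on it
# (irreversible; osd.13 is a placeholder):
$ ceph osd lost 13 --yes-i-really-mean-it

# Copy a surviving replica of an incomplete pg into the current acting set
# (osd ids 20/52 and the pg id are placeholders from this thread):
$ systemctl stop ceph-osd@20
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 \
      --pgid 1.10e --op export --file /tmp/pg.1.10e.export
$ systemctl stop ceph-osd@52
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-52 \
      --op import --file /tmp/pg.1.10e.export
$ systemctl start ceph-osd@20 ceph-osd@52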