On Thu, 14 Jun 2018, Wyllys Ingersoll wrote:
> Ceph Luminous 12.2.5 with filestore OSDs
>
> I have a cluster that had a bunch of disks removed due to failures and
> hardware problems. At this point, after a few days of rebalancing and
> attempting to get healthy, it still has 16 incomplete pgs that I cannot
> seem to get fixed.

Rebalancing generally won't help peering; it's often easiest to tell
what's going on if you temporarily set nobackfill and just focus on
getting all of the PGs peered and/or active.
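For example, a minimal sketch of that approach (dump_stuck is just one
way to list the non-active PGs; unset the flag once peering is sorted):

    # pause backfill data movement while sorting out peering
    ceph osd set nobackfill
    # list the PGs that are not active (peering, incomplete, ...)
    ceph pg dump_stuck inactive
    # once everything is peered/active again, re-enable backfill
    ceph osd unset nobackfill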
> I've tried moving some of the pgs to other osds using the
> ceph-objectstore-tool. I've restarted some of the osds. I've tried all
> of the tricks I could find online for clearing these issues but they
> persist.
>
> One problem appears to be that a lot of the osds are stuck or blocked
> waiting for osds that no longer exist in the crush map. 'ceph osd
> blocked-by' shows many osds that are not in the cluster anymore. Is
> there any way to force the osds that are stuck waiting for non-existent
> osds to move on and drop them from their list? Even restarting them
> does not fix the issue. Is it a bug that osds are blocking on
> non-existent osds?
>
> OBJECT_MISPLACED 610200/41085645 objects misplaced (1.485%)
> PG_AVAILABILITY Reduced data availability: 16 pgs inactive, 3 pgs
> peering, 13 pgs incomplete

The incomplete or peering PGs are the ones to focus on. Can you attach
the result of a 'ceph tell <pgid> query'? (There's an example
invocation below the quoted output.)

sage

>     pg 1.10e is stuck peering since forever, current state peering, last acting [52,23,20]
>     pg 1.12a is incomplete, acting [27,63,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.20b is incomplete, acting [84,59,18] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.24f is incomplete, acting [13,23,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.25c is incomplete, acting [23,52,60] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.2bd is incomplete, acting [59,53,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.2e4 is incomplete, acting [67,22,6] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.2fd is stuck peering since forever, current state peering, last acting [79,53,58]
>     pg 1.390 is incomplete, acting [81,18,2] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.482 is incomplete, acting [1,53,90] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.504 is incomplete, acting [59,96,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.688 is incomplete, acting [36,53,49] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.6dd is incomplete, acting [47,56,12] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.703 is incomplete, acting [47,2,51] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>     pg 1.7a2 is stuck peering since forever, current state peering, last acting [18,82,3]
>     pg 1.7b4 is incomplete, acting [92,49,96] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
> PG_DEGRADED Degraded data redundancy: 13439/41085645 objects degraded (0.033%), 2 pgs degraded, 2 pgs undersized
>     pg 1.74 is stuck undersized for 620.126459, current state active+undersized+degraded+remapped+backfill_wait, last acting [17,6]
>     pg 1.527 is stuck undersized for 712.173611, current state active+undersized+degraded+remapped+backfill_wait, last acting [63,86]
> REQUEST_SLOW 2 slow requests are blocked > 32 sec
>     2 ops are blocked > 2097.15 sec
>     osd.18 has blocked requests > 2097.15 sec
> REQUEST_STUCK 63 stuck requests are blocked > 4096 sec
>     22 ops are blocked > 134218 sec
>     2 ops are blocked > 67108.9 sec
>     28 ops are blocked > 8388.61 sec
>     11 ops are blocked > 4194.3 sec
>     osds 23,92 have stuck requests > 4194.3 sec
>     osds 59,81 have stuck requests > 8388.61 sec
>     osd.13 has stuck requests > 67108.9 sec
>     osds 1,36,47,67,84 have stuck requests > 134218 sec
>
> Any help would be much appreciated.
>
> Wyllys Ingersoll
> Keeper Technology, LLC
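For example, for one of the incomplete PGs from the listing above (a
sketch using the 'ceph pg <pgid> query' form of the command; the output
file name is just a placeholder):

    # query one of the incomplete PGs and capture the output to attach
    ceph pg 1.12a query > pg-1.12a-query.txt

The recovery_state section of that output should show which OSDs the PG
is still probing or blocked on, including ones that have since been
removed, which should help explain the blocked-by entries for
non-existent OSDs.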