I set nobackfill and here is the output of the query for one of the incomplete pgs:

$ ceph pg 1.10e query
{
    "state": "remapped",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 465256,
    "up": [
        52,
        23,
        20
    ],
    "acting": [
        20
    ],
    "info": {
        "pgid": "1.10e",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [],
        "history": {
            "epoch_created": 22654,
            "epoch_pool_created": 22654,
            "last_epoch_started": 447973,
            "last_interval_started": 447972,
            "last_epoch_clean": 438832,
            "last_interval_clean": 438831,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 465120,
            "same_interval_since": 465256,
            "same_primary_since": 465256,
            "last_scrub": "438490'293946",
            "last_scrub_stamp": "2018-06-12 00:10:55.825562",
            "last_deep_scrub": "427203'293886",
            "last_deep_scrub_stamp": "2018-06-07 01:46:27.403211",
            "last_clean_scrub_stamp": "2018-06-12 00:10:55.825562"
        },
        "stats": {
            "version": "0'0",
            "reported_seq": "8479",
            "reported_epoch": "465256",
            "state": "remapped+peering",
            "last_fresh": "2018-06-14 11:38:54.482624",
            "last_change": "2018-06-14 11:38:54.471206",
            "last_active": "0.000000",
            "last_peered": "0.000000",
            "last_clean": "0.000000",
            "last_became_active": "0.000000",
            "last_became_peered": "0.000000",
            "last_unstale": "2018-06-14 11:38:54.482624",
            "last_undegraded": "2018-06-14 11:38:54.482624",
            "last_fullsized": "2018-06-14 11:38:54.482624",
            "mapping_epoch": 465256,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 22654,
            "last_epoch_clean": 438832,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "438490'293946",
            "last_scrub_stamp": "2018-06-12 00:10:55.825562",
            "last_deep_scrub": "427203'293886",
            "last_deep_scrub_stamp": "2018-06-07 01:46:27.403211",
            "last_clean_scrub_stamp": "2018-06-12 00:10:55.825562",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 0,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0
            },
            "up": [
                52,
                23,
                20
            ],
            "acting": [
                20
            ],
            "blocked_by": [],
            "up_primary": 52,
            "acting_primary": 20
        },
        "empty": 1,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 0,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/WaitActingChange",
            "enter_time": "2018-06-14 11:38:54.482696",
            "comment": "waiting for pg acting set to change"
        },
        {
            "name": "Started",
            "enter_time": "2018-06-14 11:38:54.471136"
        }
    ],
    "agent_state": {}
}
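For anyone reproducing this capture, a minimal sketch of the sequence looks like the following (standard Luminous ceph CLI only; pg 1.10e is the example from this thread, and jq is just an optional formatter):

$ ceph osd set nobackfill                      # pause backfill so peering state holds still
$ ceph health detail                           # list the inactive/incomplete pgs
$ ceph pg 1.10e query                          # full peering state for one pg
$ ceph pg 1.10e query | jq '.recovery_state'   # just the stuck state machine
$ ceph osd blocked-by                          # which osds are blocking peering
$ ceph osd unset nobackfill                    # re-enable backfill when done debugging
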
On Thu, Jun 14, 2018 at 11:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 14 Jun 2018, Wyllys Ingersoll wrote:
>> Ceph Luminous 12.2.5 with filestore OSDs
>>
>> I have a cluster that had a bunch of disks removed due to failures and
>> hardware problems. At this point, after a few days of rebalancing and
>> attempting to get healthy, it still has 16 incomplete pgs that I cannot
>> seem to get fixed.
>
> Rebalancing generally won't help peering; it's often easiest to tell
> what's going on if you temporarily set nobackfill and just focus on
> getting all of the PGs peered and/or active.
>
>> I've tried moving some of the pgs to other osds using
>> ceph-objectstore-tool. I've restarted some of the osds. I've tried all
>> of the tricks I could find online for clearing these issues, but they
>> persist.
>>
>> One problem appears to be that a lot of the osds are stuck or blocked
>> waiting for osds that no longer exist in the crush map. 'ceph osd
>> blocked-by' shows many osds that are not in the cluster anymore. Is
>> there any way to force the osds that are stuck waiting for non-existent
>> osds to move on and drop them from their list? Even restarting them
>> does not fix the issue. Is it a bug that osds are blocking on
>> non-existent osds?
>>
>> OBJECT_MISPLACED 610200/41085645 objects misplaced (1.485%)
>> PG_AVAILABILITY Reduced data availability: 16 pgs inactive, 3 pgs peering, 13 pgs incomplete
>
> The incomplete or peering PGs are the ones to focus on. Can you attach
> the result of a 'ceph tell <pgid> query'?
>
> sage
>
>>     pg 1.10e is stuck peering since forever, current state peering, last acting [52,23,20]
>>     pg 1.12a is incomplete, acting [27,63,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.20b is incomplete, acting [84,59,18] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.24f is incomplete, acting [13,23,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.25c is incomplete, acting [23,52,60] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2bd is incomplete, acting [59,53,19] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2e4 is incomplete, acting [67,22,6] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.2fd is stuck peering since forever, current state peering, last acting [79,53,58]
>>     pg 1.390 is incomplete, acting [81,18,2] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.482 is incomplete, acting [1,53,90] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.504 is incomplete, acting [59,96,53] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.688 is incomplete, acting [36,53,49] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.6dd is incomplete, acting [47,56,12] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.703 is incomplete, acting [47,2,51] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>>     pg 1.7a2 is stuck peering since forever, current state peering, last acting [18,82,3]
>>     pg 1.7b4 is incomplete, acting [92,49,96] (reducing pool cephfs_data min_size from 2 may help; search ceph.com/docs for 'incomplete')
>> PG_DEGRADED Degraded data redundancy: 13439/41085645 objects degraded (0.033%), 2 pgs degraded, 2 pgs undersized
>>     pg 1.74 is stuck undersized for 620.126459, current state active+undersized+degraded+remapped+backfill_wait, last acting [17,6]
>>     pg 1.527 is stuck undersized for 712.173611, current state active+undersized+degraded+remapped+backfill_wait, last acting [63,86]
>> REQUEST_SLOW 2 slow requests are blocked > 32 sec
>>     2 ops are blocked > 2097.15 sec
>>     osd.18 has blocked requests > 2097.15 sec
>> REQUEST_STUCK 63 stuck requests are blocked > 4096 sec
>>     22 ops are blocked > 134218 sec
>>     2 ops are blocked > 67108.9 sec
>>     28 ops are blocked > 8388.61 sec
>>     11 ops are blocked > 4194.3 sec
>>     osds 23,92 have stuck requests > 4194.3 sec
>>     osds 59,81 have stuck requests > 8388.61 sec
>>     osd.13 has stuck requests > 67108.9 sec
>>     osds 1,36,47,67,84 have stuck requests > 134218 sec
>>
>> Any help would be much appreciated.
>>
>> Wyllys Ingersoll
>> Keeper Technology, LLC
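
A postscript for readers who land here with the same "blocked by non-existent osds" symptom: the usual Luminous escalation path looks roughly like the sketch below. It is only an outline, not a recommendation from this thread: the osd ids and paths are illustrative placeholders, 'ceph osd lost' is irreversible and can acknowledge data loss, and ceph-objectstore-tool must only be run against stopped OSD daemons (filestore setups with a separate journal also need --journal-path).

# Declare a removed osd permanently gone so peering stops waiting on it
# (irreversible; osd.13 is a placeholder):
$ ceph osd lost 13 --yes-i-really-mean-it

# Copy a surviving replica of an incomplete pg into the current acting set
# (osd ids 20/52 and the pg id are placeholders from this thread):
$ systemctl stop ceph-osd@20
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 \
      --pgid 1.10e --op export --file /tmp/pg.1.10e.export
$ systemctl stop ceph-osd@52
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-52 \
      --op import --file /tmp/pg.1.10e.export
$ systemctl start ceph-osd@20 ceph-osd@52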