[Ceph incident] PG stuck in peering.

"HARROUIN Loan (PRESTATAIRE CA-GIP)" <loan.harrouin-prestataire@xxxxxxxxx> · Mon, 16 Sep 2024 17:33:03 +0000

Hello dear ceph community,

This weekend we ran into a strange issue with a PG (13.6a) that is stuck in peering. As a result, a lot of ops are stuck as well.
We are running Ceph Pacific 16.2.10 on SSD-only OSDs, with erasure coding.

  cluster:
    id:     f5c69b4a-89e0-4055-95f7-eddc6800d4fe
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive, 1 pg peering
            256 slow ops, oldest one blocked for 5274 sec, osd.20 has slow ops
  services:
    mon: 3 daemons, quorum cos1-dal-ceph-mon-01,cos1-dal-ceph-mon-02,cos1-dal-ceph-mon-03 (age 17h)
    mgr: cos1-dal-ceph-mon-02(active, since 17h), standbys: cos1-dal-ceph-mon-03, cos1-dal-ceph-mon-01
    osd: 647 osds: 646 up (since 27m), 643 in (since 2h)
  data:
    pools:   7 pools, 1921 pgs
    objects: 432.65M objects, 1.6 PiB
    usage:   2.4 PiB used, 2.0 PiB / 4.4 PiB avail
    pgs:     0.052% pgs not active
             1916 active+clean
             2    active+clean+scrubbing
             2    active+clean+scrubbing+deep
             1    peering
The ‘ceph pg 13.6a query’ command hangs, so we have to restart one of the OSDs that are part of this PG to temporarily unblock the query (for a few seconds after the restart, the PG is not yet peering). In that case, the query only returns information about the shard hosted on the OSD that we restarted.
The result of the query is attached (shard 0).
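
Since the query can hang indefinitely, one option is to wrap it in a timeout so that a script does not block forever. The following is only a minimal Python sketch: the helper name `run_json_command` and the timeout values are assumptions made for illustration, and it relies on the Python standard library only, not on any Ceph-specific option.

```python
import json
import subprocess


def run_json_command(cmd: list, timeout_s: float = 30.0):
    """Run `cmd`, parse its stdout as JSON, and return None if it exceeds `timeout_s`."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        return None
    return json.loads(result.stdout)


# Hypothetical usage on a node with the ceph CLI and an admin keyring:
# info = run_json_command(["ceph", "pg", "13.6a", "query"], timeout_s=20)
```

A `None` result here only means the command did not answer in time; it says nothing about why the PG is not responding.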

When the issue first occurred, we checked the logs and restarted all the OSDs linked to this PG.
Sadly, it didn’t fix anything. We tried to investigate the peering state to understand what was going on on the primary OSD. We put the OSD in debug mode, but at first glance nothing seemed strange (we are not used to diving that deep into Ceph).

We found that CERN faced something similar a long time ago: https://indico.cern.ch/event/617118/contributions/2490930/attachments/1422793/2181063/ceph_hep_stuck_pg.pdf
After reading it, we tried the empty-OSD method that they used (slide 7). We identified that shard 0 seemed to be in a weird state (and was the primary), so it was our candidate. We wiped OSDs 11, 148 and 280 (one by one, waiting for peering to complete each time, of course, to avoid data loss on other PGs).
After that, OSD.20 was elected as the primary, but the PG still remains hung in peering, and now all ops are stuck on OSD.20.

We are now in the dark. We plan to dig deeper into the logs of this new OSD.20 and to see whether we can upgrade our Ceph cluster to the most recent version.
Any help or suggestion is welcome 😊

ceph pg 13.6a query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "down",
    "epoch": 794998,
    "up": [
        20,
        253,
        254,
        84,
        2147483647,
        56
    ],
    "acting": [
        20,
        253,
        254,
        84,
        2147483647,
        56
    ],
    "info": {
        "pgid": "13.6as0",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "purged_snaps": [],
        "history": {
            "epoch_created": 34867,
            "epoch_pool_created": 4002,
            "last_epoch_started": 792996,
            "last_interval_started": 792995,
            "last_epoch_clean": 791031,
            "last_interval_clean": 791030,
            "last_epoch_split": 90048,
            "last_epoch_marked_full": 0,
            "same_up_since": 794998,
            "same_interval_since": 794998,
            "same_primary_since": 794896,
            "last_scrub": "791841'1303832763",
            "last_scrub_stamp": "2024-09-14T15:47:43.149821+0200",
            "last_deep_scrub": "781724'1296647995",
            "last_deep_scrub_stamp": "2024-09-09T03:41:56.778209+0200",
            "last_clean_scrub_stamp": "2024-09-14T15:47:43.149821+0200",
            "prior_readable_until_ub": 0
        },
        "stats": {
            "version": "0'0",
            "reported_seq": 125,
            "reported_epoch": 794998,
            "state": "down",
            "last_fresh": "2024-09-16T17:48:13.572421+0200",
            "last_change": "2024-09-16T17:48:13.572421+0200",
            "last_active": "0.000000",
            "last_peered": "0.000000",
            "last_clean": "0.000000",
            "last_became_active": "0.000000",
            "last_became_peered": "0.000000",
            "last_unstale": "2024-09-16T17:48:13.572421+0200",
            "last_undegraded": "2024-09-16T17:48:13.572421+0200",
            "last_fullsized": "2024-09-16T17:48:13.572421+0200",
            "mapping_epoch": 794998,
            "log_start": "0'0",
            "ondisk_log_start": "0'0",
            "created": 34867,
            "last_epoch_clean": 791031,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "791841'1303832763",
            "last_scrub_stamp": "2024-09-14T15:47:43.149821+0200",
            "last_deep_scrub": "781724'1296647995",
            "last_deep_scrub_stamp": "2024-09-09T03:41:56.778209+0200",
            "last_clean_scrub_stamp": "2024-09-14T15:47:43.149821+0200",
            "log_size": 0,
            "ondisk_log_size": 0,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 0,
                "num_objects": 0,
                "num_object_clones": 0,
                "num_object_copies": 0,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 0,
                "num_whiteouts": 0,
                "num_read": 0,
                "num_read_kb": 0,
                "num_write": 0,
                "num_write_kb": 0,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 0,
                "num_bytes_recovered": 0,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [
                20,
                253,
                254,
                84,
                2147483647,
                56
            ],
            "acting": [
                20,
                253,
                254,
                84,
                2147483647,
                56
            ],
            "avail_no_missing": [],
            "object_location_counts": [],
            "blocked_by": [
                500
            ],
            "up_primary": 20,
            "acting_primary": 20,
            "purged_snaps": []
        },
        "empty": 1,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 0,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Down",
            "enter_time": "2024-09-16T17:48:13.572414+0200",
            "comment": "not enough up instances of this PG to go active"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2024-09-16T17:48:13.572332+0200",
            "past_intervals": [
                {
                    "first": "791030",
                    "last": "794997",
                    "all_participants": [
                        {
                            "osd": 11,
                            "shard": 0
                        },
                        {
                            "osd": 20,
                            "shard": 0
                        },
                        {
                            "osd": 56,
                            "shard": 5
                        },
                        {
                            "osd": 84,
                            "shard": 3
                        },
                        {
                            "osd": 148,
                            "shard": 0
                        },
                        {
                            "osd": 253,
                            "shard": 1
                        },
                        {
                            "osd": 254,
                            "shard": 2
                        },
                        {
                            "osd": 280,
                            "shard": 0
                        },
                        {
                            "osd": 500,
                            "shard": 4
                        }
                    ],
                    "intervals": [
                        {
                            "first": "792995",
                            "last": "792997",
                            "acting": "11(0),56(5),253(1),254(2),500(4)"
                        },
                        {
                            "first": "793671",
                            "last": "793673",
                            "acting": "56(5),84(3),254(2),500(4)"
                        },
                        {
                            "first": "793752",
                            "last": "793754",
                            "acting": "11(0),84(3),253(1),254(2),500(4)"
                        },
                        {
                            "first": "793870",
                            "last": "793874",
                            "acting": "56(5),253(1),254(2),280(0),500(4)"
                        },
                        {
                            "first": "793884",
                            "last": "793887",
                            "acting": "56(5),84(3),253(1),280(0),500(4)"
                        },
                        {
                            "first": "793941",
                            "last": "793944",
                            "acting": "84(3),253(1),254(2),280(0),500(4)"
                        },
                        {
                            "first": "794645",
                            "last": "794649",
                            "acting": "84(3),148(0),253(1),254(2),500(4)"
                        },
                        {
                            "first": "794659",
                            "last": "794662",
                            "acting": "56(5),84(3),148(0),253(1),254(2)"
                        },
                        {
                            "first": "794852",
                            "last": "794858",
                            "acting": "56(5),84(3),253(1),254(2),280(0)"
                        },
                        {
                            "first": "794872",
                            "last": "794886",
                            "acting": "56(5),84(3),253(1),254(2),500(4)"
                        },
                        {
                            "first": "794893",
                            "last": "794895",
                            "acting": "56(5),84(3),253(1),254(2),280(0),500(4)"
                        },
                        {
                            "first": "794898",
                            "last": "794901",
                            "acting": "20(0),56(5),84(3),254(2),500(4)"
                        },
                        {
                            "first": "794909",
                            "last": "794912",
                            "acting": "20(0),56(5),84(3),253(1),500(4)"
                        },
                        {
                            "first": "794922",
                            "last": "794929",
                            "acting": "20(0),56(5),253(1),254(2),500(4)"
                        },
                        {
                            "first": "794938",
                            "last": "794940",
                            "acting": "20(0),56(5),84(3),253(1),254(2)"
                        },
                        {
                            "first": "794952",
                            "last": "794954",
                            "acting": "20(0),84(3),253(1),254(2),500(4)"
                        },
                        {
                            "first": "794955",
                            "last": "794997",
                            "acting": "20(0),56(5),84(3),253(1),254(2),500(4)"
                        }
                    ]
                }
            ],
            "probing_osds": [
                "11(0)",
                "20(0)",
                "56(5)",
                "84(3)",
                "148(0)",
                "253(1)",
                "254(2)",
                "280(0)"
            ],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                500
            ],
            "peering_blocked_by": [
                {
                    "osd": 500,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2024-09-16T17:48:13.572282+0200"
        }
    ],
    "agent_state": {}
}
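
To read the output above more quickly, the fields that explain a blocked peering can be pulled out with a short script. This is a minimal Python sketch, not an official Ceph tool: the function names `summarize_peering` and `intervals_with_osd` are hypothetical, and it assumes the JSON layout shown in this query result. The value 2147483647 (0x7fffffff) is the placeholder Ceph prints when no OSD is mapped to a shard position.

```python
# Placeholder printed in "up"/"acting" when no OSD is mapped to that position.
NO_OSD = 2147483647


def summarize_peering(query: dict) -> dict:
    """Collect the fields of a `ceph pg <pgid> query` result that explain a blocked peering."""
    missing_shards = [
        pos for pos, osd in enumerate(query.get("acting", [])) if osd == NO_OSD
    ]
    blockers = []
    intervals = []
    for state in query.get("recovery_state", []):
        for entry in state.get("peering_blocked_by", []):
            blockers.append(entry["osd"])
        for past in state.get("past_intervals", []):
            intervals.extend(past.get("intervals", []))
    return {
        "missing_shards": missing_shards,
        "blocked_by": blockers,
        "intervals": intervals,
    }


def intervals_with_osd(intervals: list, osd: int) -> list:
    """Return (first, last) epochs of past intervals whose acting set contained `osd`."""
    hits = []
    for interval in intervals:
        # "acting" is a string such as "11(0),56(5),253(1)";
        # the number before "(" is the OSD id, the number inside is the shard.
        members = {
            int(item.split("(")[0]) for item in interval["acting"].split(",") if item
        }
        if osd in members:
            hits.append((int(interval["first"]), int(interval["last"])))
    return hits
```

Applied to the output above, `summarize_peering` would report position 4 of the acting set as unmapped and OSD 500 as the blocker, and `intervals_with_osd(summary["intervals"], 500)` would list the past intervals in which OSD 500 was part of the acting set.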