Hi,
I am seeing a weird phenomenon that I am having trouble debugging. We
have 16 OSDs per host, so when I reboot one node, 16 OSDs will be
missing for a short time. Since our minimum CRUSH failure domain is
host, this should not cause any problems. Unfortunately, I always have a
handful (1-5) of PGs that become inactive nonetheless and are stuck in
the state undersized+degraded+peered until the host and its OSDs are
back up. The other 2000+ PGs that are also on these OSDs do not have
this problem. In total, we have between 110 and 150 PGs per OSD with a
configured maximum of 250, which should give us enough headroom.
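For reference, those numbers were taken from something like the
following (the config command assumes a release with the centralized
config database, which we run; the exact invocation may vary):

# PG count per OSD is shown in the PGS column
ceph osd df tree
# configured per-OSD PG limit
ceph config get mon mon_max_pg_per_osd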
The affected pools always seem to be RBD pools, or at least I haven't
seen it on our much larger RGW pool yet. The pool's CRUSH rule looks
like this:
rule rbd-data {
    id 8
    type replicated
    min_size 2
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
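For completeness, the pool's own replication settings (which are
separate from the rule's min_size/max_size above) can be dumped like
this; the pool name here is assumed to match the rule name:

ceph osd pool get rbd-data size
ceph osd pool get rbd-data min_size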
ceph pg dump_stuck inactive gives me this:
PG_STAT  STATE                       UP          UP_PRIMARY  ACTING      ACTING_PRIMARY
115.3    undersized+degraded+peered  [194,267]   194         [194,267]   194
115.13   undersized+degraded+peered  [151,1122]  151         [151,1122]  151
116.12   undersized+degraded+peered  [288,726]   288         [288,726]   288
and when I query one of the inactive PGs (e.g., ceph pg 116.12 query), I see (among other things):
"up": [
288,
726
],
"acting": [
288,
726
],
"acting_recovery_backfill": [
"288",
"726"
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2021-03-10T16:23:09.301174+0100",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
}
},
{
"name": "Started",
"enter_time": "2021-03-10T16:23:08.297622+0100"
}
],
So you can see that two out of the three OSDs, which are on other hosts,
are indeed up and acting. I also see the ceph-osd daemons running on
those hosts, so the data is definitely there and the PG should be available.
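(By "running" I mean checks roughly like the following, with 288 as an
example OSD id; the systemd unit name assumes a package-based
deployment and may differ for yours:)

# on the OSD's host
systemctl status ceph-osd@288
# from the cluster's point of view, list anything that is actually down
ceph osd tree down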
Do you have any idea why these PGs may become inactive nonetheless?
I suspect some kind of concurrency limit, but I don't know which one
it could be.
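In case it helps to narrow this down, the kinds of things I have been
looking at so far are roughly the following (I am not sure these are
even the right knobs; the first command has to be run on the host of
one of the acting OSDs):

# running config of one of the acting OSDs, via its admin socket
ceph daemon osd.288 config show | grep -E 'max_pg_per_osd|backfill|recovery'
# check whether any OSD is reported as blocking peering
ceph osd blocked-by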
Thanks
Janek