Hi,
I am seeing a weird phenomenon that I am having trouble debugging. We
have 16 OSDs per host, so when I reboot one node, 16 OSDs will be
missing for a short time. Since our minimum CRUSH failure domain is
host, this should not cause any problems. Unfortunately, I always have a
handful (1-5) of PGs that become inactive nonetheless and are stuck in
the state undersized+degraded+peered until the host and its OSDs are
back up. The other 2000+ PGs that are also on these OSDs do not have
this problem. In total, we have between 110 and 150 PGs per OSD with a
configured maximum of 250, which should give us enough headroom.
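For reference, those numbers were taken from something like the
following (the config command assumes a release with the centralized
config database, which we run; the exact invocation may vary):

# PG count per OSD is shown in the PGS column
ceph osd df tree
# configured per-OSD PG limit
ceph config get mon mon_max_pg_per_osd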
The affected pools always seem to be RBD pools, or at least I haven't
seen it on our much larger RGW pool yet. The pool's CRUSH rule looks
like this:
rule rbd-data {
    id 8
    type replicated
    min_size 2
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
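For completeness, the pool's own replication settings (which are
separate from the rule's min_size/max_size above) can be dumped like
this; the pool name here is assumed to match the rule name:

ceph osd pool get rbd-data size
ceph osd pool get rbd-data min_size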
ceph pg dump_stuck inactive gives me this:
PG_STAT  STATE                       UP          UP_PRIMARY  ACTING      ACTING_PRIMARY
115.3    undersized+degraded+peered  [194,267]   194         [194,267]   194
115.13   undersized+degraded+peered  [151,1122]  151         [151,1122]  151
116.12   undersized+degraded+peered  [288,726]   288         [288,726]   288
and when I query one of the inactive PGs (e.g., ceph pg 116.12 query), I see (among other things):
"up": [
288,
726
],
"acting": [
288,
726
],
"acting_recovery_backfill": [
"288",
"726"
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2021-03-10T16:23:09.301174+0100",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
}
},
{
"name": "Started",
"enter_time": "2021-03-10T16:23:08.297622+0100"
}
],
So you can see that two out of the three OSDs, which are on other hosts,
are indeed up and acting. I also see the ceph-osd daemons running on
those hosts, so the data is definitely there and the PG should be available.
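(By "running" I mean checks roughly like the following, with 288 as an
example OSD id; the systemd unit name assumes a package-based
deployment and may differ for yours:)

# on the OSD's host
systemctl status ceph-osd@288
# from the cluster's point of view, list anything that is actually down
ceph osd tree down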
Do you have any idea why these PGs may become inactive nonetheless?
I suspect some kind of concurrency limit, but I don't know which one
it could be.
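In case it helps to narrow this down, the kinds of things I have been
looking at so far are roughly the following (I am not sure these are
even the right knobs; the first command has to be run on the host of
one of the acting OSDs):

# running config of one of the acting OSDs, via its admin socket
ceph daemon osd.288 config show | grep -E 'max_pg_per_osd|backfill|recovery'
# check whether any OSD is reported as blocking peering
ceph osd blocked-by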
Thanks
Janek