Hi,
thanks for your explanation, Josh. I think I understand now how
mon_max_pg_per_osd could have an impact here. The default seems to be
250. Each OSD currently has around 100 PGs; is this a potential
bottleneck? In my test cluster I have around 150 PGs per OSD and
couldn't reproduce it, although I have different CRUSH rules in place.
I'll add the rule in question at the bottom; do you see a potential
issue there?
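
For reference, the per-OSD PG count is visible in the PGS column of:

  # shows usage, weight and PG count per OSD, laid out along the CRUSH tree
  ceph osd df tree
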
If I increase mon_max_pg_per_osd temporarily to, let's say, 500, would
this decrease the risk? And if I drain the OSDs before purging and
rebuilding them, the same thing shouldn't happen again when the new
OSDs join the cluster, right?
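
If bumping the limit is a reasonable workaround, this is what I had in
mind (assuming 'ceph config set' is the right way on this release, and
that the override gets removed again afterwards):

  # temporarily raise the per-OSD PG limit cluster-wide at runtime
  ceph config set global mon_max_pg_per_osd 500
  # ... and drop the override again once the OSDs are rebuilt
  ceph config rm global mon_max_pg_per_osd

And for PGs that still end up stuck in activating, I assume something
like this would re-trigger peering without bouncing OSDs:

  # list PGs currently stuck activating
  ceph pg dump pgs_brief | grep activating
  # re-trigger peering for a single stuck PG (<pgid> is a placeholder)
  ceph pg repeer <pgid>
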
Thanks,
Eugen
{
    "rule_id": 1,
    "rule_name": "rule-ec-k7m11",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 18,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 9,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
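
(For context, that is the JSON produced by:

  ceph osd crush rule dump rule-ec-k7m11

If I read the rule correctly, it picks 2 datacenters and then 9 hosts
in each of them, i.e. 2 x 9 = 18 OSDs, which matches the k=7/m=11
profile and the max_size of 18.)
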
Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:
On Thu, Apr 7, 2022 at 12:15 AM Eugen Block <eblock@xxxxxx> wrote:
Basically, these are the steps to remove all OSDs from that host (OSDs
are not "replaced" so they aren't marked "destroyed") [1]:
1) Call 'ceph osd out $id'
2) Call systemctl stop ceph-osd@$id
3) ceph osd purge $id --yes-i-really-mean-it
Ah, this purge step would still be enough to potentially cause the problem.
To be clear, then, this is the sequence I'm proposing could have been
the problem:
1. All OSDs on the host are purged per above.
2. New OSDs are created.
3. As they come up, one by one, CRUSH starts to assign PGs to them.
Importantly, when the first OSD comes up, it gets a large number of
PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't
activate.
4. As each of the remaining OSDs comes up, CRUSH re-assigns some PGs to them.
5. Finally, all OSDs are up. However, any PGs that were stuck in
"activating" from step 3 that were _not_ reassigned to other OSDs are
still stuck in "activating", and need a repeer or OSD down/up cycle to
restart peering for them. (At least in Pacific, tweaking
mon_max_pg_per_osd also allows some of these PGs to make peering
progress.)
This assumes that the CRUSH rule in question leads to this sort of
behaviour. I would expect this more from a host-centric CRUSH rule
than from a rack-centric one, for example. Also note that it's a
matter of "luck" as to whether you'll actually see a problem as of
step 5, since if it so happens that all of the PGs stuck in
"activating" in step 3 get assigned, the final state of the cluster
will be fine (i.e. no inactive PGs).
Josh