Hi,
thanks for your explanation, Josh. I think I understand now how
mon_max_pg_per_osd could have an impact here. The default seems to be
250. Each OSD currently has around 100 PGs; is this a potential
bottleneck? In my test cluster I have around 150 PGs per OSD and
couldn't reproduce it, although I have different CRUSH rules in place.
I'll add the rule in question at the bottom; do you see a potential
issue there?
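
For reference, the per-OSD PG count is visible in the PGS column of:

  # shows usage, weight and PG count per OSD, laid out along the CRUSH tree
  ceph osd df tree
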
If I increase mon_max_pg_per_osd temporarily to, let's say, 500, would
this decrease the risk? And if I drain the OSDs before purging and
rebuilding them, the same thing shouldn't happen again when the new
OSDs join the cluster, right?
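
If bumping the limit is a reasonable workaround, this is what I had in
mind (assuming 'ceph config set' is the right way on this release, and
that the override gets removed again afterwards):

  # temporarily raise the per-OSD PG limit cluster-wide at runtime
  ceph config set global mon_max_pg_per_osd 500
  # ... and drop the override again once the OSDs are rebuilt
  ceph config rm global mon_max_pg_per_osd

And for PGs that still end up stuck in activating, I assume something
like this would re-trigger peering without bouncing OSDs:

  # list PGs currently stuck activating
  ceph pg dump pgs_brief | grep activating
  # re-trigger peering for a single stuck PG (<pgid> is a placeholder)
  ceph pg repeer <pgid>
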
Thanks,
Eugen
{
    "rule_id": 1,
    "rule_name": "rule-ec-k7m11",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 18,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 9,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
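
(For context, that is the JSON produced by:

  ceph osd crush rule dump rule-ec-k7m11

If I read the rule correctly, it picks 2 datacenters and then 9 hosts
in each of them, i.e. 2 x 9 = 18 OSDs, which matches the k=7/m=11
profile and the max_size of 18.)
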
Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:
On Thu, Apr 7, 2022 at 12:15 AM Eugen Block <eblock@xxxxxx> wrote:
Basically, these are the steps to remove all OSDs from that host (OSDs
are not "replaced" so they aren't marked "destroyed") [1]:
1) Call 'ceph osd out $id'
2) Call systemctl stop ceph-osd@$id
3) ceph osd purge $id --yes-i-really-mean-it
Ah, this purge step would still be enough to potentially cause the problem.
To be clear, then, this is the sequence I'm proposing could have been
the problem:
1. All OSDs on the host are purged per above.
2. New OSDs are created.
3. As they come up, one by one, CRUSH starts to assign PGs to them.
Importantly, when the first OSD comes up, it gets a large number of
PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't
activate.
4. As each of the remaining OSDs comes up, CRUSH re-assigns some PGs to them.
5. Finally, all OSDs are up. However, any PGs that were stuck in
"activating" from step 3 that were _not_ reassigned to other OSDs are
still stuck in "activating", and need a repeer or OSD down/up cycle to
restart peering for them. (At least in Pacific, tweaking
mon_max_pg_per_osd also allows some of these PGs to make peering
progress.)
This assumes that the CRUSH rule in question leads to this sort of
behaviour. I would expect this more from a host-centric CRUSH rule
than from a rack-centric one, for example. Also note that it's a
matter of "luck" as to whether you'll actually see a problem as of
step 5, since if it so happens that all of the PGs stuck in
"activating" in step 3 get assigned, the final state of the cluster
will be fine (i.e. no inactive PGs).
Josh