Re: Ceph PGs stuck inactive after rebuild node

Hi,

thanks for your explanation, Josh. I think I understand now how mon_max_pg_per_osd could have an impact here. The default seems to be 250. Each OSD currently holds around 100 PGs; could that be a potential bottleneck? In my test cluster I have around 150 PGs per OSD and couldn't reproduce the issue, although I have different crush rules in place there. I'll add the rule in question at the bottom; do you see a potential issue with it? If I temporarily increase mon_max_pg_per_osd to, let's say, 500, would that decrease the risk? And draining the OSDs before purging and rebuilding them doesn't mean the same thing can happen again once the new OSDs join the cluster, right?
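
For the archives, this is roughly what I would run to check the per-OSD PG count and to raise the limit temporarily (just a sketch, with the 500 from above as an example value, not yet applied to the production cluster):

# show how many PGs each OSD currently holds (PGS column)
ceph osd df tree

# current value of the limit (default 250)
ceph config get osd mon_max_pg_per_osd

# raise it temporarily, then drop the override once backfill is done
ceph config set global mon_max_pg_per_osd 500
ceph config rm global mon_max_pg_per_osd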

Thanks,
Eugen

{
    "rule_id": 1,
    "rule_name": "rule-ec-k7m11",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 18,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 9,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
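
For context, my reading of the rule: it selects 2 datacenters and then 9 hosts (one OSD each) in each of them, i.e. 2 x 9 = 18 shards, which matches k+m = 7+11 = 18 and the max_size of 18. If it helps, this is roughly how I'd sanity-check the mappings against the current map (a sketch, using rule_id 1 from above; the dump file name is arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 18 --show-mappings
crushtool -i crushmap.bin --test --rule 1 --num-rep 18 --show-bad-mappings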


Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:

On Thu, Apr 7, 2022 at 12:15 AM Eugen Block <eblock@xxxxxx> wrote:
Basically, these are the steps to remove all OSDs from that host (OSDs
are not "replaced" so they aren't marked "destroyed") [1]:

1) ceph osd out $id
2) systemctl stop ceph-osd@$id
3) ceph osd purge $id --yes-i-really-mean-it

Ah, this purge step would still be enough to potentially cause the problem.

To be clear, then, this is the sequence I'm proposing could have been
the problem:
1. All OSDs on the host are purged per above.
2. New OSDs are created.
3. As they come up, one by one, CRUSH starts to assign PGs to them.
Importantly, when the first OSD comes up, it gets a large number of
PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't
activate.
4. As each of the remaining OSDs comes up, CRUSH re-assigns some PGs to them.
5. Finally, all OSDs are up. However, any PGs that were stuck in
"activating" from step 3 that were _not_ reassigned to other OSDs are
still stuck in "activating", and need a repeer or OSD down/up cycle to
restart peering for them. (At least in Pacific, tweaking
mon_max_pg_per_osd also allows some of these PGs to make peering
progress.)
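
In concrete terms, that would be something like the following (just a
sketch; the PG and OSD ids are placeholders for whatever the cluster
reports as stuck):

ceph pg dump_stuck inactive    # or: ceph pg ls activating
ceph pg repeer <pgid>          # restart peering for a single stuck PG
ceph osd down <osd-id>         # mark the acting primary down; it comes back up and re-peers its PGs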

This assumes that the CRUSH rule in question leads to this sort of
behaviour. I would expect this more from a host-centric CRUSH rule
than from a rack-centric one, for example. Also note that it's a
matter of "luck" as to whether you'll actually see a problem as of
step 5, since if it so happens that all of the PGs stuck in
"activating" in step 3 get assigned, the final state of the cluster
will be fine (i.e. no inactive PGs).

Josh



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


