Re: Ceph PGs stuck inactive after rebuild node

Hi,

It could be, yes. I've seen a case on a test cluster where thousands
of PGs were assigned to a single OSD even when the steady state was
far fewer than that.

How did you determine how many PGs were assigned to the OSDs? I looked at one of the OSD's logs and counted how many times each PG chunk of the affected pool was logged during startup, which gave me around 580 unique entries, but I'm not sure how representative that is. Do you have a better approach? If that is really what Ceph is trying to place on this OSD, you may be right about the mon_max_pg_per_osd limit you mentioned. But I'll need to verify that somehow before retrying anything on the customer's cluster.
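
In the meantime, my idea for the next attempt is to count the up/acting set membership per OSD directly from the pg dump instead of grepping OSD logs. A rough, untested sketch of what I have in mind (the JSON layout of "ceph pg dump pgs_brief" seems to differ a bit between releases, so the parsing below is an assumption I'd still have to verify):

#!/usr/bin/env python3
# Count how many PGs list each OSD in their "up" and "acting" sets,
# based on "ceph pg dump pgs_brief -f json".
import json
import subprocess
from collections import Counter

raw = subprocess.run(
    ["ceph", "pg", "dump", "pgs_brief", "-f", "json"],
    capture_output=True, text=True, check=True,
).stdout
data = json.loads(raw)

# Older releases return a plain list, newer ones nest it under "pg_stats".
if isinstance(data, dict):
    pgs = data.get("pg_stats") or data.get("pg_map", {}).get("pg_stats", [])
else:
    pgs = data

NONE = 0x7fffffff  # placeholder for "no OSD" in up/acting sets
up_count = Counter()
acting_count = Counter()
for pg in pgs:
    for osd in pg.get("up", []):
        if osd != NONE:
            up_count[osd] += 1
    for osd in pg.get("acting", []):
        if osd != NONE:
            acting_count[osd] += 1

# Show the most loaded OSDs first.
for osd, n in up_count.most_common(10):
    print("osd.%s: up=%d acting=%d" % (osd, n, acting_count[osd]))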

That's how I got around this issue in my test env. However, another
way to do this would be to not create the OSDs one-by-one at full
weight but rather bring them back at 0 weight and then upweight them
all bit by bit (or maybe even all at once would work?) to avoid the
temporary state.

I've thought about that, but there are far too many hosts and OSDs on the customer's cluster to reweight them all, which is why I'm looking for a way to avoid this issue in the first place. I'll try to reproduce it and see whether a temporary increase of mon_max_pg_per_osd helps to avoid it.
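
For the reproduction attempt I'd simply apply the override at runtime and drop it again afterwards, roughly like this (just a sketch; I'm assuming that setting the option in the "global" section covers whichever daemon actually enforces the limit):

#!/usr/bin/env python3
# Temporarily raise mon_max_pg_per_osd for a test, then revert to the default.
import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI.
    subprocess.run(["ceph", *args], check=True)

# Raise the limit only for the duration of the test.
ceph("config", "set", "global", "mon_max_pg_per_osd", "500")

# ... recreate the OSDs on the rebuilt node and watch for PGs stuck
#     in "activating" ...

# Remove the override again so we fall back to the default (250).
ceph("config", "rm", "global", "mon_max_pg_per_osd")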

Thanks,
Eugen

Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:

Hi Eugen,

thanks for your explanation, Josh. I think I understand now how
mon_max_pg_per_osd could have an impact here. The default seems to be
250, and each OSD currently has around 100 PGs; is this a potential
bottleneck?

It could be, yes. I've seen a case on a test cluster where thousands
of PGs were assigned to a single OSD even when the steady state was
far fewer than that.

I'll add the rule in question at the bottom, do you see a potential
issue there?

It does choose a host, which is similar to the case I had in mind.
(Though in my case the OSDs weren't purged and thus the host weight
was high, which sounds potentially different from your procedure...)

If I increase mon_max_pg_per_osd temporarily to, let's say, 500, would
this decrease the risk?

That's how I got around this issue in my test env. However, another
way to do this would be to not create the OSDs one-by-one at full
weight but rather bring them back at 0 weight and then upweight them
all bit by bit (or maybe even all at once would work?) to avoid the
temporary state.

And draining the OSDs before purging and rebuilding doesn't mean the
same can't happen again when the OSDs rejoin the cluster, right?

Right, because it's an issue of up-set assignment.

Everything above is of course speculation unless you catch this in the
act again...

Josh



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


