Re: Ceph PGs stuck inactive after rebuild node

Hi,

It could be, yes. I've seen a case on a test cluster where thousands
of PGs were assigned to a single OSD even when the steady state was
far fewer than that.

How did you determine how many PGs were assigned to the OSDs? I looked at one of the OSD's logs and counted how many times each PG chunk of the affected pool was logged during startup, which gave me around 580 unique entries, but I'm not sure how representative that is. Do you have a better approach? If that is really what Ceph is trying to place on this OSD, you may be right about the mon_max_pg_per_osd limit you mentioned. But I'll need to verify that somehow before retrying anything on the customer's cluster.
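
In the meantime, my idea for the next attempt is to count the up/acting set membership per OSD directly from the pg dump instead of grepping OSD logs. A rough, untested sketch of what I have in mind (the JSON layout of "ceph pg dump pgs_brief" seems to differ a bit between releases, so the parsing below is an assumption I'd still have to verify):

#!/usr/bin/env python3
# Count how many PGs list each OSD in their "up" and "acting" sets,
# based on "ceph pg dump pgs_brief -f json".
import json
import subprocess
from collections import Counter

raw = subprocess.run(
    ["ceph", "pg", "dump", "pgs_brief", "-f", "json"],
    capture_output=True, text=True, check=True,
).stdout
data = json.loads(raw)

# Older releases return a plain list, newer ones nest it under "pg_stats".
if isinstance(data, dict):
    pgs = data.get("pg_stats") or data.get("pg_map", {}).get("pg_stats", [])
else:
    pgs = data

NONE = 0x7fffffff  # placeholder for "no OSD" in up/acting sets
up_count = Counter()
acting_count = Counter()
for pg in pgs:
    for osd in pg.get("up", []):
        if osd != NONE:
            up_count[osd] += 1
    for osd in pg.get("acting", []):
        if osd != NONE:
            acting_count[osd] += 1

# Show the most loaded OSDs first.
for osd, n in up_count.most_common(10):
    print("osd.%s: up=%d acting=%d" % (osd, n, acting_count[osd]))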

That's how I got around this issue in my test env. However, another
way to do this would be to not create the OSDs one-by-one at full
weight but rather bring them back at 0 weight and then upweight them
all bit by bit (or maybe even all at once would work?) to avoid the
temporary state.

I've thought about that, but there are far too many hosts and OSDs on the customer's cluster to reweight them all, which is why I'm looking for a way to avoid this issue in the first place. I'll try to reproduce it and see whether a temporary increase of mon_max_pg_per_osd helps to avoid it.
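
For the reproduction attempt I'd simply apply the override at runtime and drop it again afterwards, roughly like this (just a sketch; I'm assuming that setting the option in the "global" section covers whichever daemon actually enforces the limit):

#!/usr/bin/env python3
# Temporarily raise mon_max_pg_per_osd for a test, then revert to the default.
import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI.
    subprocess.run(["ceph", *args], check=True)

# Raise the limit only for the duration of the test.
ceph("config", "set", "global", "mon_max_pg_per_osd", "500")

# ... recreate the OSDs on the rebuilt node and watch for PGs stuck
#     in "activating" ...

# Remove the override again so we fall back to the default (250).
ceph("config", "rm", "global", "mon_max_pg_per_osd")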

Thanks,
Eugen

Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:

Hi Eugen,

thanks for your explanation, Josh. I think I understand now how
mon_max_pg_per_osd could have an impact here. The default seems to be
250, and each OSD currently has around 100 PGs; is this a potential
bottleneck?

It could be, yes. I've seen a case on a test cluster where thousands
of PGs were assigned to a single OSD even when the steady state was
far fewer than that.

I'll add the rule in question at the bottom, do you see a potential
issue there?

It does choose a host, which is similar to the case I had in mind.
(Though in my case the OSDs weren't purged and thus the host weight
was high, which sounds potentially different from your procedure...)

If I increase mon_max_pg_per_osd temporarily to, let's say, 500, would
this decrease the risk?

That's how I got around this issue in my test env. However, another
way to do this would be to not create the OSDs one-by-one at full
weight but rather bring them back at 0 weight and then upweight them
all bit by bit (or maybe even all at once would work?) to avoid the
temporary state.

And draining the OSDs before purging and rebuilding doesn't mean the
same can't happen again when the OSDs rejoin the cluster, right?

Right, because it's an issue of up-set assignment.

Everything above is of course speculation unless you catch this in the
act again...

Josh



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


