RE: PG stuck unclean after rebalance-by-weight

Hi Samuel

The lack of a reply and some off-list conversations prompted me to test a few things over the weekend:

1) Playing around with different crushmap settings, with no success - no test produced any bad mappings or needed more than 50 attempts to place a PG (I recall seeing up to 23-25 attempts)
2) Lowering the OSD weight further made more PGs become stuck unclean (the commands I used for 2-5 are sketched below the list)
3) Lowering the weight to 0 and/or marking osd.1 out moved the PG onto another acting OSD, but the PG was still stuck unclean
4) Raising the OSD weight again in small increments (0.05) eventually made the PG active+clean
5) Lowering another OSD's OSD weight made other PGs become stuck unclean, while raising the weight again made the PGs recover.
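
As mentioned above, the commands I used for 2)-5) were roughly along these lines (OSD ids and weight values are just examples, not the exact values from my cluster):

# Lower the override (reweight) weight of an OSD and watch which PGs get stuck:
ceph osd reweight 1 0.5
ceph pg dump_stuck unclean
ceph health detail

# Mark the OSD out / back in (same effect as reweight 0 / 1):
ceph osd out 1
ceph osd in 1

# Raise the weight again in small steps until the PG goes active+clean:
ceph osd reweight 1 0.55
ceph osd reweight 1 0.6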

.. and then it dawned (and I hope I'm right..):

Since the CRUSH weight of nodes ceph4 and ceph5 is relatively high compared to nodes ceph1-3, a lower OSD weight for the OSDs on ceph1-3 will eventually cause problems for CRUSH, since I have "type host" and not "type osd" in my ruleset's chooseleaf step - my cluster is simply not evenly dimensioned enough for a "proper" reweight to succeed.
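
For reference, the chooseleaf step I mean is the one in the (more or less stock) replicated rule in the decompiled crushmap - the rule name and min/max sizes here are just the defaults, my actual rule may differ slightly:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# With "type host" every replica must land under a different host bucket,
# so CRUSH can only draw from as many hosts as the rule can reach.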

This imbalance, combined with a lower and lower OSD weight, makes it harder and harder for CRUSH to place a PG onto the 6 OSDs (osd.0-5 on nodes ceph1-3), up to the point where CRUSH has tried "choose_total_tries" times and then gives up.

This made me raise the OSD weight to 0.7 for the 6 OSDs, which in turn left 1 PG stuck unclean.
Then by raising choose_total_tries to 100 (I did not test 75) the PG became active+clean.
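
In case it helps anyone else, the workflow I used to change the tunable was roughly the usual decompile/edit/recompile cycle (filenames are just examples):

# Grab and decompile the current crushmap:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt and change the line
#   tunable choose_total_tries 50
# to
#   tunable choose_total_tries 100

# Recompile and inject it:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new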

.. and that's where I am now.

I assume that I really ought to correct the imbalance in the hardware before changing the weight of any OSD again, and I take this experience as a lesson learned.

Perhaps a small writeup in the docs is in order, pointing out that trying to correct an imbalanced cluster by reweighting OSDs is something to be avoided?


BTW, why should I change min_size to 2?
I know that the default number of copies should never be lower than 3 to avoid some possible nasty errors, but as I understand the documentation (it's even mentioned in the first box at the top of http://docs.ceph.com/docs/master/rados/configuration/pool-pg-config-ref/), setting it to 1 only allows writes to be accepted in a severely degraded state - i.e. if only 1 node is available.
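
For completeness, this is the kind of check/change I assume you mean (the pool name "rbd" is just an example):

# Show the current replication settings for a pool:
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Raise min_size so writes are refused once fewer than 2 copies are available:
ceph osd pool set rbd min_size 2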

Regards,
Jens Dueholm Christensen
Rambøll Survey IT

On Friday, January 06, 2017 4:54 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

> You should not have min_size set to 1, set it to 2 (not related to
> your problem, but important).

> http://docs.ceph.com/docs/master/rados/operations/crush-map/ has a
> summary of the crush map tunables.

> You might try kicking tunable choose_total_tries 50 to 75.  You should
> use osdmaptool on an osdmap grabbed from your cluster to experiment
> with new crushmap settings without actually injecting the new
> crushmap.
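
(For anyone reading this in the archives: the osdmaptool dry-run Samuel suggests can be done roughly like this - filenames and the pool id are just examples:)

# Grab the current osdmap, swap in the candidate crushmap, and test the
# resulting PG mappings without touching the live cluster:
ceph osd getmap -o osdmap.bin
osdmaptool osdmap.bin --import-crush crushmap.new
osdmaptool osdmap.bin --test-map-pgs --pool 0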
