Re: PG that should not be undersized+degraded on a multi-datacenter Ceph cluster

On 06/06/17 19:23, Alejandro Comisario wrote:
> Hi all, I have a multi-datacenter, 6-node (6 OSD) Ceph Jewel cluster.
> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>
> Today I shut down all three nodes (controlled and in order) in
> datacenter "CPD2", just to validate that everything keeps working on
> "CPD1", which it did (including rebalancing of the information).
>
> After everything was off on CPD2, the "osd tree" looked like this,
> which seems OK.
>
> root@oskceph01:~# ceph osd tree
> ID WEIGHT   TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 30.00000 root default
> -8 15.00000     datacenter CPD1
> -2  5.00000         host oskceph01
>  0  5.00000             osd.0           up  1.00000          1.00000
> -6  5.00000         host oskceph05
>  4  5.00000             osd.4           up  1.00000          1.00000
> -4  5.00000         host oskceph03
>  2  5.00000             osd.2           up  1.00000          1.00000
> -9 15.00000     datacenter CPD2
> -3  5.00000         host oskceph02
>  1  5.00000             osd.1         down        0          1.00000
> -5  5.00000         host oskceph04
>  3  5.00000             osd.3         down        0          1.00000
> -7  5.00000         host oskceph06
>  5  5.00000             osd.5         down        0          1.00000
>
> ...
>
> root@oskceph01:~# ceph pg dump | egrep degrad
> dumped all in format plain
> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
> 19:07:06.615674
>
> For some strange reason, I see that the acting set is [0,2]; I don't
> see osd.4 in the acting set and, honestly, I don't know why.
>
> ...
I'm assuming your failure domain is host, not datacenter? (Otherwise
you'd never get [0,2], and size 3 could never work either.)
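
If you want to double-check, the failure domain shows up in the rule
section of the decompiled CRUSH map (or in "ceph osd crush rule dump").
A host-level replicated rule looks roughly like this (the rule name and
numbers here are just the Jewel defaults; yours may differ):

> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }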

So it looks like a problem I had and solved this week... I had 60
OSDs, with 19 of them down waiting to be replaced, and one PG out of
1152 wouldn't peer. Then I randomly realized what was wrong: there's a
CRUSH tunable, choose_total_tries, that you can increase so PGs that
failed to find an OSD within that many attempts will keep trying:

> ceph osd getcrushmap -o crushmap
> crushtool -d crushmap -o crushmap.txt
> vim crushmap.txt
>     here you raise tunable choose_total_tries; the default is
> 50. 100 worked for me the first time, and later I raised it again
> to 200.
> crushtool -c crushmap.txt -o crushmap.new
> ceph osd setcrushmap -i crushmap.new
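
For reference, the line to change sits in the tunables block near the
top of crushmap.txt. With the default profile that block looks something
like this (your other tunables may differ; choose_total_tries is shown
already raised to 100):

> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 100
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1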

If anything goes wrong with the new crushmap, you can always set the old
one again:
> ceph osd setcrushmap -i crushmap

Then you have to wait some time, maybe 30 seconds, before the PGs start peering.
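
To see whether it helped, you can watch the cluster state and query the
stuck PG directly; something along these lines (using the 8.1b3 pgid
from your dump) should do:

> ceph -s
> ceph pg dump_stuck unclean
> ceph pg 8.1b3 query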

Now if only there were a log message or a warning in ceph -s saying that
the number of tries was exceeded, this solution would be more obvious
(and we would know whether it applies to you)...

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


