Peter, hi.
Thanks for the reply; let me check that out and get back to you.

On Wed, Jun 7, 2017 at 4:13 AM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 06/06/17 19:23, Alejandro Comisario wrote:
>> Hi all, I have a multi-datacenter, 6-node (6 OSD) Ceph Jewel cluster.
>> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>>
>> Today I shut down all three nodes (controlled and in order) in
>> datacenter "CPD2" just to validate that everything keeps working on
>> "CPD1", which it did (including rebalance of the information).
>>
>> After everything was off in CPD2, the "osd tree" looks like this,
>> which seems OK:
>>
>> root@oskceph01:~# ceph osd tree
>> ID WEIGHT   TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 30.00000 root default
>> -8 15.00000     datacenter CPD1
>> -2  5.00000         host oskceph01
>>  0  5.00000             osd.0           up  1.00000          1.00000
>> -6  5.00000         host oskceph05
>>  4  5.00000             osd.4           up  1.00000          1.00000
>> -4  5.00000         host oskceph03
>>  2  5.00000             osd.2           up  1.00000          1.00000
>> -9 15.00000     datacenter CPD2
>> -3  5.00000         host oskceph02
>>  1  5.00000             osd.1         down        0          1.00000
>> -5  5.00000         host oskceph04
>>  3  5.00000             osd.3         down        0          1.00000
>> -7  5.00000         host oskceph06
>>  5  5.00000             osd.5         down        0          1.00000
>>
>> ...
>>
>> root@oskceph01:~# ceph pg dump | egrep degrad
>> dumped all in format plain
>> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
>> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
>> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
>> 19:07:06.615674
>>
>> For some strange reason, I see that the acting set is [0,2]; I don't
>> see osd.4 in the acting set, and honestly, I don't know why.
>>
>> ...
> I'm assuming you have the failure domain set to host, not datacenter?
> (Otherwise you'd never get [0,2]... and size 3 could never work either.)
>
> So then it looks like a problem I had and solved this week... I had 60
> OSDs with 19 down to be replaced, and one PG out of 1152 wouldn't peer.
> Randomly I realized what was wrong... there's a tunable,
> "choose_total_tries", you can increase so the PGs that tried to find an
> OSD that many times and failed will try more:
>
>> ceph osd getcrushmap -o crushmap
>> crushtool -d crushmap -o crushmap.txt
>> vim crushmap.txt
>>     here you raise the tunable choose_total_tries... the default is
>>     50. 100 worked for me the first time, and then later I changed it
>>     again to 200.
>> crushtool -c crushmap.txt -o crushmap.new
>> ceph osd setcrushmap -i crushmap.new
>
> If anything goes wrong with the new crushmap, you can always set the old
> one again:
>> ceph osd setcrushmap -i crushmap
>
> Then you have to wait some time, maybe 30 s, before the PGs peer.
>
> Now, if only there were a log message or warning in "ceph -s" saying the
> tries were exceeded, this solution would be more obvious (and we would
> know whether it applies to you)...

--
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com
Cell: +54 9 11 3770 1857
www.nubeliu.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
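
A minimal sketch of the first checks implied above, assuming the stuck PG is
8.1b3 (taken from the pg dump in the thread) and that the pool uses a standard
replicated rule; the rule dump shows which bucket type the chooseleaf step
uses, i.e. the real failure domain Peter is asking about:

    # Peering details for the undersized PG: up/acting sets, past intervals,
    # and the peering state machine's view of why only two OSDs were chosen.
    ceph pg 8.1b3 query

    # Dump the CRUSH rules to confirm whether the affected pool's rule does
    # "step chooseleaf ... type host" or "... type datacenter".
    ceph osd crush rule dump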
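
Peter's crushmap steps, written out as a sketch showing what the tunable line
looks like in the decompiled map, plus an optional crushtool --test dry run
before injecting the new map (the rule id 0 is an assumption; take the real id
from "ceph osd crush rule dump"):

    ceph osd getcrushmap -o crushmap
    crushtool -d crushmap -o crushmap.txt

    # In crushmap.txt the tunables block sits at the top; raise the line
    #   tunable choose_total_tries 50
    # to something like
    #   tunable choose_total_tries 100

    crushtool -c crushmap.txt -o crushmap.new

    # Optional sanity check: map test inputs through rule 0 with 3 replicas
    # and report any inputs that could not be mapped to 3 OSDs.
    crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings

    ceph osd setcrushmap -i crushmap.new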
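
And a few standard commands to watch whether the change actually lets the PG
peer and go active+clean again (nothing here is specific to this cluster):

    # Cluster-wide summary; the undersized+degraded count should drop.
    ceph -s

    # Per-PG detail for anything still flagged unhealthy.
    ceph health detail

    # PGs that are still not active+clean.
    ceph pg dump_stuck unclean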