On 06/06/17 19:23, Alejandro Comisario wrote:
> Hi all, I have a multi-datacenter, 6 node (6 osd) ceph jewel cluster.
> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>
> Today I shut down all three nodes (controlled and in order) in
> datacenter "CPD2" just to validate that everything keeps working on
> "CPD1", which it did (including rebalancing of the information).
>
> After everything was off on CPD2, the "osd tree" looks like this,
> which seems ok:
>
> root@oskceph01:~# ceph osd tree
> ID WEIGHT   TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 30.00000 root default
> -8 15.00000     datacenter CPD1
> -2  5.00000         host oskceph01
>  0  5.00000             osd.0           up  1.00000          1.00000
> -6  5.00000         host oskceph05
>  4  5.00000             osd.4           up  1.00000          1.00000
> -4  5.00000         host oskceph03
>  2  5.00000             osd.2           up  1.00000          1.00000
> -9 15.00000     datacenter CPD2
> -3  5.00000         host oskceph02
>  1  5.00000             osd.1         down        0          1.00000
> -5  5.00000         host oskceph04
>  3  5.00000             osd.3         down        0          1.00000
> -7  5.00000         host oskceph06
>  5  5.00000             osd.5         down        0          1.00000
>
> ...
>
> root@oskceph01:~# ceph pg dump | egrep degrad
> dumped all in format plain
> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
> 19:07:06.615674
>
> For some strange reason, I see that the acting set is [0,2]. I don't
> see osd.4 in the acting set and, honestly, I don't know why.
>
> ...

I'm assuming you have the failure domain set to host, not datacenter?
(Otherwise you'd never get [0,2] ... and size 3 could never work either.)

So then it looks like a problem I had and solved this week... I had 60
osds with 19 down, waiting to be replaced, and one pg out of 1152
wouldn't peer. Randomly I realized what was wrong: there's a CRUSH
tunable, "choose_total_tries", that you can increase, so that pgs which
failed to find an osd after that many tries will try more (see the
example tunables block at the end of this mail):

> ceph osd getcrushmap -o crushmap
> crushtool -d crushmap -o crushmap.txt
> vim crushmap.txt
>    Here you change "tunable choose_total_tries" to something higher...
>    the default is 50. 100 worked for me the first time, and then later
>    I changed it again to 200.
> crushtool -c crushmap.txt -o crushmap.new
> ceph osd setcrushmap -i crushmap.new

If anything goes wrong with the new crushmap, you can always set the old
one again:

> ceph osd setcrushmap -i crushmap

Then you have to wait some time, maybe 30s, before the pgs start
peering.

Now if only there were a log message or a warning in "ceph -s" saying
that the number of tries was exceeded, this solution would be more
obvious (and we would know whether it applies to you)...
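P.S. For reference, the tunables block at the top of crushmap.txt (as
produced by "crushtool -d") looks roughly like the sketch below on a
jewel-era cluster. The exact list of tunables depends on which tunables
profile your cluster is on, so treat the values here as an example and
only touch the choose_total_tries line:

> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> # raised from the default of 50:
> tunable choose_total_tries 100
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> # ... devices, buckets and rules follow below ...

If I remember correctly, "ceph osd crush show-tunables" will also print
the current values (including choose_total_tries) without pulling and
decompiling the map, which is a quick way to check what you're starting
from before editing.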