PG that should not be undersized+degraded on a multi-datacenter Ceph cluster

Hi all, I have a multi-datacenter, 6-node (6 OSD) Ceph Jewel cluster.
There are 3 pools in the cluster, all three with size 3 and min_size 2.
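
For reference, the replication settings were applied in the usual way,
along these lines (the pool name "rbd" is just an example here, my pool
names differ):

root@oskceph01:~# ceph osd pool set rbd size 3
root@oskceph01:~# ceph osd pool set rbd min_size 2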

Today I shut down all three nodes (controlled and in order) in
datacenter "CPD2", just to validate that everything keeps working in
"CPD1", which it did (including rebalancing of the data).

After everything was off in CPD2, the "osd tree" looked like this,
which seems OK:

root@oskceph01:~# ceph osd tree
ID WEIGHT   TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 30.00000 root default
-8 15.00000     datacenter CPD1
-2  5.00000         host oskceph01
 0  5.00000             osd.0           up  1.00000          1.00000
-6  5.00000         host oskceph05
 4  5.00000             osd.4           up  1.00000          1.00000
-4  5.00000         host oskceph03
 2  5.00000             osd.2           up  1.00000          1.00000
-9 15.00000     datacenter CPD2
-3  5.00000         host oskceph02
 1  5.00000             osd.1         down        0          1.00000
-5  5.00000         host oskceph04
 3  5.00000             osd.3         down        0          1.00000
-7  5.00000         host oskceph06
 5  5.00000             osd.5         down        0          1.00000
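
Since placement across the datacenters depends on the CRUSH rule, it
may be worth noting that as far as I know I have not customized it from
the default replicated ruleset; it can be dumped with:

root@oskceph01:~# ceph osd crush rule dump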

Given that tree, all PGs should have OSDs 0, 2 and 4 as their acting
set. But "ceph health detail" shows me this weird PG in
undersized+degraded state:

root@oskceph01:~# ceph health detail
HEALTH_WARN 1 pgs degraded; 1 pgs stuck unclean; 1 pgs undersized;
recovery 178/310287 objects degraded (0.057%); too many PGs per OSD
(1835 > max 300)
pg 8.1b3 is stuck unclean for 7735.364142, current state
active+undersized+degraded, last acting [0,2]
pg 8.1b3 is active+undersized+degraded, acting [0,2]
recovery 178/310287 objects degraded (0.057%)
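
If the full peering state of that PG helps, I can attach the output of
a query like this (assuming that is the right command for it):

root@oskceph01:~# ceph pg 8.1b3 query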

the "pg dump" command shows as follow.

root@oskceph01:~# ceph pg dump | egrep degrad
dumped all in format plain
8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
19:07:06.615674

For some strange reason the acting set is [0,2]; I don't see osd.4 in
the acting set, and honestly I don't know why. Note that the "up" set
is also [0,2], so this does not look like recovery lagging behind:
CRUSH itself seems to be returning only two OSDs for this PG.
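
The current mapping from the osdmap can also be shown directly, if I
understand the command correctly:

root@oskceph01:~# ceph pg map 8.1b3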

Tried "pg repair" with no luck, and dont know what's the right way
then to fix/understand whats going on.
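
My next idea was to test the CRUSH map offline, to see whether the rule
can even produce 3 OSDs while CPD2 is down, something like this (the
--weight overrides are meant to simulate osds 1, 3 and 5 being out; not
sure this is the right approach):

root@oskceph01:~# ceph osd getcrushmap -o crush.bin
root@oskceph01:~# crushtool -i crush.bin --test --rule 0 --num-rep 3 \
    --weight 1 0 --weight 3 0 --weight 5 0 --show-mappings

Would that be a valid way to confirm whether CRUSH is giving up before
finding a third OSD?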


thanks!
-- 
Alejandrito
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


