HEALTH_WARN and OSDs out of their buckets

Dear list,

Our Ceph cluster (version 0.87) is stuck in HEALTH_WARN with some OSDs out of their original CRUSH bucket:

     health HEALTH_WARN 1097 pgs degraded; 15 pgs peering; 1 pgs recovering; 1097 pgs stuck degraded; 16 pgs stuck inactive; 26148 pgs stuck unclean; 1096 pgs stuck undersized; 1096 pgs undersized; 4 requests are blocked > 32 sec; recovery 101465/6016350 objects degraded (1.686%); 1691712/6016350 objects misplaced (28.119%)
     monmap e2: 3 mons at {mon1-r2-ser=172.19.14.130:6789/0,mon1-r3-ser=172.19.14.150:6789/0,mon1-rc3-fib=172.19.14.170:6789/0}, election epoch 82, quorum 0,1,2 mon1-r2-ser,mon1-r3-ser,mon1-rc3-fib
     osdmap e15358: 144 osds: 143 up, 143 in
      pgmap v12209990: 38816 pgs, 16 pools, 8472 GB data, 1958 kobjects
            25821 GB used, 234 TB / 259 TB avail
            101465/6016350 objects degraded (1.686%); 1691712/6016350 objects misplaced (28.119%)
                 620 active
               12668 active+clean
                  15 peering
                 395 active+undersized+degraded+remapped
                   1 active+recovering+degraded
               24416 active+remapped
                   1 undersized+degraded
                 700 active+undersized+degraded
  client io 0 B/s rd, 40557 B/s wr, 13 op/s
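
In case it is useful, this is how I have been digging into the blocked requests and the stuck PGs (standard commands; the PG id in the last one is just an example, not one of ours):

ceph health detail | grep -E 'blocked|stuck'   # which OSDs have blocked requests, which PGs are stuck
ceph pg dump_stuck unclean                     # PGs stuck unclean, with their up/acting OSD sets
ceph pg 3.5f query                             # peering/recovery details for a single stuck PG (example id)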

Yesterday it was just in a warning state, with some PGs stuck unclean and some requests blocked. When I restarted one of the OSDs involved, a recovery process started, some OSDs went down and then came back up, and some others were put out of their original bucket:

# id    weight  type name       up/down reweight
-1      262.1   root default
-15     80.08           datacenter fibonacci
-16     80.08                   rack rack-c03-fib
............
-35	83.72		datacenter ingegneria
-31	0			rack rack-01-ing
-32	0			rack rack-02-ing
-33	0			rack rack-03-ing
-34	0			rack rack-04-ing
-18	83.72			rack rack-03-ser
-13	20.02				host-high-end cnode1-r3-ser
124	1.82					osd.124	up	1
126	1.82					osd.126	up	1
128	1.82					osd.128	up	1
133	1.82					osd.133	up	1
135	1.82					osd.135	up	1
…………
145	1.82					osd.145	up	1
146	1.82					osd.146	up	1
147	1.82					osd.147	up	1
148	1.82					osd.148	up	1
5	1.82		osd.5	up	1
150	1.82		osd.150	up	1
153	1.82		osd.153	up	1
80	1.82		osd.80	up	1
24	1.82		osd.24	up	1
131	1.82		osd.131	up	1

Now, if I put an OSD back into its own bucket by hand it works (see the example command below), but I still have some concerns: why has the recovery process stopped? The cluster is almost empty, so there is enough space to recover the data even without those 6 OSDs. Has anyone experienced this before?
Any advice on what to look for?
Any help is appreciated.
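
(For reference, this is roughly what I run to put an OSD back into its bucket by hand; the weight and the bucket path are taken from our tree above, and osd.5 / cnode1-r3-ser are only an example, it has to be the host the OSD really belongs to:

ceph osd find 5     # shows where CRUSH currently places osd.5
ceph osd crush set osd.5 1.82 root=default datacenter=ingegneria rack=rack-03-ser host-high-end=cnode1-r3-ser

After that the OSD shows up again under its host in "ceph osd tree".)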

Regards
Simone



--
Simone Spinelli <simone.spinelli@xxxxxxxx>
Università di Pisa
Settore Rete, Telecomunicazioni e Fonia - Serra
Direzione Edilizia e Telecomunicazioni
