Dear list,
Our Ceph cluster (version 0.87) is stuck in a warning state, with some OSDs
out of their original CRUSH bucket. Here is the current ceph -s output:
     health HEALTH_WARN 1097 pgs degraded; 15 pgs peering; 1 pgs recovering;
            1097 pgs stuck degraded; 16 pgs stuck inactive; 26148 pgs stuck unclean;
            1096 pgs stuck undersized; 1096 pgs undersized;
            4 requests are blocked > 32 sec;
            recovery 101465/6016350 objects degraded (1.686%);
            1691712/6016350 objects misplaced (28.119%)
     monmap e2: 3 mons at {mon1-r2-ser=172.19.14.130:6789/0,
            mon1-r3-ser=172.19.14.150:6789/0,mon1-rc3-fib=172.19.14.170:6789/0},
            election epoch 82, quorum 0,1,2 mon1-r2-ser,mon1-r3-ser,mon1-rc3-fib
     osdmap e15358: 144 osds: 143 up, 143 in
      pgmap v12209990: 38816 pgs, 16 pools, 8472 GB data, 1958 kobjects
            25821 GB used, 234 TB / 259 TB avail
            101465/6016350 objects degraded (1.686%); 1691712/6016350 objects misplaced (28.119%)
                 620 active
               12668 active+clean
                  15 peering
                 395 active+undersized+degraded+remapped
                   1 active+recovering+degraded
               24416 active+remapped
                   1 undersized+degraded
                 700 active+undersized+degraded
  client io 0 B/s rd, 40557 B/s wr, 13 op/s
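
For completeness, the status above is plain ceph -s output; if more detail is
useful, I can also post the per-PG breakdown from the usual commands, e.g.:

  ceph health detail
  ceph pg dump_stuck unclean
  ceph pg dump_stuck inactive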
Yesterday the cluster was only in a warning state, with some PGs stuck unclean
and some blocked requests. When I restarted one of the OSDs involved, a
recovery process started, some OSDs went down and came back up, and some others
were moved out of their original bucket (the restart command is sketched right
after the tree below). Here is the relevant part of the ceph osd tree output:
# id    weight  type name       up/down reweight
-1      262.1   root default
-15     80.08       datacenter fibonacci
-16     80.08           rack rack-c03-fib
............
-35     83.72       datacenter ingegneria
-31     0               rack rack-01-ing
-32     0               rack rack-02-ing
-33     0               rack rack-03-ing
-34     0               rack rack-04-ing
-18     83.72           rack rack-03-ser
-13     20.02               host-high-end cnode1-r3-ser
124     1.82                    osd.124 up      1
126     1.82                    osd.126 up      1
128     1.82                    osd.128 up      1
133     1.82                    osd.133 up      1
135     1.82                    osd.135 up      1
…………
145     1.82                    osd.145 up      1
146     1.82                    osd.146 up      1
147     1.82                    osd.147 up      1
148     1.82                    osd.148 up      1
5       1.82                    osd.5   up      1
150     1.82                    osd.150 up      1
153     1.82                    osd.153 up      1
80      1.82                    osd.80  up      1
24      1.82                    osd.24  up      1
131     1.82                    osd.131 up      1
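
For reference, the restart mentioned above was done with the standard init
scripts, along these lines (osd.124 is only an example id here, not necessarily
the OSD I actually restarted):

  service ceph restart osd.124
  # or, on upstart-based systems:
  restart ceph-osd id=124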
Now, if I move an OSD back into its own bucket by hand it works (the kind of
command I used is sketched below), but I still have some concerns: why has the
recovery process stopped? The cluster is almost empty, so there should be
enough space to recover the data even without 6 OSDs. Has anyone already
experienced this?
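
To be concrete, by "put by hand" I mean re-setting the OSD's CRUSH location,
along these lines (the id, weight and type=name pairs are placeholders, since
each OSD has to go back to its own original bucket; our tree uses the custom
bucket type host-high-end):

  ceph osd crush set osd.<id> <weight> root=default datacenter=<dc> rack=<rack> host-high-end=<host>

(ceph osd crush create-or-move takes the same arguments and can be used
instead.)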
Any advice on what to look for?
Any help is appreciated.
Regards
Simone
--
Simone Spinelli <simone.spinelli@xxxxxxxx>
Università di Pisa
Settore Rete, Telecomunicazioni e Fonia - Serra
Direzione Edilizia e Telecomunicazioni
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com