On Friday, January 23, 2015, Glen Aidukas <GAidukas@xxxxxxxxxxxxxxxxxx> wrote:
Hello fellow ceph users,
I ran into a major issue where two KVM hosts will not start due to problems with my Ceph cluster.
Here are some details:
Running ceph version 0.87. There are 10 hosts with 6 drives each for 60 OSDs.
# ceph -s
cluster 1431e336-faa2-4b13-b50d-c1d375b4e64b
health HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 71 requests are blocked > 32 sec; pool rbd-b has too few pgs
monmap e1: 3 mons at {xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx}, election epoch 92, quorum 0,1,2 ceph-b01,ceph-b02,ceph-b03
mdsmap e49: 1/1/1 up {0=pmceph-b06=up:active}, 1 up:standby
osdmap e10023: 60 osds: 60 up, 60 in
pgmap v19851672: 45056 pgs, 22 pools, 13318 GB data, 3922 kobjects
39863 GB used, 178 TB / 217 TB avail
45049 active+clean
7 incomplete
client io 954 kB/s rd, 386 kB/s wr, 78 op/s
# ceph health detail
HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 69 requests are blocked > 32 sec; 5 osds have slow requests; pool rbd-b has too few pgs
pg 3.38b is stuck inactive since forever, current state incomplete, last acting [48,35,2]
pg 1.541 is stuck inactive since forever, current state incomplete, last acting [48,20,2]
pg 3.57d is stuck inactive for 15676.967208, current state incomplete, last acting [55,48,2]
pg 3.5c9 is stuck inactive since forever, current state incomplete, last acting [48,2,15]
pg 3.540 is stuck inactive for 15676.959093, current state incomplete, last acting [57,48,2]
pg 3.5a5 is stuck inactive since forever, current state incomplete, last acting [2,48,57]
pg 3.305 is stuck inactive for 15676.855987, current state incomplete, last acting [39,2,48]
pg 3.38b is stuck unclean since forever, current state incomplete, last acting [48,35,2]
pg 1.541 is stuck unclean since forever, current state incomplete, last acting [48,20,2]
pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last acting [55,48,2]
pg 3.5c9 is stuck unclean since forever, current state incomplete, last acting [48,2,15]
pg 3.540 is stuck unclean for 15676.963204, current state incomplete, last acting [57,48,2]
pg 3.5a5 is stuck unclean since forever, current state incomplete, last acting [2,48,57]
pg 3.305 is stuck unclean for 15676.860098, current state incomplete, last acting [39,2,48]
pg 3.5c9 is incomplete, acting [48,2,15] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5a5 is incomplete, acting [2,48,57] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.57d is incomplete, acting [55,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.540 is incomplete, acting [57,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 1.541 is incomplete, acting [48,20,2] (reducing pool metadata min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.38b is incomplete, acting [48,35,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.305 is incomplete, acting [39,2,48] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
20 ops are blocked > 2097.15 sec
49 ops are blocked > 1048.58 sec
13 ops are blocked > 2097.15 sec on osd.2
7 ops are blocked > 2097.15 sec on osd.39
3 ops are blocked > 1048.58 sec on osd.39
41 ops are blocked > 1048.58 sec on osd.48
4 ops are blocked > 1048.58 sec on osd.55
1 ops are blocked > 1048.58 sec on osd.57
5 osds have slow requests
pool rbd-b objects per pg (1084) is more than 12.1798 times cluster average (89)
I ran the following, but it did not help:
# ceph health detail | grep ^pg | cut -c4-9 | while read i; do ceph pg repair ${i} ; done
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 3.305 on osd.39 to repair
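Each of the 7 PGs is listed three times in the health detail output (stuck inactive, stuck unclean, incomplete), which is why the loop above issued 21 repair requests. A minimal sketch of a deduplicated variant follows, assuming every PG line starts with "pg" followed by the PG ID as in the output above; unique_pgs is a hypothetical helper name, not a ceph command:

```shell
# Print each PG ID once from `ceph health detail` text read on stdin.
# Assumes PG lines have the form "pg <id> ..." as in the output above.
unique_pgs() {
    awk '$1 == "pg" { print $2 }' | sort -u
}

# Usage (needs a working ceph CLI, so it is left commented out here):
#   ceph health detail | unique_pgs | while read -r pg; do
#       ceph pg repair "$pg"
#   done
```

Selecting the second field with awk avoids depending on fixed character columns, so PG IDs of different lengths are handled as well.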
Also, if I run the following command, it just hangs:
rbd -p rbd-b info vm-50193-disk-1    <- hangs until I press Ctrl-C
Any help would be greatly appreciated!
Glen Aidukas
Manager IT Infrastructure
t: 610.813.2815
BehaviorMatrix, LLC | 676 Dekalb Pike, Suite 200, Blue Bell, PA, 19422