Hello fellow Ceph users,

I ran into a major issue where two KVM hosts will not start because of problems with my Ceph cluster.

Here are some details:

Running Ceph version 0.87. There are 10 hosts with 6 drives each, for 60 OSDs.

# ceph -s
    cluster 1431e336-faa2-4b13-b50d-c1d375b4e64b
     health HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 71 requests are blocked > 32 sec; pool rbd-b has too few pgs
     monmap e1: 3 mons at {xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx}, election epoch 92, quorum 0,1,2 ceph-b01,ceph-b02,ceph-b03
     mdsmap e49: 1/1/1 up {0=pmceph-b06=up:active}, 1 up:standby
     osdmap e10023: 60 osds: 60 up, 60 in
      pgmap v19851672: 45056 pgs, 22 pools, 13318 GB data, 3922 kobjects
            39863 GB used, 178 TB / 217 TB avail
               45049 active+clean
                   7 incomplete
  client io 954 kB/s rd, 386 kB/s wr, 78 op/s

# ceph health detail
HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 69 requests are blocked > 32 sec; 5 osds have slow requests; pool rbd-b has too few pgs
pg 3.38b is stuck inactive since forever, current state incomplete, last acting [48,35,2]
pg 1.541 is stuck inactive since forever, current state incomplete, last acting [48,20,2]
pg 3.57d is stuck inactive for 15676.967208, current state incomplete, last acting [55,48,2]
pg 3.5c9 is stuck inactive since forever, current state incomplete, last acting [48,2,15]
pg 3.540 is stuck inactive for 15676.959093, current state incomplete, last acting [57,48,2]
pg 3.5a5 is stuck inactive since forever, current state incomplete, last acting [2,48,57]
pg 3.305 is stuck inactive for 15676.855987, current state incomplete, last acting [39,2,48]
pg 3.38b is stuck unclean since forever, current state incomplete, last acting [48,35,2]
pg 1.541 is stuck unclean since forever, current state incomplete, last acting [48,20,2]
pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last acting [55,48,2]
pg 3.5c9 is stuck unclean since forever, current state incomplete, last acting [48,2,15]
pg 3.540 is stuck unclean for 15676.963204, current state incomplete, last acting [57,48,2]
pg 3.5a5 is stuck unclean since forever, current state incomplete, last acting [2,48,57]
pg 3.305 is stuck unclean for 15676.860098, current state incomplete, last acting [39,2,48]
pg 3.5c9 is incomplete, acting [48,2,15] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5a5 is incomplete, acting [2,48,57] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.57d is incomplete, acting [55,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.540 is incomplete, acting [57,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 1.541 is incomplete, acting [48,20,2] (reducing pool metadata min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.38b is incomplete, acting [48,35,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.305 is incomplete, acting [39,2,48] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
20 ops are blocked > 2097.15 sec
49 ops are blocked > 1048.58 sec
13 ops are blocked > 2097.15 sec on osd.2
7 ops are blocked > 2097.15 sec on osd.39
3 ops are blocked > 1048.58 sec on osd.39
41 ops are blocked > 1048.58 sec on osd.48
4 ops are blocked > 1048.58 sec on osd.55
1 ops are blocked > 1048.58 sec on osd.57
5 osds have slow requests
pool rbd-b objects per pg (1084) is more than 12.1798 times cluster average (89)

I ran the following, but it did not help:

# ceph health detail | grep ^pg | cut -c4-9 | while read i; do ceph pg repair ${i} ; done
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 3.305 on osd.39 to repair

Also, if I run the following command, it just seems to hang:

rbd -p rbd-b info vm-50193-disk-1
<- hangs until I do CTRL-c...

Any help would be greatly appreciated!

Glen Aidukas
Manager IT Infrastructure
t: 610.813.2815
BehaviorMatrix, LLC | 676 Dekalb Pike, Suite 200, Blue Bell, PA, 19422
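
A note on the repair loop earlier in this message: grep ^pg matches each PG three times in the ceph health detail output (once each for the stuck inactive, stuck unclean, and incomplete lines), which is why every PG was instructed to repair three times. Below is a minimal sketch, assuming only POSIX sh, awk, and sort, of how the same output could be reduced to one line per PG and to a tally of how often each OSD appears in the acting sets. The helper names unique_pgs, osd_counts, and sample_output are hypothetical and not Ceph commands; the sample lines are copied from the output in this message.

```shell
#!/bin/sh
# Sketch only: summarise stuck PGs from `ceph health detail`-style text.
# unique_pgs, osd_counts and sample_output are hypothetical helpers.

# Print each PG id once; the "pg <id> is ..." lines repeat per condition.
unique_pgs() {
    awk '$1 == "pg" { print $2 }' | LC_ALL=C sort -u
}

# For each OSD, count how many distinct PGs list it in their acting set.
osd_counts() {
    awk '$1 == "pg" && match($0, /\[[0-9,]+\]/) {
             pg = $2
             if (pg in seen) next          # count every PG only once
             seen[pg] = 1
             # Text between the square brackets, e.g. "48,35,2"
             n = split(substr($0, RSTART + 1, RLENGTH - 2), osds, ",")
             for (i = 1; i <= n; i++) count[osds[i]]++
         }
         END { for (o in count) print count[o], "osd." o }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# A few lines copied from the health detail output in this message.
sample_output() {
    cat <<'EOF'
pg 3.38b is stuck inactive since forever, current state incomplete, last acting [48,35,2]
pg 3.57d is stuck inactive for 15676.967208, current state incomplete, last acting [55,48,2]
pg 3.38b is stuck unclean since forever, current state incomplete, last acting [48,35,2]
pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last acting [55,48,2]
EOF
}

sample_output | unique_pgs
sample_output | osd_counts
```

Against a live cluster, the output of ceph health detail would be piped into either function in place of sample_output. Applied to the seven PGs listed in this message, the tally puts osd.2 and osd.48 in every acting set, which may be a reasonable place to start looking.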