I restarted both OSD daemons and things are back to normal. I'm not sure why they failed in the first place, but I'll keep looking.

Thanks!

Shain

Sent from my iPhone

> On Sep 6, 2016, at 10:39 PM, lyt_yudi <lyt_yudi@xxxxxxxxxx> wrote:
>
> hi,
>
>> On Sep 7, 2016, at 10:17 AM, Shain Miley <smiley@xxxxxxx> wrote:
>>
>> Hello,
>>
>> It looks like we had 2 OSDs fail at some point earlier today. Here is the current status of the cluster:
>>
>> root@rbd1:~# ceph -s
>>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>>      health HEALTH_WARN
>>             2 pgs backfill
>>             5 pgs backfill_toofull
>>             69 pgs backfilling
>>             74 pgs degraded
>>             1 pgs down
>>             1 pgs peering
>>             74 pgs stuck degraded
>>             1 pgs stuck inactive
>>             75 pgs stuck unclean
>>             74 pgs stuck undersized
>>             74 pgs undersized
>>             recovery 1903019/105270534 objects degraded (1.808%)
>>             recovery 1120305/105270534 objects misplaced (1.064%)
>>             crush map has legacy tunables
>>      monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
>>             election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>>      osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
>>       pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
>>             285 TB used, 97367 GB / 380 TB avail
>>             1903019/105270534 objects degraded (1.808%)
>>             1120305/105270534 objects misplaced (1.064%)
>>                 3893 active+clean
>>                   69 active+undersized+degraded+remapped+backfilling
>>                    6 active+clean+scrubbing
>>                    3 active+undersized+degraded+remapped+backfill_toofull
>>                    2 active+clean+scrubbing+deep
>>                    2 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>>                    1 down+peering
>>   recovery io 248 MB/s, 84 objects/s
>>
>> We had been running for a while with 107 OSDs (not 108); it looks like OSDs 64 and 76 are both down and out at this point.
>>
>> I have looked through the ceph logs for each OSD and did not see anything obvious; the raid controller also does not show the disks as offline.
>>
>> I am wondering if I should try to restart the two OSDs that are showing as down... or should I wait until the current recovery is complete?
>
> If the disks are healthy, you don't need to wait for the recovery to complete; waiting would be dangerous...
>
>> The pool has a replica level of '2'... and with 2 failed disks I want to do whatever I can to make sure there is not an issue with missing objects.
>>
>> Thanks in advance,
>>
>> Shain
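
For anyone who lands on this thread in a similar state: before restarting anything, it is worth confirming exactly which OSDs the cluster considers down and which PGs they hold. A minimal sketch, assuming admin access on a monitor node:

    ceph osd tree | grep -i down      # which OSDs are down and where they sit in the CRUSH map
    ceph health detail                # per-PG detail behind the HEALTH_WARN summary
    ceph pg dump_stuck unclean        # the PGs that have been stuck unclean the longest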
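The restart itself depends on the init system in use; none of the variants below is specific to this cluster, so treat the OSD ids (64 and 76, taken from the thread) and the init flavors as examples:

    systemctl restart ceph-osd@64     # systemd-based installs
    restart ceph-osd id=64            # upstart (Ubuntu 14.04-era packages)
    /etc/init.d/ceph restart osd.64   # sysvinit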
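On the backfill_toofull PGs: backfill pauses toward any OSD whose utilization is above the backfill-full threshold (0.85 by default on the Hammer/Jewel releases current at the time of this thread, if memory serves). Checking per-OSD fill levels first, and raising the ratio temporarily only if the risk is understood, is one way to let those PGs proceed. A hedged sketch; the 0.90 value is an example, not a recommendation:

    ceph osd df                                                    # per-OSD utilization and variance
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'   # runtime change, pre-Luminous syntax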
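With a size-2 pool and two OSDs down, the failure mode to rule out is a PG whose only two copies lived on the failed disks. Checking explicitly for unfound objects, and querying the down+peering PG directly, shows whether anything is actually unrecoverable. The PG id 3.5f below is a made-up placeholder:

    ceph health detail | grep -i unfound   # objects the cluster knows about but cannot locate
    ceph pg dump_stuck inactive            # find the id of the down+peering PG
    ceph pg 3.5f query                     # placeholder id; shows what is blocking peering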
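As for the root cause: an OSD that dies while its disk stays online usually leaves an assert and backtrace at the end of its log, or was killed by the kernel. A few places to look, assuming default log paths; /dev/sdX is a placeholder for the actual device:

    tail -n 200 /var/log/ceph/ceph-osd.64.log   # look for an assert/backtrace from the time of failure
    dmesg | egrep -i 'error|i/o|oom'             # kernel-level I/O errors or the OOM killer
    smartctl -a /dev/sdX                         # SMART status of the underlying disk (placeholder device)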
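One last note: the "crush map has legacy tunables" warning in the status output is unrelated to the failures, but switching tunables profiles later will itself trigger heavy data movement, so it is worth knowing where the cluster stands before planning that change:

    ceph osd crush show-tunables      # current CRUSH tunables profile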