I restarted both OSD daemons and things are back to normal. I'm not sure why they failed in the first place, but I'll keep looking.

Thanks!

Shain

Sent from my iPhone

> On Sep 6, 2016, at 10:39 PM, lyt_yudi <lyt_yudi@xxxxxxxxxx> wrote:
>
> hi,
>
>> On Sep 7, 2016, at 10:17 AM, Shain Miley <smiley@xxxxxxx> wrote:
>>
>> Hello,
>>
>> It looks like we had 2 OSDs fail at some point earlier today. Here is the current status of the cluster:
>>
>> root@rbd1:~# ceph -s
>>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>>      health HEALTH_WARN
>>             2 pgs backfill
>>             5 pgs backfill_toofull
>>             69 pgs backfilling
>>             74 pgs degraded
>>             1 pgs down
>>             1 pgs peering
>>             74 pgs stuck degraded
>>             1 pgs stuck inactive
>>             75 pgs stuck unclean
>>             74 pgs stuck undersized
>>             74 pgs undersized
>>             recovery 1903019/105270534 objects degraded (1.808%)
>>             recovery 1120305/105270534 objects misplaced (1.064%)
>>             crush map has legacy tunables
>>      monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
>>             election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>>      osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
>>       pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
>>             285 TB used, 97367 GB / 380 TB avail
>>             1903019/105270534 objects degraded (1.808%)
>>             1120305/105270534 objects misplaced (1.064%)
>>                 3893 active+clean
>>                   69 active+undersized+degraded+remapped+backfilling
>>                    6 active+clean+scrubbing
>>                    3 active+undersized+degraded+remapped+backfill_toofull
>>                    2 active+clean+scrubbing+deep
>>                    2 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>>                    1 down+peering
>>   recovery io 248 MB/s, 84 objects/s
>>
>> We had been running for a while with 107 OSDs (not 108); it looks like OSDs 64 and 76 are both down and out at this point.
>>
>> I have looked through the ceph logs for each OSD and did not see anything obvious; the raid controller also does not show the disks as offline.
>>
>> I am wondering if I should try to restart the two OSDs that are showing as down... or should I wait until the current recovery is complete?
>
> If the disks are healthy, you don't need to wait for the recovery to complete; waiting would be dangerous...
>
>> The pool has a replica level of '2'... and with 2 failed disks I want to do whatever I can to make sure there is not an issue with missing objects.
>>
>> Thanks in advance,
>>
>> Shain
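
For anyone who lands on this thread in a similar state: before restarting anything, it is worth confirming exactly which OSDs the cluster considers down and which PGs they hold. A minimal sketch, assuming admin access on a monitor node:

    ceph osd tree | grep -i down      # which OSDs are down and where they sit in the CRUSH map
    ceph health detail                # per-PG detail behind the HEALTH_WARN summary
    ceph pg dump_stuck unclean        # the PGs that have been stuck unclean the longest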
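The restart itself depends on the init system in use; none of the variants below is specific to this cluster, so treat the OSD ids (64 and 76, taken from the thread) and the init flavors as examples:

    systemctl restart ceph-osd@64     # systemd-based installs
    restart ceph-osd id=64            # upstart (Ubuntu 14.04-era packages)
    /etc/init.d/ceph restart osd.64   # sysvinit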
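On the backfill_toofull PGs: backfill pauses toward any OSD whose utilization is above the backfill-full threshold (0.85 by default on the Hammer/Jewel releases current at the time of this thread, if memory serves). Checking per-OSD fill levels first, and raising the ratio temporarily only if the risk is understood, is one way to let those PGs proceed. A hedged sketch; the 0.90 value is an example, not a recommendation:

    ceph osd df                                                    # per-OSD utilization and variance
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'   # runtime change, pre-Luminous syntax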
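With a size-2 pool and two OSDs down, the failure mode to rule out is a PG whose only two copies lived on the failed disks. Checking explicitly for unfound objects, and querying the down+peering PG directly, shows whether anything is actually unrecoverable. The PG id 3.5f below is a made-up placeholder:

    ceph health detail | grep -i unfound   # objects the cluster knows about but cannot locate
    ceph pg dump_stuck inactive            # find the id of the down+peering PG
    ceph pg 3.5f query                     # placeholder id; shows what is blocking peering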
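As for the root cause: an OSD that dies while its disk stays online usually leaves an assert and backtrace at the end of its log, or was killed by the kernel. A few places to look, assuming default log paths; /dev/sdX is a placeholder for the actual device:

    tail -n 200 /var/log/ceph/ceph-osd.64.log   # look for an assert/backtrace from the time of failure
    dmesg | egrep -i 'error|i/o|oom'             # kernel-level I/O errors or the OOM killer
    smartctl -a /dev/sdX                         # SMART status of the underlying disk (placeholder device)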
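One last note: the "crush map has legacy tunables" warning in the status output is unrelated to the failures, but switching tunables profiles later will itself trigger heavy data movement, so it is worth knowing where the cluster stands before planning that change:

    ceph osd crush show-tunables      # current CRUSH tunables profile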