Re: 2 osd failures


hi,

> On 7 Sep 2016, at 10:17, Shain Miley <smiley@xxxxxxx> wrote:
> 
> Hello,
> 
> It looks like we had 2 osd's fail at some point earlier today, here is the current status of the cluster:
> 
> root@rbd1:~# ceph -s
>    cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>     health HEALTH_WARN
>            2 pgs backfill
>            5 pgs backfill_toofull
>            69 pgs backfilling
>            74 pgs degraded
>            1 pgs down
>            1 pgs peering
>            74 pgs stuck degraded
>            1 pgs stuck inactive
>            75 pgs stuck unclean
>            74 pgs stuck undersized
>            74 pgs undersized
>            recovery 1903019/105270534 objects degraded (1.808%)
>            recovery 1120305/105270534 objects misplaced (1.064%)
>            crush map has legacy tunables
>     monmap e1: 3 mons at {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
>            election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>     osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
>      pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
>            285 TB used, 97367 GB / 380 TB avail
>            1903019/105270534 objects degraded (1.808%)
>            1120305/105270534 objects misplaced (1.064%)
>                3893 active+clean
>                  69 active+undersized+degraded+remapped+backfilling
>                   6 active+clean+scrubbing
>                   3 active+undersized+degraded+remapped+backfill_toofull
>                   2 active+clean+scrubbing+deep
>                   2 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>                   1 down+peering
> recovery io 248 MB/s, 84 objects/s
> 
> We had been running for a while with 107 osd's (not 108), it looks like osd's 64 and 76 are both now down and out at this point.
> 
> 
> I have looked through the ceph logs for each osd and did not see anything obvious; the raid controller also does not show the disks offline.
> 
> I am wondering if I should try to restart the two osd's that are showing as down...or should I wait until the current recovery is complete?

If the disks are healthy, you don't need to wait for the recovery to complete; restart those two OSDs now. With a pool at size 2, sitting through a long backfill while those objects have only a single remaining copy would be dangerous.
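If you do bring them back, a rough sketch of the steps (assuming systemd-managed OSDs; the unit names and the exact start command vary by deployment and Ceph version, so adapt to yours):

```shell
# Assumption: systemd-managed OSDs named ceph-osd@<id>; older setups may
# instead use "service ceph start osd.64" or the /etc/init.d/ceph script.
systemctl start ceph-osd@64
systemctl start ceph-osd@76

# Confirm they come back up/in, then watch recovery drain:
ceph osd tree | grep -E 'osd\.(64|76)'
ceph -s

# Your status also shows backfill_toofull PGs, i.e. some target OSDs are
# past the backfill full ratio. If that blocks recovery, the ratio can be
# raised temporarily (use with care, and lower it again afterwards):
ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'
```

Once both OSDs are up and in, the degraded and misplaced counts should start falling and the down+peering PG should recover its second copy.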

> 
> The pool has a replica level of  '2'...and with 2 failed disks I want to do whatever I can to make sure there is not an issue with missing objects.
> 
> Thanks in advance,
> 
> Shain
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

