Hello,

Too late I see, but still...

On Tue, 6 Sep 2016 22:17:05 -0400 Shain Miley wrote:

> Hello,
>
> It looks like we had 2 OSDs fail at some point earlier today, here is
> the current status of the cluster:
>

You will really want to find out how and why that happened, because
while not impossible, two OSDs failing at the same time is pretty
improbable. Look for a common cause: hardware (are the OSDs on the
same host?), an OOM event, etc.

> root@rbd1:~# ceph -s
>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>      health HEALTH_WARN
>             2 pgs backfill
>             5 pgs backfill_toofull

Bad, you will want your OSDs back in, and then some. Have a look at
"ceph osd df" (see the P.S. below).

>             69 pgs backfilling
>             74 pgs degraded
>             1 pgs down
>             1 pgs peering

Not good either. Without bringing back your OSDs, that means doom for
the data on those PGs.

>             74 pgs stuck degraded
>             1 pgs stuck inactive
>             75 pgs stuck unclean
>             74 pgs stuck undersized
>             74 pgs undersized
>             recovery 1903019/105270534 objects degraded (1.808%)
>             recovery 1120305/105270534 objects misplaced (1.064%)
>             crush map has legacy tunables
>      monmap e1: 3 mons at
> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
>             election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>      osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
>       pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
>             285 TB used, 97367 GB / 380 TB avail
>             1903019/105270534 objects degraded (1.808%)
>             1120305/105270534 objects misplaced (1.064%)
>                 3893 active+clean
>                   69 active+undersized+degraded+remapped+backfilling
>                    6 active+clean+scrubbing
>                    3 active+undersized+degraded+remapped+backfill_toofull
>                    2 active+clean+scrubbing+deep

When in recovery/backfill situations, you always want to stop any and
all scrubbing (commands in the P.S.).

>                    2 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>                    1 down+peering
>   recovery io 248 MB/s, 84 objects/s
>
> We had been running for a while with 107 OSDs (not 108), it looks like
> OSDs 64 and 76 are both now down and out at this point.
>
> I have looked through the Ceph logs for each OSD and did not see
> anything obvious, the RAID controller also does not show the disk
> offline.

Get to the bottom of that; normally something gets logged when an OSD
fails (see the P.S. for where to look).

> I am wondering if I should try to restart the two OSDs that are
> showing as down... or should I wait until the current recovery is
> complete?

As said, try to restart them immediately, if only to keep the recovery
traffic down for starters.

> The pool has a replica level of '2'... and with 2 failed disks I want
> to do whatever I can to make sure there is not an issue with missing
> objects.

I sure hope that pool holds backups or something of that nature. The
only times a replication factor of 2 isn't a cry for Murphy to smite
you are with RAID-backed OSDs or VERY well monitored and vetted SSDs.

> Thanks in advance,
>
> Shain

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
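
P.S.: A few concrete commands for the points above. These are sketches
assuming a stock package install; adjust paths, OSD IDs, and unit
names to your setup.

To find out why the OSDs died, check their logs and the kernel log on
the host(s) carrying them (default /var/log/ceph location assumed):

  # OSD side: crashes usually leave asserts or I/O errors behind
  grep -iE 'error|fail|abort|assert' /var/log/ceph/ceph-osd.64.log
  grep -iE 'error|fail|abort|assert' /var/log/ceph/ceph-osd.76.log
  # kernel side: OOM kills and disk errors
  dmesg | grep -iE 'out of memory|oom|i/o error'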
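
To see which OSDs are behind the backfill_toofull states, look at the
%USE and VAR columns of:

  ceph osd df

If some OSDs are merely over the default backfill threshold (85%) and
not actually about to run full, you can raise it at runtime. This eats
into your safety margin, so revert it once the backfill is done; note
that injectargs changes do not persist across OSD restarts:

  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'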
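
To stop scrubbing for the duration of the recovery (scrubs already
running will finish, but no new ones get scheduled):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # once you are back to HEALTH_OK:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub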
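
To restart the down OSDs, run the variant matching your distribution
and Ceph version on the host(s) carrying osd.64 and osd.76:

  systemctl start ceph-osd@64       # systemd
  start ceph-osd id=64              # Upstart
  /etc/init.d/ceph start osd.64     # sysvinit

Then watch "ceph -w" to see whether they come up and stay up.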
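
And if you do decide to move that pool to 3 replicas once the cluster
is healthy again, expect a lot of data movement and check your
capacity with "ceph osd df" first; "yourpool" here is a stand-in for
the actual pool name:

  ceph osd pool set yourpool size 3
  ceph osd pool set yourpool min_size 2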