Hello,

On Wed, 7 Sep 2016 08:38:24 -0400 Shain Miley wrote:

> Well not entirely too late I guess :-(
>
Then re-read my initial reply and see if you can find something in other
logs (syslog/kernel) to explain this.
As well as whether those OSDs are all on the same node, maybe have missed
their upgrade, etc.

> I woke up this morning to see that two OTHER osd's had been marked down
> and out.
>
> I again restarted the osd daemons and things seem to be ok at this point.
>
Did you verify that they were still running at that time (ps)?
Also did you look at ceph.log on a MON node to see what their view of this
was?

> I agree that I need to get to the bottom of why this happened.
>
> I have uploaded the log files from 1 of the downed osd's here:
>
> http://filebin.ca/2uFoRw017TCD/ceph-osd.51.log.1
> http://filebin.ca/2uFosTO8oHmj/ceph-osd.51.log
>
These are very sparse, much sparser than what I see with default
parameters when I do a restart.

The three heartbeat check lines don't look good at all and are likely the
reason this is happening (other OSDs voting it down).

> You can see my osd restart at about 6:15 am this morning....other than
> that I don't see anything indicated in the log files (although I could
> be missing it for sure).
>
See above: check ceph.log for when that OSD was declared down, which would
be around/after 02:03:08 going by the OSD log.
What's happening at that time from the perspective of the rest of the
cluster?
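Something along these lines (untested; assumes the default log location on
the MON node and uses osd.51 plus the 2016-09-07 02:0x window from your OSD
log as examples -- adjust the OSD IDs and timestamps to what you actually
saw):

  # On a MON node: what did the cluster log about that OSD being marked down?
  # (default log location; adjust if yours differs)
  grep 'osd.51' /var/log/ceph/ceph.log | grep -Ei 'fail|down|boot'

  # Narrow it down to the window around the heartbeat failures
  grep '2016-09-07 02:0' /var/log/ceph/ceph.log | less

  # On the OSD host: check whether the ceph-osd processes are (still) running
  ps aux | grep '[c]eph-osd'

If the process was still running when the MONs marked it down, that points
at missed heartbeats (network or an overloaded node) rather than a crashed
daemon.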
> Just an FYI we are currently running ceph version 0.94.9, which I
> upgraded to at the end of last week (from 0.94.6 I think)
>
I only have 0.94.9 on my test cluster, but not much action there obviously.
But if this were a regression of sorts, one would think others might
encounter it, too.

Christian

> This cluster is about 2 or 3 years old at this point and we have not run
> into this issue at all up to this point.
>
> Thanks,
>
> Shain
>
>
> On 09/07/2016 12:00 AM, Christian Balzer wrote:
> > Hello,
> >
> > Too late I see, but still...
> >
> > On Tue, 6 Sep 2016 22:17:05 -0400 Shain Miley wrote:
> >
> >> Hello,
> >>
> >> It looks like we had 2 osd's fail at some point earlier today, here is
> >> the current status of the cluster:
> >>
> > You will really want to find out how and why that happened, because while
> > not impossible this is pretty improbable.
> >
> > Something like HW, are the OSDs on the same host, or maybe an OOM event,
> > etc.
> >
> >> root@rbd1:~# ceph -s
> >>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
> >>      health HEALTH_WARN
> >>             2 pgs backfill
> >>             5 pgs backfill_toofull
> > Bad, you will want your OSDs back in and then some.
> > Have a look at "ceph osd df".
> >
> >>             69 pgs backfilling
> >>             74 pgs degraded
> >>             1 pgs down
> >>             1 pgs peering
> > Not good either.
> > W/o bringing back your OSDs that means doom for the data on those PGs.
> >
> >>             74 pgs stuck degraded
> >>             1 pgs stuck inactive
> >>             75 pgs stuck unclean
> >>             74 pgs stuck undersized
> >>             74 pgs undersized
> >>             recovery 1903019/105270534 objects degraded (1.808%)
> >>             recovery 1120305/105270534 objects misplaced (1.064%)
> >>             crush map has legacy tunables
> >>      monmap e1: 3 mons at
> >> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
> >>             election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
> >>      osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
> >>       pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
> >>             285 TB used, 97367 GB / 380 TB avail
> >>             1903019/105270534 objects degraded (1.808%)
> >>             1120305/105270534 objects misplaced (1.064%)
> >>                 3893 active+clean
> >>                   69 active+undersized+degraded+remapped+backfilling
> >>                    6 active+clean+scrubbing
> >>                    3 active+undersized+degraded+remapped+backfill_toofull
> >>                    2 active+clean+scrubbing+deep
> > When in recovery/backfill situations, you always want to stop any and all
> > scrubbing.
> >
> >>                    2
> >> active+undersized+degraded+remapped+wait_backfill+backfill_toofull
> >>                    1 down+peering
> >>       recovery io 248 MB/s, 84 objects/s
> >>
> >> We had been running for a while with 107 osd's (not 108), it looks like
> >> osd's 64 and 76 are both now down and out at this point.
> >>
> >>
> >> I have looked through the ceph logs for each osd and did not see anything
> >> obvious, the raid controller also does not show the disk offline.
> >>
> > Get to the bottom of that, normally something gets logged when an OSD
> > fails.
> >
> >> I am wondering if I should try to restart the two osd's that are showing
> >> as down...or should I wait until the current recovery is complete?
> >>
> > As said, try to restart immediately, just to keep the traffic down for
> > starters.
> >
> >> The pool has a replica level of '2'...and with 2 failed disks I want to
> >> do whatever I can to make sure there is not an issue with missing objects.
> >>
> > I sure hope that pool holds backups or something of that nature.
> >
> > The only times when a replica of 2 isn't a cry for Murphy to smite you is
> > with RAID backed OSDs or VERY well monitored and vetted SSDs.
> >
> >> Thanks in advance,
> >>
> >> Shain
> >>
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com