Hello,

On Wed, 7 Sep 2016 08:38:24 -0400 Shain Miley wrote:

> Well not entirely too late I guess :-(
>
Then re-read my initial reply and see if you can find something in other
logs (syslog/kernel) to explain this.
As well as whether those OSDs are all on the same node, maybe have missed
their upgrade, etc.

> I woke up this morning to see that two OTHER osd's had been marked down
> and out.
>
> I again restarted the osd daemons and things seem to be ok at this point.
>
Did you verify that they were still running at that time (ps)?
Also did you look at ceph.log on a MON node to see what their view of this
was?

> I agree that I need to get to the bottom of why this happened.
>
> I have uploaded the log files from 1 of the downed osd's here:
>
> http://filebin.ca/2uFoRw017TCD/ceph-osd.51.log.1
> http://filebin.ca/2uFosTO8oHmj/ceph-osd.51.log
>
These are very sparse, much sparser than what I see with default
parameters when I do a restart.

The three heartbeat check lines don't look good at all and are likely the
reason this is happening (other OSDs voting it down).

> You can see my osd restart at about 6:15 am this morning....other than
> that I don't see anything indicated in the log files (although I could
> be missing it for sure).
>
See above: check ceph.log for when that OSD was declared down, which would
be around/after 02:03:08 going by the OSD log.
What's happening at that time from the perspective of the rest of the
cluster?
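Something along these lines (untested; assumes the default log location on
the MON node and uses osd.51 plus the 2016-09-07 02:0x window from your OSD
log as examples -- adjust the OSD IDs and timestamps to what you actually
saw):

  # On a MON node: what did the cluster log about that OSD being marked down?
  # (default log location; adjust if yours differs)
  grep 'osd.51' /var/log/ceph/ceph.log | grep -Ei 'fail|down|boot'

  # Narrow it down to the window around the heartbeat failures
  grep '2016-09-07 02:0' /var/log/ceph/ceph.log | less

  # On the OSD host: check whether the ceph-osd processes are (still) running
  ps aux | grep '[c]eph-osd'

If the process was still running when the MONs marked it down, that points
at missed heartbeats (network or an overloaded node) rather than a crashed
daemon.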
> Just an FYI we are currently running ceph version 0.94.9, which I
> upgraded to at the end of last week (from 0.94.6 I think)
>
I only have 0.94.9 on my test cluster, but not much action there obviously.
But if this were a regression of sorts, one would think others might
encounter it, too.

Christian

> This cluster is about 2 or 3 years old at this point and we have not run
> into this issue at all up to this point.
>
> Thanks,
>
> Shain
>
>
> On 09/07/2016 12:00 AM, Christian Balzer wrote:
> > Hello,
> >
> > Too late I see, but still...
> >
> > On Tue, 6 Sep 2016 22:17:05 -0400 Shain Miley wrote:
> >
> >> Hello,
> >>
> >> It looks like we had 2 osd's fail at some point earlier today, here is
> >> the current status of the cluster:
> >>
> > You will really want to find out how and why that happened, because while
> > not impossible this is pretty improbable.
> >
> > Something like HW, are the OSDs on the same host, or maybe an OOM event,
> > etc.
> >
> >> root@rbd1:~# ceph -s
> >>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
> >>      health HEALTH_WARN
> >>             2 pgs backfill
> >>             5 pgs backfill_toofull
> > Bad, you will want your OSDs back in and then some.
> > Have a look at "ceph osd df".
> >
> >>             69 pgs backfilling
> >>             74 pgs degraded
> >>             1 pgs down
> >>             1 pgs peering
> > Not good either.
> > W/o bringing back your OSDs that means doom for the data on those PGs.
> >
> >>             74 pgs stuck degraded
> >>             1 pgs stuck inactive
> >>             75 pgs stuck unclean
> >>             74 pgs stuck undersized
> >>             74 pgs undersized
> >>             recovery 1903019/105270534 objects degraded (1.808%)
> >>             recovery 1120305/105270534 objects misplaced (1.064%)
> >>             crush map has legacy tunables
> >>      monmap e1: 3 mons at
> >> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0}
> >>             election epoch 282, quorum 0,1,2 hqceph1,hqceph2,hqceph3
> >>      osdmap e25019: 108 osds: 105 up, 105 in; 74 remapped pgs
> >>       pgmap v30721368: 3976 pgs, 17 pools, 144 TB data, 51401 kobjects
> >>             285 TB used, 97367 GB / 380 TB avail
> >>             1903019/105270534 objects degraded (1.808%)
> >>             1120305/105270534 objects misplaced (1.064%)
> >>                 3893 active+clean
> >>                   69 active+undersized+degraded+remapped+backfilling
> >>                    6 active+clean+scrubbing
> >>                    3 active+undersized+degraded+remapped+backfill_toofull
> >>                    2 active+clean+scrubbing+deep
> > When in recovery/backfill situations, you always want to stop any and all
> > scrubbing.
> >
> >>                    2
> >> active+undersized+degraded+remapped+wait_backfill+backfill_toofull
> >>                    1 down+peering
> >>       recovery io 248 MB/s, 84 objects/s
> >>
> >> We had been running for a while with 107 osd's (not 108), it looks like
> >> osd's 64 and 76 are both now down and out at this point.
> >>
> >>
> >> I have looked through the ceph logs for each osd and did not see anything
> >> obvious, the raid controller also does not show the disk offline.
> >>
> > Get to the bottom of that, normally something gets logged when an OSD
> > fails.
> >
> >> I am wondering if I should try to restart the two osd's that are showing
> >> as down...or should I wait until the current recovery is complete?
> >>
> > As said, try to restart immediately, just to keep the traffic down for
> > starters.
> >
> >> The pool has a replica level of '2'...and with 2 failed disks I want to
> >> do whatever I can to make sure there is not an issue with missing objects.
> >>
> > I sure hope that pool holds backups or something of that nature.
> >
> > The only times when a replica of 2 isn't a cry for Murphy to smite you is
> > with RAID backed OSDs or VERY well monitored and vetted SSDs.
> >
> >> Thanks in advance,
> >>
> >> Shain
> >>
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com