On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> Thanks for the extra info Gregory. I did not also set nodown.
>
> I expect that I will be very rarely shutting everything down in the
> normal course of things, but it has come up a couple of times when
> having to do some physical re-organizing of racks. Little irritants
> like this aren't a big deal if people know to expect them, but as it
> is I lost quite a lot of time troubleshooting a non-existent problem.
> What's the best way to get notes to that effect added to the docs? It
> seems something in
> http://ceph.com/docs/master/rados/operations/operating/ would save
> some people some headache. I'm happy to propose edits, but a quick
> look doesn't reveal a process for submitting that sort of thing.

Github pull requests. :)

>
> My understanding is that the "right" method to take an entire cluster
> offline is to set noout and then shut everything down. Is there a
> better way?

That's probably the best way to do it. Like I said, there was also a
bug here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords,
as I wasn't involved in the fix.
-Greg

>
> QH
>
> On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
>> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>> > I'm working on redeploying a 14-node cluster. I'm running Giant
>> > 0.87.1. Last Friday I got everything deployed and all was working
>> > well, and I set noout and shut all the OSD nodes down over the
>> > weekend. Yesterday when I spun it back up, the OSDs were behaving
>> > very strangely, incorrectly marking each other down because of
>> > missed heartbeats, even though they were up. It looked like some
>> > kind of low-level networking problem, but I couldn't find any.
>> >
>> > After much work, I narrowed the apparent source of the problem down
>> > to the OSDs running on the first host I started in the morning.
>> > They were the ones that logged the most messages about not being
>> > able to ping other OSDs, and the other OSDs were mostly complaining
>> > about them. After running out of other ideas to try, I restarted
>> > them, and then everything started working. It's still working
>> > happily this morning. It seems as though when that set of OSDs
>> > started, they got stale OSD map information from the MON boxes,
>> > which failed to be updated as the other OSDs came up. Does that
>> > make sense? I still don't consider myself an expert on Ceph
>> > architecture and would appreciate any corrections or other possible
>> > interpretations of events (I'm happy to provide whatever additional
>> > information I can) so I can get a deeper understanding of things.
>> > If my interpretation of events is correct, it seems that might
>> > point at a bug.
>>
>> I can't find the ticket now, but I think we did indeed have a bug
>> around heartbeat failures when restarting nodes. This has been fixed
>> in other branches but might have been missed for Giant. (Did you by
>> any chance set the nodown flag as well as noout?)
>>
>> In general Ceph isn't very happy with being shut down completely like
>> that and its behaviors aren't validated, so nothing will go seriously
>> wrong, but you might find little irritants like this. It's
>> particularly likely when you're prohibiting state changes with the
>> noout/nodown flags.
>> -Greg
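
For reference, the whole-cluster shutdown discussed above comes down to
setting noout before powering anything off and clearing it once
everything is back up. The listing below is only a sketch of that
procedure; the service-management commands are assumptions (sysvinit
syntax shown; installs of that era might instead use upstart's
"stop ceph-osd-all"), so adapt them to whatever your hosts actually run.

    # On a monitor (or any admin) node, before powering anything off:
    ceph osd set noout        # prevent down OSDs from being marked "out"
    ceph osd stat             # confirm the flags line now shows "noout"

    # On each OSD host, stop the Ceph daemons, then power the host down
    # (sysvinit syntax, shown as an assumption):
    sudo /etc/init.d/ceph stop

    # ... do the physical work, power the hosts back on, let the
    # daemons start on boot ...

    # Once all OSDs report up again, clear the flag and verify health:
    ceph osd unset noout
    ceph -s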
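
For the kind of post-restart heartbeat trouble described upthread, a
few read-only checks help narrow things down before restarting anything.
Again only a sketch: osd.12 is a placeholder for one of the OSDs on the
suspect host, the restart syntax is a sysvinit assumption, and the
admin-socket "status" fields are based on later releases, so the output
may differ slightly on Giant.

    # Which OSDs are being reported down, and by whom:
    ceph health detail

    # The current OSD map epoch as the monitors see it:
    ceph osd stat

    # On the suspect host, ask the daemon which map epochs it holds:
    sudo ceph daemon osd.12 status

    # If newest_map lags well behind the epoch from "ceph osd stat",
    # the daemon is working from stale map information; restarting it
    # makes it fetch current maps from the monitors as it boots:
    sudo /etc/init.d/ceph restart osd.12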