Re: Weird cluster restart behavior

On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> Thanks for the extra info, Gregory. No, I did not also set nodown.
>
> I expect that I will very rarely be shutting everything down in the normal
> course of things, but it has come up a couple times when having to do some
> physical re-organizing of racks. Little irritants like this aren't a big
> deal if people know to expect them, but as it is I lost quite a lot of time
> troubleshooting a non-existent problem. What's the best way to get notes to
> that effect added to the docs? It seems something in
> http://ceph.com/docs/master/rados/operations/operating/ would save some
> people some headache. I'm happy to propose edits, but a quick look doesn't
> reveal a process for submitting that sort of thing.

Github pull requests. :)
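
The docs live in the main ceph repo, so roughly (from memory, so
double-check the paths): fork https://github.com/ceph/ceph, edit the
.rst sources under doc/ (the page you linked should correspond to
doc/rados/operations/operating.rst), and open a pull request against
master. Something like:

    git clone https://github.com/<your-username>/ceph.git
    cd ceph
    git checkout -b doc-full-cluster-restart
    # edit doc/rados/operations/operating.rst
    git commit -as -m "doc: note noout behavior on full-cluster restart"
    git push origin doc-full-cluster-restart
    # then open the pull request via the Github web UI

(The -s adds the Signed-off-by line the project asks for on commits.)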

>
> My understanding is that the "right" method to take an entire cluster
> offline is to set noout and then shut everything down. Is there a
> better way?

That's probably the best way to do it. Like I said, there was also a
bug here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords
as I wasn't involved in the fix.
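
For reference, the whole dance looks something like this (from memory,
so double-check against the docs for your release):

    # before shutting down: stop OSDs from being marked out
    ceph osd set noout
    # optionally also freeze up/down state changes
    ceph osd set nodown
    # ...power off the OSD nodes, do the rack work, power back on...
    # the flags show up in the status output while they're set
    ceph -s
    # once everything is back up and in, clear the flags
    ceph osd unset nodown
    ceph osd unset noout

And if you suspect an OSD came back up with a stale map, you can
compare its epoch against the cluster's (run the daemon command on the
host where that OSD lives):

    ceph daemon osd.0 status   # look at newest_map
    ceph osd stat              # current osdmap epoch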
-Greg

>
> QH
>
> On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
>> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>> > I'm working on redeploying a 14-node cluster. I'm running Giant
>> > 0.87.1. Last Friday I got everything deployed and all was working
>> > well, and I set noout and shut all the OSD nodes down over the
>> > weekend. Yesterday when I spun it back up, the OSDs were behaving
>> > very strangely, incorrectly marking each other down because of
>> > missed heartbeats, even though they were up. It looked like some
>> > kind of low-level networking problem, but I couldn't find any.
>> >
>> > After much work, I narrowed the apparent source of the problem
>> > down to the OSDs running on the first host I started in the
>> > morning. They were the ones that logged the most messages about
>> > not being able to ping other OSDs, and the other OSDs were mostly
>> > complaining about them. After running out of other ideas to try,
>> > I restarted them, and then everything started working. It's still
>> > working happily this morning. It seems as though when that set of
>> > OSDs started, they got stale OSD map information from the MON
>> > boxes, which failed to be updated as the other OSDs came up. Does
>> > that make sense? I still don't consider myself an expert on Ceph
>> > architecture and would appreciate any corrections or other
>> > possible interpretations of events (I'm happy to provide whatever
>> > additional information I can) so I can get a deeper understanding
>> > of things. If my interpretation of events is correct, it seems
>> > that might point at a bug.
>>
>> I can't find the ticket now, but I think we did indeed have a bug
>> around heartbeat failures when restarting nodes. This has been fixed
>> in other branches but might have been missed for Giant. (Did you by
>> any chance set the nodown flag as well as noout?)
>>
>> In general Ceph isn't very happy about being shut down completely
>> like that, and its behavior in that scenario isn't well validated,
>> so nothing will go seriously wrong but you might run into little
>> irritants like this. They're particularly likely when you're
>> prohibiting state changes with the noout/nodown flags.
>> -Greg
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



