Re: Weird cluster restart behavior

Thanks for the extra info, Gregory. I did not also set nodown.

I expect that I will very rarely be shutting everything down in the normal course of things, but it has come up a couple of times when doing some physical re-organizing of racks. Little irritants like this aren't a big deal if people know to expect them, but as it was I lost quite a lot of time troubleshooting a non-existent problem. What's the best way to get notes to that effect added to the docs? It seems something in http://ceph.com/docs/master/rados/operations/operating/ would save some people some headache. I'm happy to propose edits, but a quick look doesn't reveal a process for submitting that sort of thing.

My understanding is that the "right" way to take an entire cluster offline is to set noout and then shut everything down. Is there a better way?
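For the record, the sequence I've been using is roughly the following (a sketch; "service ceph stop" is the sysvinit form, so substitute whatever your init system uses):

    ceph osd set noout     # on a monitor node, before stopping anything
    service ceph stop      # on each node in turn, then power it off

If I'm missing a step I'd love to hear about it.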

QH

On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last
> Friday I got everything deployed and all was working well, and I set noout
> and shut all the OSD nodes down over the weekend. Yesterday when I spun it
> back up, the OSDs were behaving very strangely, incorrectly marking each
> other down because of missed heartbeats, even though they were up. It looked
> like some kind of low-level networking problem, but I couldn't find one.
>
> After much work, I narrowed the apparent source of the problem down to the
> OSDs running on the first host I started in the morning. They were the ones
> that logged the most messages about not being able to ping other OSDs,
> and the other OSDs were mostly complaining about them. After running out of
> other ideas to try, I restarted them, and then everything started working.
> It's still working happily this morning. It seems as though when that set of
> OSDs started they got stale OSD map information from the MON boxes, which
> failed to be updated as the other OSDs came up. Does that make sense? I
> still don't consider myself an expert on ceph architecture and would
> appreciate any corrections or other possible interpretations of events (I'm
> happy to provide whatever additional information I can) so I can get a
> deeper understanding of things. If my interpretation of events is correct,
> it seems that might point at a bug.
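
(If you want to test the stale-map theory next time, compare the
cluster's current osdmap epoch against what a suspect daemon reports
via its admin socket; a sketch, using osd.0 as a stand-in and running
the daemon command on the node hosting it:

    ceph osd dump | head -1     # first line is "epoch N", the current osdmap epoch
    ceph daemon osd.0 status    # reports the oldest_map/newest_map epochs that daemon holds

If newest_map sits well behind the cluster's epoch, that daemon really
is operating on stale maps.)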

I can't find the ticket now, but I think we did indeed have a bug
around heartbeat failures when restarting nodes. This has been fixed
in other branches but might have been missed for giant. (Did you by
any chance set the nodown flag as well as noout?)
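
You can check which flags are currently set with e.g.

    ceph osd dump | grep ^flags

and they should also show up on the osdmap line of "ceph -s" output.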

In general Ceph isn't very happy about being shut down completely like
that, and its behavior in that situation isn't well validated. Nothing
will go seriously wrong, but you may run into little irritants like
this, particularly when you're prohibiting state changes with the
noout/nodown flags.
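
The usual habit is to clear the flags as soon as everything is back up:

    ceph osd unset noout
    ceph osd unset nodown

so that the cluster can go back to marking OSDs down and out on its own.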
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
