Weird cluster restart behavior

I'm working on redeploying a 14-node cluster. I'm running Giant (0.87.1). Last Friday I got everything deployed and all was working well, so I set noout and shut all the OSD nodes down over the weekend. Yesterday when I spun it back up, the OSDs were behaving very strangely, incorrectly marking each other down because of missed heartbeats, even though they were up. It looked like some kind of low-level networking problem, but I couldn't find any.
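For context, the planned-downtime procedure referred to above is the standard one; a minimal sketch, assuming the usual `ceph` admin CLI (this is a cluster-ops fragment, not something runnable outside a live cluster):

```shell
# Tell the monitors not to mark stopped OSDs "out",
# so no rebalancing happens during the planned downtime
ceph osd set noout

# ... shut down / power off the OSD nodes for the weekend ...

# After the nodes are back up and the OSDs have rejoined,
# clear the flag so normal out-marking resumes
ceph osd unset noout
```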

After much work, I narrowed the apparent source of the problem down to the OSDs running on the first host I started in the morning. They were the ones that logged the most messages about not being able to ping other OSDs, and the other OSDs were mostly complaining about them. After running out of other ideas to try, I restarted them, and then everything started working. It's still working happily this morning. It seems as though when that set of OSDs started they got stale OSD map information from the MON boxes, which failed to be updated as the other OSDs came up. Does that make sense? I still don't consider myself an expert on ceph architecture and would appreciate any corrections or other possible interpretations of events (I'm happy to provide whatever additional information I can) so I can get a deeper understanding of things. If my interpretation of events is correct, it seems that might point at a bug.
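If the stale-osdmap theory is right, it should be checkable next time: each OSD daemon reports the osdmap epochs it holds through its admin socket, and that can be compared against the cluster's current epoch. A hedged sketch, assuming the standard `ceph` admin-socket interface (the osd ID is an example, and exact field names can vary by release; this needs a live cluster to run):

```shell
# Current osdmap epoch as the monitors see it
# (the first line of the dump reads "epoch N")
ceph osd dump | head -1

# Epochs a specific OSD daemon actually holds; run on that OSD's host.
# The JSON output includes "oldest_map" and "newest_map" -- if
# "newest_map" lags far behind the monitors' epoch, that OSD is
# operating on a stale map.
ceph daemon osd.3 status
```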

QH
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
