On Mon, 26 Aug 2013, Travis Rhoden wrote:
> Hi Sage,
>
> Thanks for the response.  I noticed that as well, and suspected
> hostname/DHCP/DNS shenanigans.  What's weird is that all nodes are
> identically configured.  I also have monitors running on n0 and n12, and
> they come up fine, every time.
>
> Here are the mon lines from ceph.conf:
>
> mon_initial_members = n0, n12, n24
> mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
>
> Just to test /etc/hosts and name resolution...
>
> root@n24:~# getent hosts n24
> 10.0.1.24       n24
> root@n24:~# hostname -s
> n24
>
> The only loopback entry in /etc/hosts is "127.0.0.1 localhost", so
> that should be fine.
>
> Upon rebooting this node, I've had the monitor come up okay maybe once out
> of 12 tries.  So it appears to be some kind of race...  No clue what is
> going on.  If I stop and start the monitor (or restart it), that doesn't
> appear to change anything.
>
> However, on the topic of races, I am having one other, more pressing issue.
> Each OSD host has its hostname assigned via DHCP.  Until that assignment is
> made (during init), the hostname is "localhost", and then it switches over
> to "n<x>", for some node number.  The issue I am seeing is that there is a
> race between this hostname assignment and the Ceph upstart scripts, such
> that sometimes ceph-osd starts while the hostname is still "localhost".
> This then causes the OSD location to change in the crush map, which is
> going to be a very bad thing. =)  When rebooting all my nodes at once
> (there are several dozen), about 50% move from being under n<x> to
> localhost.  Restarting all the ceph-osd jobs moves them back (because the
> hostname is defined by then).
>
> I'm wondering what kind of delay, or additional "start on" logic, I can
> add to the upstart script to work around this.

Hmm, this is beyond my upstart-fu, unfortunately.  This has come up before,
actually.  Previously we would wait for any interface to come up and then
start, but that broke on multi-NIC machines, and I ended up just making
things start in runlevel [2345].

James, do you know what should be done to make the job wait for *all*
network interfaces to be up?  Is that even the right solution here?

sage
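For the hostname race, a rough, untested sketch of the kind of pre-start
guard Travis is asking about follows; the file path and the 60-second
timeout are illustrative assumptions, not the stock ceph-osd upstart job:

    # Sketch only -- not the job shipped by the ceph package.
    # Added to the OSD upstart job (e.g. /etc/init/ceph-osd.conf):
    pre-start script
        # Wait for DHCP/hostname setup to replace the transient "localhost"
        # hostname, so the OSD registers under its real host in the crush map.
        i=0
        while [ "$(hostname -s)" = "localhost" ] && [ "$i" -lt 60 ]; do
            sleep 1
            i=$((i + 1))
        done
    end script

Editing the packaged job directly is fragile across package upgrades, so
this is more a way to confirm the race than a proper fix.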
> On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> Hi Travis,
>
> On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > Hey folks,
> >
> > I've just done a brand new install of 0.67.2 on a cluster of Calxeda
> > nodes.
> >
> > I have one particular monitor that never joins the quorum when I restart
> > the node.  Looks to me like it has something to do with the "create-keys"
> > task, which never seems to finish:
> >
> > root  1240     1  4 13:03 ?  00:00:02 /usr/bin/ceph-mon
> >       --cluster=ceph -i n24 -f
> > root  1244     1  0 13:03 ?  00:00:00 /usr/bin/python
> >       /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >
> > I don't see that task on my other monitors.  Additionally, that task is
> > periodically querying the monitor status:
> >
> > root  1240     1  2 13:03 ?  00:00:02 /usr/bin/ceph-mon
> >       --cluster=ceph -i n24 -f
> > root  1244     1  0 13:03 ?  00:00:00 /usr/bin/python
> >       /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > root  1982  1244 15 13:04 ?  00:00:00 /usr/bin/python
> >       /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> >       mon_status
> >
> > Checking that status myself, I see:
> >
> > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > { "name": "n24",
> >   "rank": 2,
> >   "state": "probing",
> >   "election_epoch": 0,
> >   "quorum": [],
> >   "outside_quorum": [
> >         "n24"],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >       "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> >       "modified": "2013-08-23 12:55:34.374650",
> >       "created": "0.000000",
> >       "mons": [
> >             { "rank": 0,
> >               "name": "n0",
> >               "addr": "10.0.1.0:6789\/0"},
> >             { "rank": 1,
> >               "name": "n12",
> >               "addr": "10.0.1.12:6789\/0"},
> >             { "rank": 2,
> >               "name": "n24",
> >               "addr": "0.0.0.0:6810\/0"}]}}
>                         ^^^^^^^^^^^^^^^^^^^^
>
> This is the problem.  I can't remember exactly what causes this, though.
> Can you verify that the host in the ceph.conf mon_host line matches the IP
> that is configured on the machine, and that /etc/hosts on the machine
> doesn't have a loopback address on it?
>
> Thanks!
> sage
>
> > Any ideas what is going on here?  I don't see anything useful in
> > /var/log/ceph/ceph-mon.n24.log
> >
> > Thanks,
> >
> > - Travis
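A minimal way to script the check Sage suggests, assuming the node is n24
and its mon_host entry is 10.0.1.24 (both taken from the thread above).  If
that address is not yet configured when ceph-mon starts, it would be
consistent with the 0.0.0.0 address recorded for n24 in the monmap:

    # Quick sanity check, run on n24 around the time the monitor starts.
    mon_addr=10.0.1.24
    if ip -4 -o addr show | awk '{print $4}' | grep -q "^${mon_addr}/"; then
        echo "${mon_addr} is configured on a local interface"
    else
        echo "${mon_addr} is not configured (yet)"
    fi
    # The short hostname should resolve to the mon address, not a 127.x one.
    getent hosts "$(hostname -s)"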
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com