Cool. So far I have tried:
start on (local-filesystems and net-device-up IFACE=eth0)
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)
About to try:
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1 and started network-services)
The "local-filesystems" + network device is billed as an alternative to runlevel if you need to to do something *after* networking...start on (local-filesystems and net-device-up IFACE=eth0)
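If one of these works out, it is probably worth putting it in an Upstart override file rather than editing the packaged job, so the change survives package upgrades. A minimal sketch, assuming the job being gated is the packaged ceph-all job (substitute whichever ceph job you actually want to hold back):

    # /etc/init/ceph-all.override
    # Only the "start on" stanza is replaced; everything else in the
    # packaged /etc/init/ceph-all.conf job definition still applies.
    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1)
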
On Mon, Aug 26, 2013 at 2:31 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
On Mon, 26 Aug 2013, Travis Rhoden wrote:
> Hi Sage,
>
> Thanks for the response. I noticed that as well, and suspected
> hostname/DHCP/DNS shenanigans. What's weird is that all nodes are
> identically configured. I also have monitors running on n0 and n12, and
> they come up fine, every time.
>
> Here's the mon_host line from ceph.conf:
>
> mon_initial_members = n0, n12, n24
> mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
>
> just to test /etc/hosts and name resolution...
>
> root@n24:~# getent hosts n24
> 10.0.1.24 n24
> root@n24:~# hostname -s
> n24
>
> The only loopback device in /etc/hosts is "127.0.0.1 localhost", so
> that should be fine.
>
> Upon rebooting this node, I've had the monitor come up okay once, maybe out
> of 12 tries. So it appears to be some kind of race... No clue what is
> going on. If I stop and start the monitor (or restart), it doesn't appear
> to change anything.
>
> However, on the topic of races, I am having one other, more pressing issue.
> Each OSD host has its hostname assigned via DHCP.  Until that
> assignment is made (during init), the hostname is "localhost", and then it
> switches over to "n<x>", for some node number. The issue I am seeing is
> that there is a race between this hostname assignment and the Ceph Upstart
> scripts, such that sometimes ceph-osd starts while the hostname is still
> 'localhost'. This then causes the osd location to change in the crushmap,
> which is going to be a very bad thing. =) When rebooting all my nodes at
> once (there are several dozen), about 50% move from being under n<x> to
> localhost. Restarting all the ceph-osd jobs moves them back (because the
> hostname is defined).
>
> I'm wondering what kind of delay, or additional "start-on" logic I can add
> to the upstart script to work around this.

Hmm, this is beyond my upstart-fu, unfortunately.  This has come up
before, actually.  Previously we would wait for any interface to come up
and then start, but that broke with multi-nic machines, and I ended up
just making things start in runlevel [2345].
James, do you know what should be done to make the job wait for *all*
network interfaces to be up? Is that even the right solution here?
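
One (untested) idea in the meantime: a small one-shot task that just blocks
until DHCP has replaced the placeholder hostname, which the ceph jobs could
then be gated on. This is only a sketch -- the wait-for-hostname name, the
eth0 trigger, and the 60-second cap are all made up for illustration:

    # /etc/init/wait-for-hostname.conf   (illustrative name)
    description "block until the DHCP-assigned hostname has replaced 'localhost'"

    start on (local-filesystems and net-device-up IFACE=eth0)
    task

    script
        # Poll for up to 60 seconds, then give up rather than hang the boot.
        i=0
        while [ "$(hostname -s)" = "localhost" ] && [ "$i" -lt 60 ]; do
            sleep 1
            i=$((i + 1))
        done
    end script

Since it is a task, its "stopped wait-for-hostname" event fires once the
script exits, so a condition like

    start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1 and stopped wait-for-hostname)

would keep the gated job from starting until the hostname has settled (or the
timeout expires).
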
sage
>
>
> On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> Hi Travis,
>
> On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > Hey folks,
> >
> > I've just done a brand new install of 0.67.2 on a cluster of Calxeda nodes.
> >
> > I have one particular monitor that never joins the quorum when I restart
> > the node.  Looks to me like it has something to do with the "create-keys"
> > task, which never seems to finish:
> >
> > root      1240     1  4 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >
> > I don't see that task on my other monitors.  Additionally, that task is
> > periodically querying the monitor status:
> >
> > root      1240     1  2 13:03 ?        00:00:02 /usr/bin/ceph-mon --cluster=ceph -i n24 -f
> > root      1244     1  0 13:03 ?        00:00:00 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > root      1982  1244 15 13:04 ?        00:00:00 /usr/bin/python /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> >
> > Checking that status myself, I see:
> >
> > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok mon_status
> > { "name": "n24",
> > "rank": 2,
> > "state": "probing",
> > "election_epoch": 0,
> > "quorum": [],
> > "outside_quorum": [
> > "n24"],
> > "extra_probe_peers": [],
> > "sync_provider": [],
> > "monmap": { "epoch": 2,
> > "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> > "modified": "2013-08-23 12:55:34.374650",
> > "created": "0.000000",
> > "mons": [
> > { "rank": 0,
> > "name": "n0",
> > "addr": "10.0.1.0:6789\/0"},
> > { "rank": 1,
> > "name": "n12",
> > "addr": "10.0.1.12:6789\/0"},
> > { "rank": 2,
> > "name": "n24",
> > "addr": "0.0.0.0:6810\/0"}]}}
> ^^^^^^^^^^^^^^^^^^^^
>
> This is the problem.  I can't remember exactly what causes this, though.
> Can you verify that the host in the ceph.conf mon_host line matches the IP
> that is configured on the machine, and that /etc/hosts on the machine
> doesn't have a loopback address on it?
>
> Thanks!
> sage
>
>
>
>
> >
> > Any ideas what is going on here? I don't see anything useful in
> > /var/log/ceph/ceph-mon.n24.log
> >
> > Thanks,
> >
> > - Travis
> >
> >
>
>
>
>