Re: Is mon initial members used after the first quorum?

On 12/11/2014 04:18 AM, Christian Balzer wrote:
On Wed, 10 Dec 2014 20:09:01 -0800 Christopher Armstrong wrote:

Christian,

That indeed looks like the bug! We tried moving the monitor host/address into [global] and everything works as expected - see
https://github.com/deis/deis/issues/2711#issuecomment-66566318

This seems like a potentially bad bug - how has it not come up before?

Ah, but as you can see from the issue report, it has come up before.
But that discussion as well as that report clearly fell through the cracks.

It's another reason I dislike ceph-deploy: people who use only it
(probably the vast majority) are unaffected, since it stuffs everything
into [global].

People reading the documentation examples or coming from older versions
(and making changes to their config) will get bitten.
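
For anyone comparing: a ceph-deploy-generated ceph.conf keeps the monitor list entirely in [global], roughly along these lines (a sketch reusing the addresses from the config quoted further down, not a verbatim ceph-deploy dump):

[global]
fsid = fc0e2e09-ade3-4ff6-b23e-f789775b2515
mon initial members = nodo-1, nodo-2, nodo-3
mon host = 192.168.2.200,192.168.2.201,192.168.2.202

whereas the documentation-style configs enumerate each monitor in its own [mon.X] section, as in the file quoted below.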

I find this extremely weird. I, as I suppose most devs do, use clusters for testing that are not deployed using ceph-deploy. These are built and configured using vstart.sh, which builds a ceph.conf from scratch without 'mon initial members' or 'mon host', the monmap being derived from the specific [mon.X] sections.

In any case, I decided to give this a shot and build a local, 3-node cluster on addresses 127.0.0.{1,2,3}:6789, using Christopher's configuration file, relying as much as possible on specific mon config (attached).

You will notice that the main difference between this config file and a production config is the slew of config keys overridden from their default paths to something like '/home/ubuntu/tmp/foo/{run,dev,out}/' -- this allows me to run Ceph from a dev branch instead of having to install it on the system (I could have used Docker, but didn't think that was the point).

Anyway, you'll also notice that each mon section has a bunch of config options that you wouldn't otherwise see; this is mostly a dev conf, and I copied whatever I found reasonable from a vstart.sh-generated ceph.conf.

I also dropped 'mon initial members = nodo-3' from the [global] section. Keeping it would leave the monitors unable to create a proper monmap on bootstrap, as each would only know of a single monitor. Besides, the point was to test the specific [mon.X] config sections, and if I were to properly configure 'mon initial members' we would end up in the scenario that Christian is complaining about.
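
To be explicit, 'properly configured' would mean listing all three monitors, i.e. a [global] entry along the lines of:

mon initial members = nodo-1, nodo-2, nodo-3

which, combined with addresses only in the [mon.X] sections, is exactly the shape of config under discussion.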

Anyway, I aliased a few lo addresses to make sure each mon gets a different IP (albeit local); the aliasing itself is sketched after the commands below. Monitors were built relying solely on the config file:

for i in 1 2 3; do ceph-mon -i nodo-$i --mkfs -d || break ; done

and were run in much the same way:

for i in 1 2 3; do ceph-mon -i nodo-$i || break ; done
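
The lo aliasing itself was along these lines (a sketch assuming Linux/iproute2; on Linux the whole 127.0.0.0/8 range is usually reachable on lo anyway, so this may well be redundant):

for i in 2 3; do sudo ip addr add 127.0.0.$i/32 dev lo ; done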

From tailing the logs it was clear the monitors had formed a quorum (one was the leader, two were peons), so they were able to build a proper monmap and find each other.

Running the 'ceph' tool with '--debug-monc 10' also shows that the client is able to build an initial monmap (and it later reaches the monitors for a status report):

ubuntu@terminus:~/tmp/foo$ ceph -s --debug-monc 10
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-12-11 09:47:44.114010 7fb4660c0700 10 monclient(hunting): build_initial_monmap
2014-12-11 09:47:44.114338 7fb4660c0700 10 monclient(hunting): init
2014-12-11 09:47:44.114345 7fb4660c0700 10 monclient(hunting): auth_supported 2 method cephx
2014-12-11 09:47:44.114564 7fb4660c0700 10 monclient(hunting): _reopen_session rank -1 name
2014-12-11 09:47:44.114606 7fb4660c0700 10 monclient(hunting): picked mon.nodo-2 con 0x7fb4600102f0 addr 127.0.0.2:6789/0
2014-12-11 09:47:44.114623 7fb4660c0700 10 monclient(hunting): _send_mon_message to mon.nodo-2 at 127.0.0.2:6789/0

[...]

2014-12-11 09:47:44.151646 7fb4577fe700 10 monclient: handle_mon_command_ack 2 [{"prefix": "status"}]
2014-12-11 09:47:44.151648 7fb4577fe700 10 monclient: _finish_command 2 = 0
    cluster fc0e2e09-ade3-4ff6-b23e-f789775b2515
     health HEALTH_ERR
            64 pgs stuck inactive
            64 pgs stuck unclean
            no osds
            mon.nodo-1 low disk space
            mon.nodo-2 low disk space
            mon.nodo-3 low disk space
     monmap e1: 3 mons at {nodo-1=127.0.0.1:6789/0,nodo-2=127.0.0.2:6789/0,nodo-3=127.0.0.3:6789/0}
            election epoch 6, quorum 0,1,2 nodo-1,nodo-2,nodo-3
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating

Next, radosgw:

ubuntu@terminus:~/tmp/foo$ radosgw -d --debug-monc 10
warning: line 19: 'log_file' in section 'global' redefined
2014-12-11 09:51:05.015312 7f6ad66787c0 0 ceph version 0.89-420-g6f4b98d (6f4b98df317816c11838f0c339be3a8d19d47a25), process lt-radosgw, pid 4480
2014-12-11 09:51:05.017022 7f6ad66787c0 10 monclient(hunting): build_initial_monmap
2014-12-11 09:51:05.017376 7f6ad66787c0 10 monclient(hunting): init
2014-12-11 09:51:05.017398 7f6ad66787c0 10 monclient(hunting): auth_supported 2 method cephx
2014-12-11 09:51:05.017848 7f6ad66787c0 10 monclient(hunting): _reopen_session rank -1 name
2014-12-11 09:51:05.017943 7f6ad66787c0 10 monclient(hunting): picked mon.nodo-3 con 0x31ab3b0 addr 127.0.0.3:6789/0
2014-12-11 09:51:05.017972 7f6ad66787c0 10 monclient(hunting): _send_mon_message to mon.nodo-3 at 127.0.0.3:6789/0
2014-12-11 09:51:05.017985 7f6ad66787c0 10 monclient(hunting): renew_subs
2014-12-11 09:51:05.017987 7f6ad66787c0 10 monclient(hunting): authenticate will time out at 2014-12-11 09:56:05.017987
2014-12-11 09:51:05.018776 7f6ac67fc700 10 monclient(hunting): handle_monmap mon_map magic: 0 v1
2014-12-11 09:51:05.018796 7f6ac67fc700 10 monclient(hunting): got monmap 1, mon.nodo-3 is now rank 2
2014-12-11 09:51:05.018799 7f6ac67fc700 10 monclient(hunting): dump:
epoch 1
fsid fc0e2e09-ade3-4ff6-b23e-f789775b2515
last_changed 0.000000
created 0.000000
0: 127.0.0.1:6789/0 mon.nodo-1
1: 127.0.0.2:6789/0 mon.nodo-2
2: 127.0.0.3:6789/0 mon.nodo-3

And there you have it. In this case radosgw won't do anything useful (I didn't set up OSDs, for one), but it does build an initial monmap from the config file and does find the monitors to obtain the most recent monmap.


I'm having a hard time believing that we are ignoring section-specific config options. If you manage to reproduce this reliably, please do provide steps to reproduce.

Cheers!

  -Joao


Christian

Anything we can do to help with a patch?

Chris

On Wed, Dec 10, 2014 at 5:14 PM, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

I think this might very well be my poor, unacknowledged bug report:
http://tracker.ceph.com/issues/10012

People with a mon_host entry in [global] (as created by ceph-deploy)
will be fine; people with mons specified outside of [global] will not.

Regards,

Christian

On Thu, 11 Dec 2014 00:49:03 +0000 Joao Eduardo Luis wrote:

On 12/10/2014 09:05 PM, Gregory Farnum wrote:
What version is he running?

Joao, does this make any sense to you?

  From the MonMap code I'm pretty sure that the client should have
built the monmap from the [mon.X] sections, and solely based on 'mon
addr'.

'mon_initial_members' is only useful to the monitors anyway, so it
can be disregarded.

Thus, there are two ways for a client to build a monmap:
1) based on 'mon host' in the config (or -m on the CLI); or
2) based on the 'mon addr' entries in the [mon.X] sections

I don't see a 'mon host = ip1,ip2,...' in the config file, and I'm assuming no '-m ip1,ip2...' was supplied on the CLI, so we would have been left with the 'mon addr' options in each individual [mon.X] section.
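
Concretely, with the addresses from the config quoted further down, option 1 would look roughly like either of these (illustration only):

mon host = 192.168.2.200,192.168.2.201,192.168.2.202

in [global], or on the command line:

ceph -s -m 192.168.2.200,192.168.2.201,192.168.2.202

Option 2 is the per-[mon.X] layout already present in the quoted config.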

We are left with two options here: assume there was unexpected
behavior on this code path -- logs or steps to reproduce would be
appreciated in this case! -- or assume something else failed:

- are the ips on the remaining mon sections correct (nodo-1 &&
nodo-2)?
- were all the remaining monitors up and running when the failure
occurred?
- were the remaining monitors reachable by the client?

In case you are able to reproduce this behavior, it would be nice if you
could provide logs with 'debug monc = 10' and 'debug ms = 1'.
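
Those can be passed on the daemon's command line (e.g. 'radosgw -d --debug-monc 10 --debug-ms 1') or dropped into its ceph.conf section; a sketch, reusing the section name from the quoted config:

[client.radosgw.gateway]
debug monc = 10
debug ms = 1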

Cheers!

    -Joao


-Greg

On Wed, Dec 10, 2014 at 11:54 AM, Christopher Armstrong
<chris@xxxxxxxxxxxx> wrote:
Thanks Greg - I thought the same thing, but confirmed with the
user that it appears the radosgw client is indeed using initial
members - when he added all of his hosts to initial members,
things worked just fine. In either event, all of the monitors
were always fully enumerated later in the config file. Is this
potentially a bug specific to radosgw? Here's his config file:

[global]
fsid = fc0e2e09-ade3-4ff6-b23e-f789775b2515
mon initial members = nodo-3
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min_size = 1
osd pool default pg_num = 128
osd pool default pgp_num = 128
osd recovery delay start = 15
log file = /dev/stdout
mon_clock_drift_allowed = 1


[mon.nodo-1]
host = nodo-1
mon addr = 192.168.2.200:6789

[mon.nodo-2]
host = nodo-2
mon addr = 192.168.2.201:6789

[mon.nodo-3]
host = nodo-3
mon addr = 192.168.2.202:6789



[client.radosgw.gateway]
host = deis-store-gateway
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /dev/stdout


On Wed, Dec 10, 2014 at 11:40 AM, Gregory Farnum
<greg@xxxxxxxxxxx> wrote:

On Tue, Dec 9, 2014 at 3:11 PM, Christopher Armstrong
<chris@xxxxxxxxxxxx> wrote:
Hi folks,

I think we have a bit of confusion around how initial members is used. I understand that we can specify a single monitor (or a subset of monitors) so that the cluster can form a quorum when it first comes up. This is how we're using the setting now - so the cluster can come up with just one monitor, with the other monitors to follow later.

However, a Deis user reported that when the monitor in his initial members list went down, radosgw stopped functioning, even though there are three mons in his config file. I would think that the radosgw client would connect to any of the nodes in the config file to get the state of the cluster, and that the initial members list is only used when the monitors first come up and are trying to achieve quorum.

The issue he filed is here:
https://github.com/deis/deis/issues/2711

He also found this Ceph pull request:
https://github.com/ceph/ceph/pull/1233

Nope, this has nothing to do with it.


Is that what we're seeing here? Can anyone point us in the right
direction?

I didn't see the actual conf file posted anywhere to look at, but my guess is simply that (since it looks like you're using generated conf files which can differ across hosts) the one on the server(s) in question doesn't have the monitors listed in it. I'm only skimming the code, but from it and my recollection, when a Ceph client starts up it will try to assemble a list of monitors to contact from:
1) the contents of the "mon host" config entry
2) the "mon addr" value in any of the "global", "mon" or "mon.X" sections

The clients don't even look at mon_initial_members that I can see, actually -- so perhaps your client config only lists the initial monitor, without adding the others?
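
One quick way to check what a given section actually resolves to is the ceph-conf utility, assuming it is installed alongside the other tools -- a sketch:

ceph-conf -c /etc/ceph/ceph.conf --name client.radosgw.gateway --lookup mon_host
ceph-conf -c /etc/ceph/ceph.conf --name mon.nodo-1 --lookup mon_addr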
-Greg







--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/





--
Joao Eduardo Luis
Software Engineer | http://ceph.com
[Attachment: ceph.conf]

[global]
fsid = fc0e2e09-ade3-4ff6-b23e-f789775b2515
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min_size = 1
osd pool default pg_num = 128
osd pool default pgp_num = 128
osd recovery delay start = 15
log file = /dev/stdout
mon_clock_drift_allowed = 1

run dir = /home/ubuntu/tmp/foo/run
osd pool default erasure code directory = /home/ubuntu/src/ceph.master/src/.libs
osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 ruleset-failure-domain=osd
keyring = /home/ubuntu/tmp/foo/keyring
admin socket = /home/ubuntu/tmp/foo/run/$name.asok
log file = /home/ubuntu/tmp/foo/out/$name.log
pid file = /home/ubuntu/tmp/foo/run/$name.pid

 
[mon.nodo-1]
host = terminus
mon addr = 127.0.0.1:6789
mon data = /home/ubuntu/tmp/foo/dev/mon.nodo-1
mon pg warn min per osd = 3
mon osd allow primary affinity = true
mon reweight min pgs per osd = 4
mon cluster log file = /home/ubuntu/tmp/out/cluster.mon.$id.log
 
[mon.nodo-2]
host = terminus
mon addr = 127.0.0.2:6789
mon data = /home/ubuntu/tmp/foo/dev/mon.nodo-2
mon pg warn min per osd = 3
mon osd allow primary affinity = true
mon reweight min pgs per osd = 4
mon cluster log file = /home/ubuntu/tmp/out/cluster.mon.$id.log
 
[mon.nodo-3]
host = terminus
mon addr = 127.0.0.3:6789 
mon data = /home/ubuntu/tmp/foo/dev/mon.nodo-3
mon pg warn min per osd = 3
mon osd allow primary affinity = true
mon reweight min pgs per osd = 4
mon cluster log file = /home/ubuntu/tmp/out/cluster.mon.$id.log

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
