I've reworked the monitor bootstrapping. It's still a little rough around the edges in terms of feeding in initial cluster state, but all the monitor refactoring is done, so it should be mainly cleanup from here.

The basic bootstrap/mkfs process looks something like this:

 $ ceph-authtool /etc/ceph/keyring --create-keyring --gen-key -n client.admin
 $ ceph-authtool /etc/ceph/keyring --gen-key -n mon.

and then either

 $ monmaptool /tmp/monmap --create --clobber --add host1 1.2.3.4 --add host2 1.2.3.5 [...]

and on each host

 $ ceph-mon -i `hostname` --mkfs --monmap /tmp/monmap

or define monitors, mon addrs, and an fsid (`uuidgen`) in ceph.conf and on each host

 $ ceph-mon -i `hostname` --mkfs

One way or another, --mkfs is building an initial "seed" monmap that has an fsid and a list of initial monitor addresses. If you explicitly pass in a monmap (generated by monmaptool --create ...) that's pretty clear. Alternatively, it will make an initial map based on the --mon-host a,b,c list of addresses or on what it finds in ceph.conf. (This is the same bootstrapping that takes place when a random daemon or tool starts up and needs to contact a monitor to authenticate.) The fsid is required, but can come from the generated monmap, the command line (--fsid $uuid), or an 'fsid' option in ceph.conf.

There is likely some tweaking we can do here, particularly with the manual address specification step (TV is working on this), but the basic requirement is that we have (1) a unique fsid, (2) a list of initial monitor addresses, and (3) a keyring with the mon. and client.admin secret keys. Without those the new monitors don't know who to talk to to form the new cluster and initialize themselves.

Thereafter, you can add monitors to the cluster the exact same way. As long as the fsid matches, the secret key is valid, and one of the monitors in the seed monmap is alive and well, the new monitor will sync itself and then add itself to the cluster (by adding itself to the cluster's master monmap). For example, after adding a new [mon.`hostname`] section to your ceph.conf with 'mon addr' defined,

 $ ceph auth get mon. -o /tmp/monkey
 $ fsid=`ceph fsid --concise`
 $ ceph-mon -i `hostname` --mkfs -k /tmp/monkey --fsid $fsid
 $ ceph-mon -i `hostname`

will add a new monitor to the cluster. Here, the new monitor gets its peers from ceph.conf and the mon. key and fsid explicitly. You could also pass a recent copy of the monmap instead of relying on ceph.conf (if, say, the local ceph.conf doesn't list all monitors).
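For that variant, something like the following should work (just a sketch; the /tmp paths are arbitrary, and since the monmap carries the fsid there's no need for --fsid):

 $ ceph mon getmap -o /tmp/monmap     # current monmap from the live cluster
 $ ceph auth get mon. -o /tmp/monkey  # mon. secret, as above
 $ ceph-mon -i `hostname` --mkfs --monmap /tmp/monmap -k /tmp/monkey
 $ ceph-mon -i `hostname`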
The vstart.sh script has been switched to use the new process. Mainly this means that the initial osdmap isn't generated beforehand. Instead, when each osd is added, we do something like

 $ n=`ceph osd create --concise`
 $ ceph osd crush add $n osd.$n 1.0 host=localhost rack=localrack pool=default
 $ ceph-osd -i $n --mkfs --mkkey
 $ ceph auth add osd.$n osd "allow *" mon "allow rwx" -i dev/osd$n/keyring
 $ ceph-osd -i $n

which allocates an osd id, adds it to the crush map, initializes the osd data dir and creates a random secret, adds that secret to the monitor auth database, and then starts the osd.

One other piece here: currently, when a tool or daemon starts up, we build our initial monmap (the list of monitors to try to contact) in this order of preference:

1- Was --monmap <fn> specified?  (Normally it's not.)
2- Was --mon-host <list> specified?  If so, resolve DNS names and use that.  Fill in the fsid if provided (in ceph.conf or on the command line; normally it's not).
3- Look at the 'mon addr' values in the mon.* sections in my ceph.conf to build a list.  Fill in the fsid if provided.

The current normal practice is #3, with a ceph.conf on every node that has [mon.NNN] sections and mon addr values. Instead, you can do #2, which means you have something like

 [global]
         mon host = one.foo.com two.foo.com three.foo.com

One nice thing is that the client will try these at random until it connects and authenticates. Once that happens, it gets the real current monmap, which may include hosts not listed here. That means things like adding new monitors don't strictly require that you update ceph.conf all over the place (although that's presumably a good thing to do at some point).

That's where we are currently in the master branch. For those of you working on the Chef and Juju stuff, if you have feedback on whether there are still pain points, now's the time to share! :)

sage