On 20 Mar 2012, at 08:25, Sage Weil wrote:

> On Mon, 19 Mar 2012, Bernard Grymonpon wrote:
>> Sage Weil <sage <at> newdream.net> writes:
>>>
>>> On Wed, 7 Mar 2012, David McBride wrote:
>>>> On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
>>>>
>>>>> - scan the partitions for a partition label with the prefix
>>>>>   "ceph-osd-data-".
>>>>
>>>> Thought: I'd consider not using a numbered partition label as the
>>>> primary identifier for an OSD.
>>>>
>>
>> <snip>
>>
>>>> To make handling cases like these straightforward, I suspect Ceph may
>>>> want to use something functionally equivalent to an MD superblock --
>>>> though in practice, with an OSD, this could simply be a file
>>>> containing the appropriate metadata.
>>>>
>>>> In fact, I imagine that the OSDs could already contain the necessary
>>>> fields -- a reference to their parent cluster's UUID, to ensure
>>>> foreign volumes aren't mistakenly mounted; something like mdadm's
>>>> event counters to distinguish between current/historical versions of
>>>> the same OSD (configuration epoch count?); a UUID reference to that
>>>> OSD's journal file, etc.
>>>
>>> We're mostly there. Each cluster has a uuid, and each ceph-osd
>>> instance gets a uuid when you do ceph-osd --mkfs. That uuid is
>>> recorded in the osd data dir and in the journal, so you know that
>>> they go together.
>>>
>>> I think the 'epoch count' type stuff is sort of subsumed by all the
>>> osdmap versioning and so forth... are you imagining a duplicate/backup
>>> instance of an osd drive getting plugged in or something? We don't
>>> guard for that, but I'm not sure offhand how we would. :/
>>>
>>> Anyway, I suspect the missing piece here is to incorporate the uuids
>>> into the path names somehow.
>>
>> I would discourage relying on the disk labels, as you might not always
>> be able to set them (consider LUNs imported from other storage boxes,
>> or internal regulations on how disks are labeled...). I would trust the
>> sysadmin to know which mounts go where to get everything in place (they
>> can use the labels in fstab or some clever boot script), and then use
>> the ceph metadata to start only "sane" OSDs/MONs/...
>
> The goal is to make this optional. I.e., provide tools to use via udev
> to mount disks in good locations based on labels, but not require them
> if the sysadmin has some other idea about how it should be done.
>
> Ideally, the start/stop scripts should be able to look in /var/lib/ceph
> and start daemons for whatever it sees there that looks sane.

Start/stop scripts should not be that intelligent, in my opinion - a
start/stop script should just start or stop whatever it is told to
start/stop (usually through a simple config file pointing to the correct
directories). If a sysadmin decides to make a backup copy of some data
in /var/lib/ceph, that should not suddenly result in new instances being
spawned...

Also, a start/stop script should very clearly start/stop one specific
osd/mon... you don't want to restart/start/stop each and every daemon
every time (and the optional third parameter to the start/stop script is
uncommon).

>
>> In my opinion, an OSD should be able to figure out by itself whether it
>> has a "good" dataset to "boot" with - and it is up to the mon to either
>> reject or accept this OSD as a good/valid part of the cluster, or to
>> decide that it needs re-syncing.
>
> Yes.
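For what it's worth, the kind of local self-check I have in mind is
roughly the following - just a sketch; the helper name and the
magic/ceph_fsid file names are assumptions on my part, based on what
ceph-osd --mkfs records in the data dir:

  #!/bin/sh
  # check-osd-data.sh <data-dir> <expected-cluster-uuid>
  # (hypothetical helper) Refuse to start a daemon for a data dir that
  # doesn't look like it belongs to this cluster; whether the OSD is
  # actually current/valid stays the mon's decision.
  dir=$1
  expected=$2

  [ -f "$dir/magic" ]     || { echo "$dir: not an osd data dir"; exit 1; }
  [ -f "$dir/ceph_fsid" ] || { echo "$dir: no cluster uuid recorded"; exit 1; }

  if [ "$(cat "$dir/ceph_fsid")" != "$expected" ]; then
      echo "$dir: belongs to a different cluster, refusing to start"
      exit 1
  fi

  exit 0

Anything that fails a check like this simply gets skipped; the mon then
decides whether the ones that do come up are accepted, rejected or
re-synced.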
>>> TV wrote:
>>>> - FHS says human-editable configuration goes in /etc
>>>> - FHS says machine-editable state goes in /var/lib/ceph
>>>> - use /var/lib/ceph/mon/$id/ for mon.$id
>>>> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
>>>>   actual location
>>>> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink
>>>>   to actual location?
>>>
>>> I wonder if these should be something like
>>>
>>>   /var/lib/ceph/$cluster_uuid/mon/$id
>>>   /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
>>>   /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id
>>
>> The numbering of the MONs/OSDs is a bit of a hassle now; best would be
>> (in my opinion):
>>
>>   /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
>>   /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
>>   /var/lib/ceph/$cluster_uuid/mon/$mon_uuid/
>>
>> Journal and data go together for the OSD - so there is no need to split
>> them at a lower level. One can't have an OSD without both, so it seems
>> fair to put them next to each other...
>
> Currently the ceph-osd is told which id to be on startup; the only real
> shift here would be to let you specify some uuids instead and have it
> pull its rank (id) out of the .../whoami file.
>
> Monitors have user-friendly names ('foo', `hostname`). We could add
> uuids there too, but I'm less sure how useful that'll be.

Consistency would be the word you're looking for... both in ceph and in
the storage field. Storage ops people are used to long random strings
identifying parts (luns, identifiers, ...). Allowing the sysadmin to
specify the UUIDs themselves would give the best of both worlds: lazy
admins use the generated UUIDs, others generate their own (I can imagine
that having the node identified by its hostname, or some other label,
might be useful in a 10+ node cluster...).

>
>>> so that cluster instances don't stomp on one another. OTOH, that
>>> would imply that we should do something like
>>>
>>>   /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.
>>
>> Ack, although at cluster creation the cluster_uuid is unknown, which
>> kind of gives a chicken-and-egg situation.
>
> Making the mkfs process take the cluster_uuid as input is easy, although
> it makes it possible for a bad sysadmin to share a uuid across clusters.

Don't care for bad sysadmins :)

>
>> As I've been constructing some cookbooks to set up a default cluster,
>> this is what I bumped into:
>>
>> - the numbering (0, 1, ...) of the OSDs, and the need to keep the same
>>   number throughout the lifetime of the cluster, is a bit of a hassle.
>>   Each OSD needs to have a complete view of all the components of the
>>   cluster before it can determine its own ID. A random, auto-generated
>>   UUID would be nice (I currently solved this by assigning each cluster
>>   a global "clustername", searching the chef server for all nodes,
>>   looking for the highest-indexed OSD, and incrementing that to
>>   determine the new OSD's index - there must be a better way).
>
> The 'ceph osd create' command will handle the allocation of a new unique
> id for you. We could supplement that with a uuid to make it a bit more
> robust (if we add the osd uuids to the osdmap... probably a good idea
> anyway).

For this to work, you need a connection to the monitor(s), which raises
security issues and makes the creation of an OSD a two-node operation. An
OSD should generate a UUID itself, and that is its one and only
identifier. Once it has joined a cluster for the first time, it might
record the cluster uuid in its metadata. If the uuid of the osd clashes
with an existing uuid, the mon should reject it.
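To make that concrete, OSD preparation in a cookbook could then look
roughly like this - a sketch of the proposal only, not how things work
today (ceph-osd still wants a numeric id via -i, the uuid-based layout
above is not an existing default, and the helper name is made up):

  #!/bin/sh
  # prepare-osd.sh <cluster-uuid> <numeric-id>   (hypothetical helper)
  # Generate the osd's uuid locally; data and journal live next to each
  # other under the uuid-based path proposed above.
  cluster_uuid=$1
  id=$2                          # still required by ceph-osd for now

  osd_uuid=$(uuidgen)
  base=/var/lib/ceph/$cluster_uuid/osd/$osd_uuid

  mkdir -p "$base/data"
  ceph-osd -i "$id" --mkfs \
      --osd-data "$base/data" \
      --osd-journal "$base/journal"

  # the uuid is the identifier we keep track of from here on
  echo "$osd_uuid"

If the rank ever disappears as an external identifier, only the -i bit
would have to go.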
>
>> - the config file needs to be the same on all hosts - which is only
>>   partially true. From my point of view, an OSD should only have some
>>   way of contacting one mon, which would inform the OSD of the cluster
>>   layout. So only the mon info should be there (together with the info
>>   for the OSD itself, obviously)
>
> It doesn't, actually; it's only needed to bootstrap (to find the
> monitor(s) on startup) and to set any config values that are
> non-default. The current start/stop script wants to see the local
> instances there, but that can be replaced by looking for directories in
> /var/lib/ceph/.
>
>> - there is a chicken-and-egg problem in the authentication of an osd
>>   to the mon. An OSD should have permission to join the mon, for which
>>   we need to add the OSD to the mon. As chef works on the node, and
>>   can't trigger stuff on other nodes, the node that will hold the OSD
>>   needs some way of authenticating itself to the mon (I solved this by
>>   storing the "client.admin" secret on the mon node, pulling it from
>>   there on the osd node, and using it to register myself with the mon.
>>   It is like putting a copy of your house key on your front door...).
>>   I see no obvious solution here.
>
> We've set up a special key that has permission to create new osds only,
> but again it's pretty bad security. Chef's model just doesn't work well
> here.

There will always be some sort of "master key" for the cluster to
create/accept new instances (either this, or no security at all). I don't
see a way around it (or you will give up parts of the security model).

The more I think about it, all of the security between mons and osds is a
bit strange - most of the time your storage cluster will be on an
isolated, dedicated network (the private network/public network
parameters do this already). Security and rights towards the client nodes
are still needed...

>
>> - the current (debian) start/stop scripts are a hassle to work with,
>>   as chef doesn't understand the third parameter (/etc/init.d/ceph
>>   start mon.0). Each mon / osd / ... should have its own start/stop
>>   script.
>>
>> - there should be some way to ask a locally running OSD/MON for its
>>   status, without having to go through the monitor nodes. Sort of
>>   "ceph-local-daemon --uuid=xxx --type=mon status", which would tell
>>   us whether it is running, healthy, part of the cluster, lost in
>>   space...
>
> Each daemon has a socket in /var/run/ceph to communicate with it; adding
> a health command would be pretty straightforward.
>
>> - growing the cluster bit by bit would be ideal; this is how chef works
>>   (it handles node per node, not a bunch of nodes in one go)
>
> This works now, with the exception of monitor cluster bootstrap being
> awkward.

How is the initial number of pgs determined? If you start with no OSDs
and add them, do the pgs grow?

>
>> - ideally, there would be an automatic crushmap-expansion command which
>>   would add a device to an existing crushmap (or remove one). Now, the
>>   crushmap needs to be reconstructed completely, and if your numbering
>>   changes somehow, you're screwed. Ideal would be "take the current
>>   crushmap and add OSD with uuid xxx" - "take the current crushmap and
>>   remove OSD xxx"
>
> You can do this now, too:
>
>   ceph osd crush add <$osdnum> <osd.$osdnum> <weight> host=foo rack=bar [...]
>
> The crush map has an alphanumeric name that crush ignores (at least for
> devices), although osd.$num is what we generate by default. The
> keys/values are crush types for other levels of the hierarchy, so that
> you can specify where in the tree the new item should be placed.

Nice, I'll have a look at this later this week.
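If I read the syntax right, adding a freshly created osd to the live
crushmap would then be something like this (weight and host/rack names
made up for the example):

  ceph osd crush add 12 osd.12 1.0 host=storage03 rack=rack2

which is pretty much the "take the current crushmap and add OSD xxx"
behaviour I was after.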
>
> The questions for me now are what we should use for default locations
> and document as best practice.
>
> - do we want $cluster_uuid all over the place?

In my opinion - no. I don't see a single machine serving two clusters at
once; only in very special test cases might that happen. If an OSD knows
which cluster it belongs to (and records this in its metadata), that
would be fine.

> - should we allow osds to be started by $uuid instead of rank?

Yes, please. Numbering things is a pain if you don't have/control all the
nodes at once.

> - is it sufficient for init scripts to blindly start everything in
>   /var/lib/ceph, or do we need equivalent functionality to the 'auto
>   start = false' in ceph.conf (that Wido is using)?
> - is a single init script still appropriate, or do we want something
>   better? (I'm not very familiar with the new best practices for
>   upstart or systemd for multi-instance services like this.)

Start/stop scripts should be stupid, in my opinion - see above.

> - uuids for monitors?

Yes.

> - osd uuids in osdmap?

Yes, lose the "rank" completely if possible.

Rgds,
Bernard

>
> sage
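PS: on the upstart/systemd question - a per-daemon instance job would fit
the "stupid scripts" idea nicely. Roughly something like this is what I
have in mind (an untested sketch; the -f foreground flag and the binary
path are assumptions on my part):

  # /etc/init/ceph-osd.conf -- one upstart job instance per osd id
  description "ceph osd"

  # started/stopped as e.g.:  start ceph-osd id=0
  instance $id

  respawn
  exec /usr/bin/ceph-osd -i $id -f

That way chef (or the admin) only ever touches one specific daemon at a
time, never the whole lot at once.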