Re: defaults paths #2

On Fri, Apr 6, 2012 at 00:37, Bernard Grymonpon <bernard@xxxxxxxxxxxx> wrote:
> Storage lives by UUID's, I would suggest to move the programming in the direction to get rid of all the own naming and labeling, and just stick to uuids. I could not care less if the data on that disk is internally named "23" or "678", it is just a part of my cluster, and it is up to ceph to figure out which part of the puzzle it holds.

Ceph storage does not live by UUIDs. Each object is stored on
specific osds, and each osd is identified by a sequential, dense,
unique integer. These integers exist, are allocated and freed, and
are the very *definition* of an osd.

UUIDs are useful for two things: 1. ensuring that osd.42 from
cluster A is not confused with osd.42 from cluster B (the
cluster-wide fsid), and 2. ensuring we use the right journal for an
osd (the per-osd uuid, stored in both the osd data and the journal).
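
To make that concrete (a sketch, with a made-up fsid; I'm only
pointing out where each uuid lives): the cluster-wide fsid is a
single line in ceph.conf, while the per-osd uuid never appears in the
config at all -- it gets written into the osd data and the journal
when the osd is created.

[global]
        # example value only -- this identifies the *cluster*, not any osd
        fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993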

UUIDs are not human-friendly, and they are not a native identifier in Ceph.

Here's what running multiple osds looks like with the new upstart stuff:

ubuntu@inst01:~$ sudo initctl list|grep ceph
ceph-osd-all stop/waiting
ceph-mon-all stop/waiting
ceph-mon (ceph/single) start/running, process 2527
ceph-osd (ceph/0) start/running, process 2576
ceph-osd (ceph/1) start/running, process 2580
ceph-osd (ceph/2) start/running, process 2592
ceph-osd (ceph/3) start/running, process 3053
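
Those small integers are also what you type when you want to poke a
single daemon. Assuming the instance variables match what the listing
above shows (cluster and id -- treat the exact syntax as a sketch,
not a promise):

ubuntu@inst01:~$ sudo stop ceph-osd cluster=ceph id=2
ubuntu@inst01:~$ sudo start ceph-osd cluster=ceph id=2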

Doing the same with UUIDs would make that output have lines like

ceph-osd (ceph/1bbb1f49-c574-45b0-b4cc-43dbe7bacc0d) start/running, process 2576

and I really, really don't want that for no gain.

UUIDs make a lot of sense when you don't have central coordination of
identifiers. But Ceph *has* that. (It *needs* that, because we don't
do lookup tables.)

>> I think many people with that hardware setup will choose to just GPT
>> partition the SSD, and have a lot less code in the way of the IO.
> A single file, sitting on some partition somewhere, will be very hard to auto-find, unless you'll keep some info in OSD that tracks the UUID of the disk + the path inside the filesystem where it previously was.

For example: "All the journals that are just files are inside
/var/lib/ceph/journal, or are symlinked there." And that can even be a
configurable search path with multiple entries. Not that hard.

OSD data will not contain a path name to the journal; that is too transient.
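
As a purely illustrative example of that layout (the /srv/ssd0 mount
point and the osd.0 file name are made up here, only
/var/lib/ceph/journal comes from the above): an osd whose journal is
a file on a shared SSD filesystem just needs to end up under the
search path, e.g.

ubuntu@inst01:~$ sudo ln -s /srv/ssd0/journal-osd.0 /var/lib/ceph/journal/osd.0

The osd would then locate its journal -- presumably by scanning the
search path and checking the per-osd uuid it already stores -- rather
than by remembering a path.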

>> I think we agree on intents, but I disagree strongly with the words
>> you chose. The sysadmin should not need to symlink anything. We will
>> bring up a whole cluster, and enable you to manage that easily.
>> Anything less is a failure. You may choose to opt out of some of that
>> higher smarts, if you want to do it differently, but we will need to
>> provide that. Managing a big cluster without automation is painful,
>> and people shouldn't need to reinvent the wheel.
>
> Actually, it will bring up one node of a cluster, regardless of the state of the other parts of the cluster. Question is how automated everything should be detected and started.
>
> Hotplugging and auto-mounting is there already - it is up to the sysadmin to use it if needed.

The automation I'm working on is intended to bring up and manage a
*whole cluster*, not a single machine.

Hotplugging is *not* there already. There is no turn-key
installs-in-under-60-minutes solution that gives you that.

> Same goes the other way - if someone like to start everything automagically, there is nothing that stops others from using the normal hotplug tools available, just write some rules to act on the presence of certain disklabels.

I am writing that logic, right now. For everyone to use. So that
everyone doesn't need to reinvent the wheel.
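
To give a flavor of what such a udev rule could look like (the GPT
type GUID and the helper name below are placeholders, not anything
that ships today; it also assumes udev exposes the partition type as
ID_PART_ENTRY_TYPE):

ACTION=="add", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_TYPE}=="<osd-data-gpt-type-guid>", RUN+="/usr/local/sbin/ceph-hotplug-osd /dev/%k"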

> Ceph can provide scripts to do this, but I would keep this logic out of the default behavior of ceph-osd. In my opinion, this is a storage daemon - not a storage-node-management-daemon.
>
> If you want to keep this outside ceph-osd, and provide this as some helper to the /etc/init.d scripts, then I'm all the way with you, as long as the magic .

It will all be optional, though you will be encouraged to use the automation.

> In my opinion the config file for ceph should contain only UUIDs for the devices which hold the data, like:
>
> ...
> [osd]
>  uuid=abc-123-123-123
>  uuid=def-456-456-456
>  uuid=ghi-789-789-789
> ...

We are talking about hundreds of machines and thousands of disks.
Nobody I've talked to wants to edit a config file to add or replace a
disk on a system that big; that's just not the way to go.

> If someone removes or pops in a disk, what should happen then is up to me. If it starts everything automatically, or if a certain node isn't set to auto-boot, is up to me (and should result in changes in the config in my chef-server, which controls all my nodes). If I want all these UUIDs to auto-start a OSD when inserted, then I would let chef write a rules file in udev to do exactly that.

It seems your opinions about how to manage your systems are different
enough that you might choose not to use the automation I've been
working on. That's fine, just do it. The core ceph daemons will not
assume things either way.

I would recommend you come back in a few months and see what comes
out of the work; you might find it more acceptable than you currently
think.

