[Apologies for starting a new thread, vger unsubscribed me without
warning, I'm reading the previous thread via web]

In response to the thread at http://marc.info/?t=133360781700001&r=1&w=2

Sage:
> The locations could be:
>  keyring:
>      /etc/ceph/$cluster.keyring (fallback to /etc/ceph/keyring)

I think all the osds and mons will have their secrets inside their data
dir. This, if used, will be just for the command line tools.

>  osd_data, mon_data:
>      /var/lib/ceph/$cluster.$name
>      /var/lib/ceph/$cluster/$name
>      /var/lib/ceph/data/$cluster.$name
>      /var/lib/ceph/$type-data/$cluster-$id

I'm thinking.. /var/lib/ceph/$type/$cluster-$id, where $type is osd or
mon. And this is what is in the wip-defaults branch now, it seems.
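To make that concrete, this is roughly what the lookup boils down to (a
sketch only, in Python; the helper name and the fallback-picking logic
are mine for illustration, not code from the branch):

import os

def default_paths(cluster, daemon_type, daemon_id):
    # Illustrative only: data dir /var/lib/ceph/$type/$cluster-$id,
    # keyring /etc/ceph/$cluster.keyring with a fallback to
    # /etc/ceph/keyring. The keyring part matters for command line
    # tools only; daemons keep their secrets inside their data dir.
    data_dir = "/var/lib/ceph/%s/%s-%s" % (daemon_type, cluster, daemon_id)
    candidates = ["/etc/ceph/%s.keyring" % cluster, "/etc/ceph/keyring"]
    keyring = next((p for p in candidates if os.path.exists(p)), candidates[0])
    return data_dir, keyring

# default_paths("ceph", "osd", "12")
# -> ("/var/lib/ceph/osd/ceph-12", "/etc/ceph/ceph.keyring")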
Bernard Grymonpon <bernard@xxxxxxxxxxxx>:
> As a osd consists of data and the journal, it should stay together,
> with all info for that one osd in one place:
>
> I would suggest
>
> /var/lib/ceph/osd/$id/data
> and
> /var/lib/ceph/osd/$id/journal

The journal can live inside .../data (when just a file on the same
spindle is ok); we don't need a directory tree level just for that.

> ($id could be replaced by $uuid or $name, for which I would prefer $uuid)

In some ways $uuid is cleaner, but it is too visible to admins, and the
$id space still exists and cannot tolerate collisions, so we might as
well use $id.

Andrey Korolyov <andrey@xxxxxxx>:
> Right, but probably we need journal separation at the directory level
> by default, because there is a very small amount of cases when speed
> of main storage is sufficient for journal or when resulting speed
> decrease is not significant, so journal by default may go into
> /var/lib/ceph/osd/journals/$i/journal where osd/journals mounted on
> the fast disk.

Journals as files in a single, separate, dedicated filesystem (most
likely on SSD) is on my list of use cases to be supported. I don't think
we need to always use /var/lib/ceph/osd/journals just to support that;
I think I can arrange the support without that clumsiness. Details are
still pending.

I think many people with that hardware setup will choose to just GPT
partition the SSD, and have a lot less code in the way of the IO.

Bernard Grymonpon <bernard@xxxxxxxxxxxx>:
> I feel it's up to the sysadmin to mount / symlink the correct storage
> devices on the correct paths - ceph should not be concerned that some
> volumes might need to sit together.

I think we agree on intent, but I disagree strongly with the words you
chose. The sysadmin should not need to symlink anything. We will bring
up a whole cluster, and enable you to manage it easily. Anything less is
a failure. You may choose to opt out of some of those higher-level
smarts if you want to do things differently, but we will need to provide
them. Managing a big cluster without automation is painful, and people
shouldn't need to reinvent the wheel.

Now, whether that automation uses symlinks to point things at the right
places or not is almost an implementation detail.

Bernard Grymonpon <bernard@xxxxxxxxxxxx>:
> I assume most OSD nodes will normally run a single OSD, so this would
> not apply to most nodes.

Expected use right now is 1 OSD per hard drive, 8-12 hard drives per
server. That's what we'll be benchmarking on, primarily.

Wido den Hollander <wido@xxxxxxxxx>:
> I think that's a wrong assumption. On most systems I think multiple OSDs
> will exist, it's debatable if one would run OSDs from different clusters
> very often.

I expect mixing clusters to be rare, but the case for it has been made
strongly enough that it seems we will support it as a first-class
feature.

> I'm currently using: osd data = /var/lib/ceph/$name
>
> To get back to what sage mentioned, why add the "-data" suffix to a
> directory name? Isn't it obvious that a directory will contain data?

He was separating osd-data from osd-journal. Since then we've simplified
that to /var/lib/ceph/$type/$cluster-$id, which is close to what you
have, but 1) separating osd data dirs from the rest of /var/lib/ceph,
for future expansion room, and 2) adding the cluster name.

> As I think it is a very specific scenario where a machine would be
> participating in multiple Ceph clusters I'd vote for:
>
> /var/lib/ceph/$type/$id

I really want to avoid having two different cases: two different code
paths to test, with the rarer variant able to break without being
noticed. I want everything to always look the same. "ceph-" seems a
small enough price to pay for that.

Bernard Grymonpon <bernard@xxxxxxxxxxxx>:
> If it is recommended setup to have multiple OSDs per node (like, one
> OSD per physical drive), then we need to take that in account - but
> don't assume that one node only has one SSD disk for journals, which
> would be shared between all OSDs...

As I tried to explain in an earlier "braindump" email, we'll support
journals

1. inside osd data
2. on a shared ssd, as a file
3. on a separate block dev (e.g. ssd, or a 2nd raid lun with a
   different config)

and find the right journal automagically, by matching uuids. This is
what I'm working on right now (there's a rough sketch of the idea at the
end of this mail).

Bernard Grymonpon <bernard@xxxxxxxxxxxx>:
> I would suggest you fail the startup of the daemon, as it doesn't have
> all the needed parts - I personally don't like these "autodiscover"
> thingies, you never know why they are waiting/searching for,...

If you don't like hotplug, you can disable it. Having hundreds of
machines with ~10 disks each, in a remote data center managed by people
you've never seen, will probably make you like the automation more.
Having failed disks turn on an LED, ops swapping in a new drive from a
dedicated pile of pre-prepared spares, and things just.. working.. is
the goal.

> Say that we duplicate a node, for some testing/failover/... I would not
> want to daemon to automatically start, just because the data is there...

If you do that, just turn hotplug off before you plug the disks into the
replica. There's not much you can do with it, though -- starting the
osds is a no-no in this case. I would argue that random copying of osd
data disks or journals is an accident waiting to happen. But there's
nothing inherent in the design that says you can't do it. Just don't do
hotplug.

We can't rely on hotplug working anyway. There will *always* be a chance
for the admin to manually say "hey, here's a new block device for you,
see if there's something to run there".
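That manual step doesn't need to be anything fancy. Something along
these lines would do, once the admin has mounted the new disk somewhere
(the metadata file names 'whoami' and 'fsid' are just for illustration,
not a committed on-disk format):

import os

def inspect_new_disk(mountpoint):
    # Sketch of the manual "here's a new block device, see if there's
    # something to run there" step. Assumes (for illustration only) that
    # an osd data dir carries small metadata files: 'whoami' with the
    # osd id and 'fsid' identifying the cluster it belongs to.
    try:
        with open(os.path.join(mountpoint, "whoami")) as f:
            osd_id = f.read().strip()
        with open(os.path.join(mountpoint, "fsid")) as f:
            fsid = f.read().strip()
    except IOError:
        return None  # not an osd data disk, leave it alone
    # With hotplug off we stop here and let the admin decide; with
    # hotplug on, the same check would lead to starting the osd daemon.
    return {"osd_id": osd_id, "fsid": fsid}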
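And here's the rough sketch of the journal matching I promised above.
The shared-SSD mount point and the 'journal_uuid' file name are
assumptions for the sake of the example; the real mechanism, especially
for raw block devices, is the part still being worked out:

import os

def find_journal(osd_data, journal_fs="/srv/ceph-journals"):
    # Case 1: journal as a plain file inside the osd data dir.
    inside = os.path.join(osd_data, "journal")
    if os.path.exists(inside):
        return inside

    # The osd data dir is assumed to record the uuid of its journal.
    try:
        with open(os.path.join(osd_data, "journal_uuid")) as f:
            uuid = f.read().strip()
    except IOError:
        return None

    # Case 2: journal as a file, named after that uuid, on a shared
    # dedicated filesystem (most likely an SSD mounted at journal_fs).
    shared = os.path.join(journal_fs, uuid)
    if os.path.exists(shared):
        return shared

    # Case 3: a separate block device (GPT partition on an SSD, a second
    # raid lun, ...) would be matched the same way, by finding the
    # device whose stored uuid equals the one we want.
    return None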