Re: Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

Konstantin Shalygin <k0ste@xxxxxxxx> · Fri, 8 Jun 2018 22:04:41 +0700

- ceph-disk was replaced for two reasons: (1) Its design was
centered around udev, and it was terrible.  We have been plagued for years
with bugs due to race conditions in the udev-driven activation of OSDs,
mostly variations of "I rebooted and not all of my OSDs started."  It's
horrible to observe and horrible to debug. (2) It was based on GPT
partitions. Lots of people had block layer tools they wanted to use
that were LVM-based, and the two didn't mix (no GPT partitions on top of
LVs).

- We designed ceph-volume to be *modular* because we anticipate that there
are going to be lots of ways that people provision the hardware devices
that we need to consider.  There are already two: legacy ceph-disk devices
that are still in use and have GPT partitions (handled by 'simple'), and
lvm.  SPDK devices where we manage NVMe devices directly from userspace
are on the immediate horizon--obviously LVM won't work there since the
kernel isn't involved at all.  We can add any other schemes we like.
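
To make the "modular" point concrete, here is a minimal sketch of a
registry that dispatches on the provisioning scheme. It is an
illustration only, not how ceph-volume is actually structured: the names
BACKENDS, register, and activate, and the handler bodies, are all
hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: provisioning scheme name -> activation handler.
BACKENDS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that records a handler under the given scheme name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        BACKENDS[name] = func
        return func
    return decorator


@register("lvm")
def activate_lvm(device: str) -> str:
    # Placeholder: a real handler would read LVM metadata and start the OSD.
    return f"lvm: activating {device}"


@register("simple")
def activate_simple(device: str) -> str:
    # Placeholder: a real handler would read the stored metadata file.
    return f"simple: activating {device}"


def activate(scheme: str, device: str) -> str:
    """Look up the handler for a scheme and run it on the device."""
    try:
        handler = BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no backend registered for scheme {scheme!r}") from None
    return handler(device)
```

Under this sketch, supporting another scheme (SPDK, say) would mean
registering one more handler without touching the existing ones.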

- If you don't like LVM (e.g., because you find that there is a measurable
overhead), let's design a new approach!  I wouldn't bother unless you can
actually measure an impact.  But if you can demonstrate a measurable cost,
let's do it.

- LVM was chosen as the default approach for new devices for a few
reasons:
   - It allows you to attach arbitrary metadata to each device, like which
cluster uuid it belongs to, which osd uuid it belongs to, which type of
device it is (primary, db, wal, journal), any secrets needed to fetch its
decryption key from a keyserver (the mon by default), and so on.
   - One of the goals was to enable lvm-based block layer modules beneath
OSDs (dm-cache).  All of the other devicemapper-based tools we are
aware of work with LVM.  It was a hammer that hit all nails.
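
As a rough illustration of the metadata point above, the sketch below
parses a comma-separated string of key=value tags (the shape in which
LVM reports tags on a logical volume) into a dictionary. The tag names
and values in the example are assumptions made up for illustration, and
parse_tags is a hypothetical helper, not part of ceph-volume.

```python
def parse_tags(tag_string: str) -> dict:
    """Turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}.

    Entries without an '=' are skipped, since they carry no value.
    """
    tags = {}
    for item in tag_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            continue  # plain tag with no key=value structure
        tags[key] = value
    return tags


# Hypothetical tag string; the names and values are illustrative only.
example = "ceph.cluster_uuid=1111-aaaa,ceph.osd_uuid=2222-bbbb,ceph.type=db"
tags = parse_tags(example)
print(tags["ceph.type"])
```

Because the tags travel with the logical volume itself, a tool can work
out what a device is for without consulting any file on the host.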

- The 'simple' mode is the current 'out' that avoids using LVM if it's not
an option for you.  We only implemented scan and activate because that was
all that we saw a current need for.  It should be quite easy to add the
ability to create new OSDs.

I would caution you, though, that simple relies on a file in /etc/ceph
that has the metadata about the devices.  If you lose that file you need
to have some way to rebuild it or we won't know what to do with your
devices.  That means you should make the devices self-describing in some
way... not, say, a raw device with dm-crypt layered directly on top, or
some other option that makes it impossible to tell what it is.  As long as
you can implement 'scan' and get any other info you need (e.g., whatever
is necessary to fetch decryption keys) then great.
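
A small sketch of why that metadata file matters: without it (or with an
incomplete one) there is nothing to tell a tool what a device is. The
example assumes the metadata is JSON and invents the key names
(cluster_uuid, osd_uuid, type) purely for illustration; the real file
layout may differ.

```python
import json

# Hypothetical set of keys needed to identify a device.
REQUIRED_KEYS = ("cluster_uuid", "osd_uuid", "type")


def load_metadata(text: str) -> dict:
    """Parse device metadata and fail loudly if it cannot identify the device."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("metadata is missing: " + ", ".join(missing))
    return data


# Illustrative content only; the values are made up.
sample = '{"cluster_uuid": "1111-aaaa", "osd_uuid": "2222-bbbb", "type": "primary"}'
print(load_metadata(sample)["type"])
```

A 'scan' implementation for a new scheme would need to be able to
regenerate this kind of record from the devices themselves.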

Thanks, I got what I wanted. This is the form in which the deprecation
should have been presented to the community: "why are we doing this, and
what will it give us." Instead, it was presented as: "We are killing the
tool along with its functionality; you should use the new one as is, even
if you do not know what it does."

Thanks again, Sage. I think this post should be on the Ceph blog.

k

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com