Re: Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)


 



On Fri, Jun 8, 2018 at 8:13 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> I'm going to jump in here with a few points.
>
> - ceph-disk was replaced for two reasons: (1) Its design was
> centered around udev, and it was terrible.  We have been plagued for years
> with bugs due to race conditions in the udev-driven activation of OSDs,
> mostly variations of "I rebooted and not all of my OSDs started."  It's
> horrible to observe and horrible to debug. (2) It was based on GPT
> partitions, lots of people had block layer tools they wanted to use
> that were LVM-based, and the two didn't mix (no GPT partitions on top of
> LVs).
>
> - We designed ceph-volume to be *modular* because we anticipate that there
> are going to be lots of ways that people provision the hardware devices
> that we need to consider.  There are already two: legacy ceph-disk devices
> that are still in use and have GPT partitions (handled by 'simple'), and
> lvm.  SPDK devices where we manage NVMe devices directly from userspace
> are on the immediate horizon--obviously LVM won't work there since the
> kernel isn't involved at all.  We can add any other schemes we like.
>
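To make that modularity a bit more concrete: each provisioning strategy is
just a subcommand, so the workflow looks roughly like this (device paths
and IDs below are placeholders):

    ceph-volume lvm prepare --bluestore --data /dev/sdb
    ceph-volume lvm activate <osd id> <osd fsid>
    ceph-volume simple scan /var/lib/ceph/osd/ceph-0
    ceph-volume simple activate <osd id> <osd fsid>
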
> - If you don't like LVM (e.g., because you find that there is a measurable
> overhead), let's design a new approach!  I wouldn't bother unless you can
> actually measure an impact.  But if you can demonstrate a measurable cost,
> let's do it.
>
> - LVM was chosen as the default approach for new devices for a few
> reasons:
>   - It allows you to attach arbitrary metadata to each device, like which
> cluster uuid it belongs to, which osd uuid it belongs to, which type of
> device it is (primary, db, wal, journal), any secrets needed to fetch its
> decryption key from a keyserver (the mon by default), and so on.
>   - One of the goals was to enable lvm-based block layer modules beneath
> OSDs (dm-cache).  All of the other devicemapper-based tools we are
> aware of work with LVM.  It was a hammer that hit all nails.
>
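For what it's worth, that first point is easy to see with stock LVM tooling,
since the metadata is stored as plain LVM tags on the LV. Something along
these lines (tag list abridged, values are placeholders):

    $ lvs -o lv_name,lv_tags
    osd-block-<osd fsid>  ceph.cluster_fsid=<fsid>,ceph.osd_id=0,ceph.osd_fsid=<osd fsid>,ceph.type=block,...
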
> - The 'simple' mode is the current 'out' that avoids using LVM if it's not
> an option for you.  We only implemented scan and activate because that was
> all that we saw a current need for.  It should be quite easy to add the
> ability to create new OSDs.
>
> I would caution you, though, that simple relies on a file in /etc/ceph
> that has the metadata about the devices.  If you lose that file you need
> to have some way to rebuild it or we won't know what to do with your
> devices.
> That means you should make the devices self-describing in some
> way... not, say, a raw device with dm-crypt layered directly on top, or
> some other option that makes it impossible to tell what it is.  As long as
> you can implement 'scan' and get any other info you need (e.g., whatever
> is necessary to fetch decryption keys) then great.

'scan' allows you to recreate that file from a data device or from
an OSD directory (e.g. /var/lib/ceph/osd/ceph-0/).

So even in the case of disaster (or when migrating) we can still get that
file back. This includes the ability to detect both ceph-disk's
encrypted OSDs and regular OSDs.
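
For example (paths and IDs here are illustrative), running:

    ceph-volume simple scan /var/lib/ceph/osd/ceph-0

writes the captured metadata out as JSON under /etc/ceph/osd/
(e.g. /etc/ceph/osd/0-<osd fsid>.json), and that file is all that

    ceph-volume simple activate 0 <osd fsid>

needs to bring the OSD back up.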

Do you mean that there might be situations where 'scan' wouldn't be
able to recreate this file? I think the way out there would be if the OSD
is already mounted/available.

>
> sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


