On Fri, 8 Jun 2018, Alfredo Deza wrote:
> On Fri, Jun 8, 2018 at 8:13 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > I'm going to jump in here with a few points.
> >
> > - ceph-disk was replaced for two reasons: (1) Its design was centered
> > around udev, and it was terrible. We have been plagued for years with
> > bugs due to race conditions in the udev-driven activation of OSDs,
> > mostly variations of "I rebooted and not all of my OSDs started." It's
> > horrible to observe and horrible to debug. (2) It was based on GPT
> > partitions. Lots of people had block layer tools they wanted to use
> > that were LVM-based, and the two didn't mix (no GPT partitions on top
> > of LVs).
> >
> > - We designed ceph-volume to be *modular* because we anticipate that
> > there are going to be lots of ways that people provision the hardware
> > devices that we need to consider. There are already two: legacy
> > ceph-disk devices that are still in use and have GPT partitions
> > (handled by 'simple'), and lvm. SPDK devices, where we manage NVMe
> > devices directly from userspace, are on the immediate horizon--
> > obviously LVM won't work there since the kernel isn't involved at all.
> > We can add any other schemes we like.
> >
> > - If you don't like LVM (e.g., because you find that there is a
> > measurable overhead), let's design a new approach! I wouldn't bother
> > unless you can actually measure an impact. But if you can demonstrate
> > a measurable cost, let's do it.
> >
> > - LVM was chosen as the default approach for new devices for a few
> > reasons:
> >   - It allows you to attach arbitrary metadata to each device, like
> > which cluster uuid it belongs to, which osd uuid it belongs to, which
> > type of device it is (primary, db, wal, journal), any secrets needed
> > to fetch its decryption key from a keyserver (the mon by default),
> > and so on.
> >   - One of the goals was to enable lvm-based block layer modules
> > beneath OSDs (dm-cache). All of the other devicemapper-based tools we
> > are aware of work with LVM. It was a hammer that hit all nails.
> >
> > - The 'simple' mode is the current 'out' that avoids using LVM if it's
> > not an option for you. We only implemented scan and activate because
> > that was all that we saw a current need for. It should be quite easy
> > to add the ability to create new OSDs.
> >
> > I would caution you, though, that 'simple' relies on a file in
> > /etc/ceph that has the metadata about the devices. If you lose that
> > file, you need to have some way to rebuild it, or we won't know what
> > to do with your devices. That means you should make the devices
> > self-describing in some way... not, say, a raw device with dm-crypt
> > layered directly on top, or some other option that makes it impossible
> > to tell what it is. As long as you can implement 'scan' and get any
> > other info you need (e.g., whatever is necessary to fetch decryption
> > keys), then great.
>
> 'scan' allows you to recreate that file from a data device or from an
> OSD directory (e.g. /var/lib/ceph/osd/ceph-0/).
>
> So even in the case of disaster (or migrating) we can still get that
> file again. This includes the ability to detect both ceph-disk's
> encryption support and regular OSDs.
>
> Do you mean that there might be situations where 'scan' wouldn't be
> able to recreate this file? I think the out would be if the OSD is
> mounted/available already.

Right, it works great for the GPT-style ceph-disk devices. I'm just
cautioning that if someone wants to implement a *new* mode that doesn't
use lvm or the legacy ceph-disk scheme and "uses raw devices for lower
overhead" (whatever that ends up meaning), it should be done in a way
such that scan can be implemented.

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
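[Editor's note] The "self-describing devices" requirement discussed above can be made concrete with a toy sketch. This is my own illustration, not ceph-volume's actual on-disk format or code: the file names and key=value layout here are invented for the example. In the real lvm mode the identifying metadata is attached as LVM tags (viewable with `lvs -o lv_tags`), and the real 'simple' workflow uses `ceph-volume simple scan` and `ceph-volume simple activate`.

```shell
# Toy illustration only -- NOT ceph-volume's real metadata format.
# It shows why a device that carries its own identifying metadata
# lets a 'scan' step rebuild the host-local /etc/ceph file at any time.

osd_dir=$(mktemp -d)   # stands in for a mounted OSD data device
etc_dir=$(mktemp -d)   # stands in for /etc/ceph/osd

# Provisioning: write identifying metadata onto the device itself
# (real lvm mode attaches the equivalent information as LVM tags).
printf 'osd_id=0\nosd_fsid=aaaa-bbbb\ntype=bluestore\n' > "$osd_dir/metadata"

# 'scan': rebuild the host-local description file from the device alone.
scan() {
    cp "$1/metadata" "$2/0-aaaa-bbbb"
}

rm -f "$etc_dir"/0-*        # simulate losing the /etc/ceph metadata file
scan "$osd_dir" "$etc_dir"  # recover it from the self-describing device
cat "$etc_dir/0-aaaa-bbbb"
```

A raw device with dm-crypt layered directly on top fails this test: there is nothing on it that a scan could read back, so the lost file cannot be reconstructed.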