Re: Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

"Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx> · Tue, 17 Jul 2018 14:34:28 +0200

I still wanted to thank you for the nicely detailed arguments regarding 
this, it is much appreciated. It really gives me the broader perspective 
I was lacking. 

-----Original Message-----
From: Warren Wang [mailto:Warren.Wang@xxxxxxxxxxx] 
Sent: Monday, 11 June 2018 17:30
To: Konstantin Shalygin; ceph-users@xxxxxxxxxxxxxx; Marc Roos
Subject: Re: Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

I'll chime in as a large scale operator, and a strong proponent of 
ceph-volume.
Ceph-disk wasn't accomplishing what was needed with anything other than 
vanilla use cases (even then, still kind of broken). I'm not going to 
re-hash Sage's valid points too much, but trying to manipulate the old 
ceph-disk to work with your own LVM (or other block manager) was painful. 
As far as the pain of doing something new goes, yes, sometimes moving to 
newer, more flexible methods results in a large amount of work. Trust me, 
I feel that pain when we're talking about things like ceph-volume, 
bluestore, etc., but these changes are not made without reason.

As far as LVM performance goes, I think that's well understood in the 
larger Linux community. We accept that minimal overhead to accomplish 
some of the setups that we're interested in, such as encrypted, 
lvm-cached OSDs. The above is not a trivial thing to do using ceph-disk. 
We know, we run that in production, at large scale. It's plagued with 
problems, and since it's done without Ceph itself, it is difficult to 
tie the two together. Having it managed directly by Ceph, via 
ceph-volume, makes much more sense. We're not alone in this, so I know 
it will benefit others as well, at the cost of technical expertise.

There are maintainers now for ceph-volume, so if there's something you 
don't like, I suggest proposing a change. 

Warren Wang

On 6/8/18, 11:05 AM, "ceph-users on behalf of Konstantin Shalygin" 
<ceph-users-bounces@xxxxxxxxxxxxxx on behalf of k0ste@xxxxxxxx> wrote:

    > - ceph-disk was replaced for two reasons: (1) Its design was centered
    > around udev, and it was terrible.  We have been plagued for years with
    > bugs due to race conditions in the udev-driven activation of OSDs,
    > mostly variations of "I rebooted and not all of my OSDs started."  It's
    > horrible to observe and horrible to debug. (2) It was based on GPT
    > partitions; lots of people had block layer tools they wanted to use
    > that were LVM-based, and the two didn't mix (no GPT partitions on top
    > of LVs).
    >
    > - We designed ceph-volume to be *modular* because we anticipate that
    > there are going to be lots of ways that people provision the hardware
    > devices that we need to consider.  There are already two: legacy
    > ceph-disk devices that are still in use and have GPT partitions
    > (handled by 'simple'), and lvm.  SPDK devices where we manage NVMe
    > devices directly from userspace are on the immediate horizon--obviously
    > LVM won't work there since the kernel isn't involved at all.  We can
    > add any other schemes we like.
    >
    > - If you don't like LVM (e.g., because you find that there is a
    > measurable overhead), let's design a new approach!  I wouldn't bother
    > unless you can actually measure an impact.  But if you can demonstrate
    > a measurable cost, let's do it.
    >
    > - LVM was chosen as the default approach for new devices for a few
    > reasons:
    >    - It allows you to attach arbitrary metadata to each device, like
    > which cluster uuid it belongs to, which osd uuid it belongs to, which
    > type of device it is (primary, db, wal, journal), any secrets needed
    > to fetch its decryption key from a keyserver (the mon by default), and
    > so on.
    >    - One of the goals was to enable lvm-based block layer modules
    > beneath OSDs (dm-cache).  All of the other devicemapper-based tools we
    > are aware of work with LVM.  It was a hammer that hit all nails.
    >
    > - The 'simple' mode is the current 'out' that avoids using LVM if it's
    > not an option for you.  We only implemented scan and activate because
    > that was all that we saw a current need for.  It should be quite easy
    > to add the ability to create new OSDs.
    >
    > I would caution you, though, that simple relies on a file in /etc/ceph
    > that has the metadata about the devices.  If you lose that file you
    > need to have some way to rebuild it or we won't know what to do with
    > your devices.  That means you should make the devices self-describing
    > in some way... not, say, a raw device with dm-crypt layered directly on
    > top, or some other option that makes it impossible to tell what it is.
    > As long as you can implement 'scan' and get any other info you need
    > (e.g., whatever is necessary to fetch decryption keys) then great.
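
To make the "arbitrary metadata on each device" and "self-describing" points above a bit more concrete, here is a minimal Python sketch. It assumes the metadata is stored as LVM tags in key=value form and arrives as one comma-separated string (the form `lvs -o lv_tags` prints). The tag names used here (`ceph.cluster_fsid`, `ceph.osd_fsid`, `ceph.type`), the sample values, and both helper functions are illustrative assumptions, not ceph-volume's actual code.

```python
from typing import Dict, Iterable

# Assumed tag namespace; used here only for illustration.
CEPH_PREFIX = "ceph."


def parse_lv_tags(lv_tags: str) -> Dict[str, str]:
    """Turn a comma-separated LVM tag string into a dict of ceph.* metadata.

    Tags outside the assumed "ceph." namespace, or not in key=value form,
    are ignored.
    """
    metadata: Dict[str, str] = {}
    for tag in lv_tags.split(","):
        tag = tag.strip()
        if not tag.startswith(CEPH_PREFIX) or "=" not in tag:
            continue
        key, value = tag[len(CEPH_PREFIX):].split("=", 1)
        metadata[key] = value
    return metadata


def is_self_describing(
    metadata: Dict[str, str],
    required: Iterable[str] = ("cluster_fsid", "osd_fsid", "type"),
) -> bool:
    """Return True if every (hypothetical) required key has a non-empty value."""
    return all(metadata.get(key) for key in required)


# Hypothetical tag string for a single logical volume.
example = (
    "ceph.cluster_fsid=aaaa-1111,ceph.osd_fsid=bbbb-2222,"
    "ceph.type=db,backup=nightly"
)
meta = parse_lv_tags(example)
print(meta)
print(is_self_describing(meta))
```

The point of the sketch is only that the device carries enough information to be identified without an external file; a raw device with dm-crypt directly on top would give such a check nothing to read.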

    Thanks, I got what I wanted. This is the form in which deprecations
    should be presented to the community: "why do we do this, and what
    will it give us." Instead, it was presented as: "We kill the tool
    along with its functionality, you should use the new one as is, even
    if you do not know what it does."

    Thanks again, Sage. I think this post should be on the Ceph blog.

    k

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
