On Fri, 8 Jun 2018, Alfredo Deza wrote:
> On Fri, Jun 8, 2018 at 8:13 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > I'm going to jump in here with a few points.
> >
> > - ceph-disk was replaced for two reasons: (1) Its design was centered
> > around udev, and it was terrible. We have been plagued for years with
> > bugs due to race conditions in the udev-driven activation of OSDs,
> > mostly variations of "I rebooted and not all of my OSDs started." It's
> > horrible to observe and horrible to debug. (2) It was based on GPT
> > partitions. Lots of people had block layer tools they wanted to use
> > that were LVM-based, and the two didn't mix (no GPT partitions on top
> > of LVs).
> >
> > - We designed ceph-volume to be *modular* because we anticipate that
> > there are going to be lots of ways that people provision the hardware
> > devices that we need to consider. There are already two: legacy
> > ceph-disk devices that are still in use and have GPT partitions
> > (handled by 'simple'), and lvm. SPDK devices, where we manage NVMe
> > devices directly from userspace, are on the immediate horizon--
> > obviously LVM won't work there since the kernel isn't involved at all.
> > We can add any other schemes we like.
> >
> > - If you don't like LVM (e.g., because you find that there is a
> > measurable overhead), let's design a new approach! I wouldn't bother
> > unless you can actually measure an impact. But if you can demonstrate
> > a measurable cost, let's do it.
> >
> > - LVM was chosen as the default approach for new devices for a few
> > reasons:
> >   - It allows you to attach arbitrary metadata to each device, like
> > which cluster uuid it belongs to, which osd uuid it belongs to, which
> > type of device it is (primary, db, wal, journal), any secrets needed
> > to fetch its decryption key from a keyserver (the mon by default),
> > and so on.
> >   - One of the goals was to enable lvm-based block layer modules
> > beneath OSDs (dm-cache). All of the other devicemapper-based tools we
> > are aware of work with LVM. It was a hammer that hit all nails.
> >
> > - The 'simple' mode is the current 'out' that avoids using LVM if it's
> > not an option for you. We only implemented scan and activate because
> > that was all that we saw a current need for. It should be quite easy
> > to add the ability to create new OSDs.
> >
> > I would caution you, though, that 'simple' relies on a file in
> > /etc/ceph that has the metadata about the devices. If you lose that
> > file, you need to have some way to rebuild it, or we won't know what
> > to do with your devices. That means you should make the devices
> > self-describing in some way... not, say, a raw device with dm-crypt
> > layered directly on top, or some other option that makes it impossible
> > to tell what it is. As long as you can implement 'scan' and get any
> > other info you need (e.g., whatever is necessary to fetch decryption
> > keys), then great.
>
> 'scan' allows you to recreate that file from a data device or from an
> OSD directory (e.g. /var/lib/ceph/osd/ceph-0/).
>
> So even in the case of disaster (or migrating) we can still get that
> file again. This includes the ability to detect both ceph-disk's
> encryption support and regular OSDs.
>
> Do you mean that there might be situations where 'scan' wouldn't be
> able to recreate this file? I think the out would be if the OSD is
> mounted/available already.

Right, it works great for the GPT-style ceph-disk devices. I'm just
cautioning that if someone wants to implement a *new* mode that doesn't
use lvm or the legacy ceph-disk scheme and "uses raw devices for lower
overhead" (whatever that ends up meaning), it should be done in a way
such that scan can be implemented.

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
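[Editor's note] The "self-describing devices" requirement discussed above can be made concrete with a toy sketch. This is my own illustration, not ceph-volume's actual on-disk format or code: the file names and key=value layout here are invented for the example. In the real lvm mode the identifying metadata is attached as LVM tags (viewable with `lvs -o lv_tags`), and the real 'simple' workflow uses `ceph-volume simple scan` and `ceph-volume simple activate`.

```shell
# Toy illustration only -- NOT ceph-volume's real metadata format.
# It shows why a device that carries its own identifying metadata
# lets a 'scan' step rebuild the host-local /etc/ceph file at any time.

osd_dir=$(mktemp -d)   # stands in for a mounted OSD data device
etc_dir=$(mktemp -d)   # stands in for /etc/ceph/osd

# Provisioning: write identifying metadata onto the device itself
# (real lvm mode attaches the equivalent information as LVM tags).
printf 'osd_id=0\nosd_fsid=aaaa-bbbb\ntype=bluestore\n' > "$osd_dir/metadata"

# 'scan': rebuild the host-local description file from the device alone.
scan() {
    cp "$1/metadata" "$2/0-aaaa-bbbb"
}

rm -f "$etc_dir"/0-*        # simulate losing the /etc/ceph metadata file
scan "$osd_dir" "$etc_dir"  # recover it from the self-describing device
cat "$etc_dir/0-aaaa-bbbb"
```

A raw device with dm-crypt layered directly on top fails this test: there is nothing on it that a scan could read back, so the lost file cannot be reconstructed.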