On Fri, Jun 8, 2018 at 8:13 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> I'm going to jump in here with a few points.
>
> - ceph-disk was replaced for two reasons: (1) Its design was centered
> around udev, and it was terrible. We have been plagued for years with
> bugs due to race conditions in the udev-driven activation of OSDs,
> mostly variations of "I rebooted and not all of my OSDs started." It's
> horrible to observe and horrible to debug. (2) It was based on GPT
> partitions; lots of people had block layer tools they wanted to use
> that were LVM-based, and the two didn't mix (no GPT partitions on top
> of LVs).
>
> - We designed ceph-volume to be *modular* because we anticipate that
> there are going to be lots of ways that people provision the hardware
> devices that we need to consider. There are already two: legacy
> ceph-disk devices that are still in use and have GPT partitions
> (handled by 'simple'), and lvm. SPDK devices where we manage NVMe
> devices directly from userspace are on the immediate horizon--obviously
> LVM won't work there since the kernel isn't involved at all. We can
> add any other schemes we like.
>
> - If you don't like LVM (e.g., because you find that there is a
> measurable overhead), let's design a new approach! I wouldn't bother
> unless you can actually measure an impact. But if you can demonstrate
> a measurable cost, let's do it.
>
> - LVM was chosen as the default approach for new devices for a few
> reasons:
>   - It allows you to attach arbitrary metadata to each device, like
>     which cluster uuid it belongs to, which osd uuid it belongs to,
>     which type of device it is (primary, db, wal, journal), any
>     secrets needed to fetch its decryption key from a keyserver (the
>     mon by default), and so on.
>   - One of the goals was to enable lvm-based block layer modules
>     beneath OSDs (dm-cache). All of the other devicemapper-based tools
>     we are aware of work with LVM. It was a hammer that hit all nails.
>
> - The 'simple' mode is the current 'out' that avoids using LVM if it's
> not an option for you. We only implemented scan and activate because
> that was all that we saw a current need for. It should be quite easy
> to add the ability to create new OSDs.
>
> I would caution you, though, that 'simple' relies on a file in
> /etc/ceph that has the metadata about the devices. If you lose that
> file you need to have some way to rebuild it or we won't know what to
> do with your devices. That means you should make the devices
> self-describing in some way... not, say, a raw device with dm-crypt
> layered directly on top, or some other option that makes it impossible
> to tell what it is. As long as you can implement 'scan' and get any
> other info you need (e.g., whatever is necessary to fetch decryption
> keys) then great.

'scan' allows you to recreate that file from a data device or from an
OSD directory (e.g. /var/lib/ceph/osd/ceph-0/), so even in the case of
disaster (or migration) we can still get that file back. This includes
the ability to detect ceph-disk's encryption support as well as regular
OSDs.

Do you mean that there might be situations where 'scan' wouldn't be
able to recreate this file? I think the only 'out' would be if the OSD
is mounted/available already.
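To make the dependency on that /etc/ceph file concrete: 'scan' writes
one JSON file per OSD under /etc/ceph/osd/, and 'activate' reads it
back. Below is a minimal Python sketch of that read side. The key names
used (osd_id, fsid, data/path) are my assumptions for illustration, not
the authoritative schema -- look at a file actually produced by
'ceph-volume simple scan' for the real layout.

    #!/usr/bin/env python3
    # Sketch only: roughly what 'activate' needs from the metadata files
    # that 'scan' produces. Key names here are illustrative.
    import glob
    import json

    def load_osd_metadata(directory='/etc/ceph/osd'):
        osds = []
        for fname in sorted(glob.glob(directory + '/*.json')):
            with open(fname) as f:
                osds.append(json.load(f))
        return osds

    if __name__ == '__main__':
        for osd in load_osd_metadata():
            # Without this file (or the ability to re-run 'scan' against
            # the device/directory), there is no record of which device
            # belongs to which OSD.
            print(osd.get('osd_id'),
                  osd.get('fsid'),
                  osd.get('data', {}).get('path'))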
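Separately, on the quoted point about LVM letting you attach arbitrary
metadata to each device: that is done with plain LVM tags, which are
free-form strings on the LV. Here is a rough sketch using the standard
lvchange/lvs commands; the tag names (ceph.osd_id, ceph.cluster_fsid,
ceph.type) and the LV path are illustrative rather than the exact set
ceph-volume stores -- 'lvs -o lv_tags' on a real OSD shows the
authoritative list.

    #!/usr/bin/env python3
    # Sketch only: attach/read OSD metadata as LVM tags via the standard
    # lvchange/lvs CLI. Tag names and LV path below are illustrative.
    import subprocess

    def add_tags(lv_path, tags):
        # Each LVM tag is a free-form string, so key=value pairs can
        # carry arbitrary metadata (cluster uuid, osd uuid, type, ...).
        for key, value in tags.items():
            subprocess.check_call(
                ['lvchange', '--addtag', '{}={}'.format(key, value), lv_path])

    def read_tags(lv_path):
        out = subprocess.check_output(
            ['lvs', '--noheadings', '-o', 'lv_tags', lv_path])
        out = out.decode().strip()
        return dict(t.split('=', 1) for t in out.split(',') if '=' in t)

    if __name__ == '__main__':
        lv = '/dev/ceph-vg/osd-block-0'   # hypothetical LV
        add_tags(lv, {'ceph.osd_id': '0',
                      'ceph.cluster_fsid': '11111111-2222-3333-4444-555555555555',
                      'ceph.type': 'block'})
        print(read_tags(lv))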
>
> sage