On Tue, 19 Jun 2018, Alfredo Deza wrote:
> One of the top questions for ceph-volume has been "why this doesn't create
> partitions like ceph-disk does?". Although we have initially focused on LVM,
> the same question is true (except for LVs instead of partitions). Now that
> ceph-volume is stabilizing, we can expand on a more user-friendly approach.
>
> We are planning on creating an interface to size devices automatically based
> on some simple criteria. There are three distinct use cases that we are going
> to support, ranging from easy OSD provisioning with defaults to more esoteric
> use cases with third-party systems (like rook, ceph-ansible, seasalt, etc...)
>
> This is being implemented as a separate sub-command to avoid piling up the
> complexity on the existing `lvm` one, and to reflect the automation behind it.
>
> Here are some examples of how the API is being designed, for fully automatic
> configuration, semi-automatic (allows input), and manual via a config
> management system:
>
> Automatic (no configuration or options required):
> -------------------------------------------------
>
> Single device type:
>
>     $ ceph-volume auto
>     Use --yes to run
>     Detected devices:
>       [rotational] /dev/sda    1TB
>       [rotational] /dev/sdb    1TB
>       [rotational] /dev/sdc    1TB
>
>     Expected Bluestore OSDs:
>
>       data: /dev/sda (100%)
>       data: /dev/sdb (100%)
>       data: /dev/sdc (100%)
>
> This scenario will detect a single type of unused device (rotational)
> so the bluestore OSD will be created on each without block.db or block.wal
>
>
> Mixed devices:
>
>     $ ceph-volume auto
>     Use --yes to run
>     Detected devices:
>       [rotational] /dev/sda    1TB
>       [rotational] /dev/sdb    1TB
>       [rotational] /dev/sdc    1TB
>       [solid     ] /dev/sdd    500GB
>
>     Expected Bluestore OSDs:
>
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>
> This scenario will detect the unused devices in the system and understand
> that there is a mix of solid and rotational devices, will place block on the
> rotational ones, and will split the ssd into as many parts as rotational
> devices found (3 in this case).
>
>
> Semi configurable outcome, with input:
> --------------------------------------
> A user might not want to consume the devices that were automatically detected
> in the system as free, so the interface will allow passing these devices
> directly as input.
>
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc
>     Device information:
>       [rotational] /dev/sda    1TB
>       [rotational] /dev/sdb    1TB
>       [rotational] /dev/sdc    1TB
>
>     Expected Bluestore OSDs:
>
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>
>     Please hit Enter to continue, or Ctrl-C to cancel
>
> Similarly, for mixed devices:
>
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc /dev/sdd
>     Use --yes to run
>     Device information:
>       [rotational] /dev/sda    1TB
>       [rotational] /dev/sdb    1TB
>       [rotational] /dev/sdc    1TB
>       [solid     ] /dev/sdd    500GB
>
>     Expected Bluestore OSDs:
>
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>
>     Please hit Enter to continue, or Ctrl-C to cancel

I think these two scenarios are the most important because there is
ambiguity in what the tool should do and the user needs to provide some
(high-level) guidance: do we want distinct pools of devices by type (HDD
OSDs and SSD OSDs), or do we want to combine devices for "hybrid" OSDs
(each OSD uses an HDD and part of an SSD)?

I have two alternative proposals for framing this:

1) Drop the full 'auto' mode at the top and *only* provide this mode,
where a list of devices is provided, because I'm not sure we can have an
opinion about how to combine (or not combine) the devices.
In contrast, if we are told to provision sd{a,b,c,d} as a batch, then we
*can* have an opinion about how to best combine those devices. (Today,
that is a trivial opinion: carve sdd into 4 parts; tomorrow, it might be
more nuanced).

The command set could instead be something like

 $ ceph-volume discover-unused-devices
 {
   'sda': {'rotational': 1, ...},
   'sdb': ...
 }

This command would codify checks for existing file systems, multipath
workarounds, and all the other weird issues that the ceph-ansible folks
have learned about avoiding in-use devices.

Then there would be a second command that tells the user what it would do,

 $ ceph-volume plan-batch <device list>
 ...

And finally the command that does it,

 $ ceph-volume prepare-batch <device list>
 $ ceph-volume prepare-batch <device list 2>   # if there are 2 classes of osd

One nice thing about this approach is that the user (either a human or
ansible or some other tool) is in the middle making the call about how to
group devices, which means that in the mixed HDD/SSD case they are making
the choice about whether to make two kinds of OSDs or hybrid OSDs.

2) Instead of running the tool twice, run it once and pass a flag
indicating that multiple classes of OSD should be created. Maybe
something like

 $ ceph-volume prepare-batch --uniform <device list>
 $ ceph-volume prepare-batch --multi-class <device list>

The one scenario that comes to mind that option 1 *doesn't* cover is a bit
uncommon, but might be worth thinking about: a host where we have an NVMe
and want to use part of it for journals/db partitions and part of it as a
standalone SSD. For example,

 data: /dev/sda (100%), block.db: /dev/sdd (20%)
 data: /dev/sdb (100%), block.db: /dev/sdd (20%)
 data: /dev/sdc (100%), block.db: /dev/sdd (20%)
 data: /dev/sdd (40%)

Letting the tool do this batching in some wonky way (with options) might
let us do something like the above in an easy way. Maybe an argument
would give the tool some guidance for how much of the SSD-only class is
needed.
I'm not really convinced it is a good idea to go down this path, but it
might give us more flexibility to do more later. I'm having a hard time
imagining how we can make good decisions here without lots of hints from
the user, like "this will be an archival workload," and it seems like
that type of guidance might be better enshrined in a tool or command set
layered on top of this one.

Thoughts?

> Fully Manual (config management systems):
> -----------------------------------------
> A JSON file or a blob as a positional argument would allow fine-tuning other
> specifics, like using 2 OSDs per NVMe device, or determining an exact size
> for a block.db or even a block.wal LV.
>
>     $ ceph-volume auto /etc/ceph/custom_osd_provisioning.json
>
> Or:
>
>     $ ceph-volume auto "{ ... }"
>
>
> Here the API is still undefined as of now, but the idea is to expand on more
> complex setups that can be better managed by configuration management systems

Is the idea here that the input would be something like the percentages
you have above, and maybe some flags? That seems reasonably general to me
and I'm not sure what else we might need. Flags might be something like
"use dmcrypt" or "use VDO" or whatever.

If we do implement this, what if the output of the "plan" command in 1 is
the input for this command? (And any "auto" command just strings the two
of them together in one invocation?)

sage
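
A rough sketch of the plan-as-input idea, in Python. Nothing here is an
existing ceph-volume interface: the names plan_batch and render, the
standalone_percent argument, and the JSON layout are all hypothetical,
chosen only to mirror the "data: ... (N%)" listings above. It assumes the
device map looks like the discover output sketched earlier (device name
mapped to a 'rotational' flag) and that rotational devices are spread
round-robin across solid ones.

```python
import json


def plan_batch(devices, standalone_percent=0):
    """Return a list of planned OSDs (hypothetical layout, not a real API).

    devices: {name: {"rotational": 0 or 1}}, as in the discover sketch.
    standalone_percent: share of each solid device kept as its own OSD.
    """
    if not 0 <= standalone_percent < 100:
        raise ValueError("standalone_percent must be in the range 0..99")

    rotational = sorted(n for n, info in devices.items() if info["rotational"])
    solid = sorted(n for n, info in devices.items() if not info["rotational"])

    # Uniform case: only one device type, so each device is a whole OSD.
    if not rotational or not solid:
        return [{"data": {"device": dev, "percent": 100}}
                for dev in rotational + solid]

    # Hybrid case: assign rotational devices round-robin to solid devices.
    groups = {ssd: [] for ssd in solid}
    for i, hdd in enumerate(rotational):
        groups[solid[i % len(solid)]].append(hdd)

    plan = []
    for ssd, hdds in groups.items():
        if not hdds:
            # More solid devices than rotational ones: use it whole.
            plan.append({"data": {"device": ssd, "percent": 100}})
            continue
        # Split what is left of the solid device evenly (integer percent).
        share = (100 - standalone_percent) // len(hdds)
        for hdd in hdds:
            plan.append({
                "data": {"device": hdd, "percent": 100},
                "block.db": {"device": ssd, "percent": share},
            })
        if standalone_percent:
            plan.append({"data": {"device": ssd,
                                  "percent": standalone_percent}})
    return plan


def render(plan):
    """Format a plan the way the examples in this thread are written."""
    lines = []
    for osd in plan:
        parts = ["%s: %s (%d%%)" % (role, part["device"], part["percent"])
                 for role, part in osd.items()]
        lines.append(", ".join(parts))
    return "\n".join(lines)


if __name__ == "__main__":
    devices = {
        "/dev/sda": {"rotational": 1},
        "/dev/sdb": {"rotational": 1},
        "/dev/sdc": {"rotational": 1},
        "/dev/sdd": {"rotational": 0},
    }
    plan = plan_batch(devices, standalone_percent=40)
    print(render(plan))
    # The same plan serialized as JSON is what a later "prepare" step
    # could accept as its input.
    print(json.dumps(plan, indent=2))
```

The point of the sketch is only that the plan is plain data: a human or a
config management system can edit the JSON between the two steps, and an
"auto" command would simply be plan followed by prepare with no edit in
between.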