Re: ceph-volume and automatic OSD provisioning

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 20 Jun 2018 12:36:37 +0000 (UTC)

On Tue, 19 Jun 2018, Alfredo Deza wrote:
> One of the top questions for ceph-volume has been "why doesn't this create
> partitions like ceph-disk does?". Although we have initially focused on LVM,
> the same question applies (with LVs instead of partitions). Now that
> ceph-volume is stabilizing, we can expand on a more user-friendly approach.
> 
> We are planning on creating an interface to size devices automatically based on
> some simple criteria. There are three distinct use cases that we are going to
> support, ranging from easy OSD provisioning with defaults to more esoteric
> use cases with third-party systems (like rook, ceph-ansible, seasalt,
> etc.).
> 
> This is being implemented as a separate sub-command to avoid piling up the
> complexity on the existing `lvm` one, and to reflect the automation behind it.
> 
> Here are some examples on how the API is being designed, for fully automatic
> configuration, semi-automatic (allows input), and manual via a config
> management system:
> 
> Automatic (no configuration or options required):
> -------------------------------------------------
> 
> Single device type:
> 
>     $ ceph-volume auto
>      Use --yes to run
>      Detected devices:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
> 
>      Expected Bluestore OSDs:
> 
>       data: /dev/sda (100%)
>       data: /dev/sdb (100%)
>       data: /dev/sdc (100%)
> 
> This scenario will detect a single type of unused device (rotational), so the
> bluestore OSD will be created on each without block.db or block.wal.
> 
> 
> Mixed devices:
> 
>     $ ceph-volume auto
>      Use --yes to run
>      Detected devices:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>        [solid     ] /dev/sdd    500GB
> 
>      Expected Bluestore OSDs:
> 
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
> 
> This scenario will detect the unused devices in the system and understand that
> there is a mix of solid and rotational devices. It will place block on the
> rotational ones and split the SSD into as many parts as there are rotational
> devices (3 in this case).
> 
> 
> Semi configurable outcome, with input:
> --------------------------------------
> A user might not want to consume the devices that were automatically detected
> in the system as free, so the interface will allow passing these devices
> directly as input.
> 
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc
>      Device information:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
> 
>      Expected Bluestore OSDs:
> 
>       data: /dev/sda (100%)
>       data: /dev/sdb (100%)
>       data: /dev/sdc (100%)
> 
>     Please hit Enter to continue, or Ctrl-C to cancel
> 
> Similarly, for mixed devices:
> 
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc /dev/sdd
>      Use --yes to run
>      Device information:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>        [solid     ] /dev/sdd    500GB
> 
>      Expected Bluestore OSDs:
> 
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
> 
>     Please hit Enter to continue, or Ctrl-C to cancel

I think these two scenarios are the most important because there is 
ambiguity in what the tool should do and the user needs to provide 
some (high-level) guidance: do we want distinct pools of devices by type 
(HDD OSDs and SSD OSDs), or do we want to combine devices for "hybrid" 
OSDs (each OSD uses an HDD and part of an SSD)?

I have two alternative proposals for framing this:

1) Drop the full 'auto' mode at the top and *only* provide this mode, 
where a list of devices is provided, because I'm not sure we can have an 
opinion about how to combine (or not combine) the devices.  In contrast, 
if we are told to provision sd{a,b,c,d} as a batch, then we *can* have an 
opinion about how to best combine those devices.  (Today, that is a 
trivial opinion: carve sdd into 3 parts, one per HDD; tomorrow, it might 
be more nuanced).

The command set could instead be something like

 $ ceph-volume discover-unused-devices
 { 'sda': {'rotational': 1, ...},
   'sdb': ...
 }

This command would codify checks for existing file systems, multipath 
workarounds, and all the other weird issues that the ceph-ansible 
folks have learned about avoiding in-use devices.
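To illustrate only the shape of that output, here is a minimal sketch in 
Python.  It is not ceph-volume code: the function name `discover_devices` 
and its `sys_block` parameter are hypothetical, and it reads nothing but 
the kernel's `queue/rotational` flag under sysfs.  It performs none of the 
in-use checks (file systems, multipath, and so on) described above.

```python
import os


def discover_devices(sys_block="/sys/block"):
    """Return {name: {'rotational': 0 or 1}} for each block device found.

    Hypothetical helper for illustration: it only reads the kernel's
    queue/rotational flag and does NOT verify that a device is unused.
    """
    devices = {}
    for name in sorted(os.listdir(sys_block)):
        flag_path = os.path.join(sys_block, name, "queue", "rotational")
        try:
            with open(flag_path) as f:
                devices[name] = {"rotational": int(f.read().strip())}
        except (OSError, ValueError):
            # No readable rotational flag for this entry: skip it.
            continue
    return devices
```

A real implementation would layer the in-use checks on top of this before 
reporting a device as available.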

Then there would be a second command that tells the user what it would do,

 $ ceph-volume plan-batch <device list>
 ...

And finally the command that does it,

 $ ceph-volume prepare-batch <device list>
 $ ceph-volume prepare-batch <device list 2>  # if there are 2 classes of osd

One nice thing about this approach is that the user (either a human or 
ansible or some other tool) is in the middle making the call about how to 
group devices, which means that in the mixed HDD/SSD case they are making 
the choice about whether to make two kinds of OSDs or hybrid OSDs.
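As a sketch of what the "trivial opinion" for a single batch might look 
like, assuming the input has the shape shown for discover-unused-devices: 
the function name `plan_batch`, the plan format, and the restriction to a 
single SSD per batch are all assumptions made for illustration, not a 
proposed implementation.

```python
def plan_batch(devices):
    """Plan OSDs for one batch of devices.

    devices: {name: {'rotational': 0 or 1}} (hypothetical input shape).
    Returns a list of dicts mapping a role to (device, percent).
    """
    hdds = sorted(n for n, d in devices.items() if d["rotational"])
    ssds = sorted(n for n, d in devices.items() if not d["rotational"])

    if not hdds or not ssds:
        # Uniform batch: every device becomes a standalone OSD.
        return [{"data": (name, 100)} for name in sorted(devices)]

    if len(ssds) != 1:
        # Assumption for this sketch: only one SSD per hybrid batch.
        raise ValueError("this sketch handles exactly one SSD per batch")

    # Hybrid batch: split the SSD evenly, one block.db share per HDD.
    share = 100 // len(hdds)
    return [{"data": (hdd, 100), "block.db": (ssds[0], share)} for hdd in hdds]
```

Calling it once with sd{a,b,c,d} yields hybrid OSDs; calling it twice, once 
with the HDDs and once with the SSD, yields two classes of standalone OSDs. 
The caller's grouping is what makes the decision.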

2) Instead of running the tool twice, run it once and pass a flag 
indicating that multiple classes of OSD should be created.  Maybe 
something like

 $ ceph-volume prepare-batch --uniform <device list>
 $ ceph-volume prepare-batch --multi-class <device list>

The one scenario that comes to mind that option 1 *doesn't* cover is a bit 
uncommon, but might be worth thinking about: a host where we have an NVMe 
and want to use part of it for journal/db partitions and part of it as a 
standalone SSD OSD.  For example,

       data: /dev/sda (100%), block.db: /dev/sdd (20%)
       data: /dev/sdb (100%), block.db: /dev/sdd (20%)
       data: /dev/sdc (100%), block.db: /dev/sdd (20%)
       data: /dev/sdd (40%)

Letting the tool do this batching in some wonky way (with options) might 
let us do something like the above in an easy way.  Maybe an argument 
would give the tool some guidance for how much of the SSD-only class is 
needed.
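The arithmetic behind the example above can be sketched as follows.  The 
function name `plan_with_standalone` and the `standalone_pct` argument are 
hypothetical; they stand in for whatever guidance the tool would accept 
about how much of the SSD to reserve for the SSD-only class.

```python
def plan_with_standalone(hdds, ssd, standalone_pct):
    """Split one SSD between block.db shares and a standalone OSD.

    hdds: list of HDD device names; ssd: the fast device name;
    standalone_pct: percent of the SSD reserved for a standalone OSD
    (a hypothetical hint supplied by the user).
    """
    if not hdds:
        raise ValueError("need at least one HDD")
    if not 0 <= standalone_pct <= 100:
        raise ValueError("standalone_pct must be between 0 and 100")

    # Whatever is not reserved is divided evenly among the HDD OSDs.
    db_share = (100 - standalone_pct) // len(hdds)
    plan = [{"data": (hdd, 100), "block.db": (ssd, db_share)} for hdd in hdds]
    if standalone_pct:
        plan.append({"data": (ssd, standalone_pct)})
    return plan
```

With three HDDs and 40% reserved, each block.db gets (100 - 40) / 3 = 20% 
of the SSD, matching the layout above.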

I'm not really convinced it is a good idea to go this path, but it might 
give us more flexibility to do more later.  I'm having a hard time 
imagining how we can make good decisions here without lots of hints from 
the user, like "this will be an archival workload," and it seems like that 
type of guidance might be better enshrined in a tool or command set 
layered on top of this one.

Thoughts?

> Fully Manual (config management systems):
> -----------------------------------------
> A JSON file or a blob as a positional argument would allow fine-tuning other
> specifics, like using 2 OSDs per NVMe device, or determining an exact size for
> a block.db or even a block.wal LV.
> 
>     $ ceph-volume auto /etc/ceph/custom_osd_provisioning.json
> 
> Or:
> 
>     $ ceph-volume auto "{ ... }"
> 
> 
> Here the API is still undefined as of now, but the idea is to expand on more
> complex setups that can be better managed by configuration management systems

Is the idea here that the input would be something like the percentages 
you have above, and maybe some flags?  That seems reasonably general to me 
and I'm not sure what else we might need.  Flags might be something like 
"use dmcrypt" or "use VDO" or whatever.

If we do implement this, what if the output of the "plan" command in 
option 1 is the input for this command?  (And any "auto" command just 
strings the two of them together in one invocation?)
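A minimal sketch of that composition, in Python, with every name invented 
for illustration (`plan`, `prepare`, `auto`, the JSON layout, and the 
`flags` argument are all assumptions; nothing here touches real devices):

```python
import json


def plan(devices):
    """Stand-in planner: one standalone OSD per device, as a JSON string."""
    osds = [{"data": {"device": dev, "percent": 100}} for dev in devices]
    return json.dumps({"osds": osds})


def prepare(plan_json, flags=None):
    """Stand-in executor: parse a plan and return the steps it would run."""
    spec = json.loads(plan_json)
    suffix = "".join(" +" + flag for flag in sorted(flags or []))
    steps = []
    for osd in spec["osds"]:
        parts = [
            "%s=%s:%d%%" % (role, part["device"], part["percent"])
            for role, part in osd.items()
        ]
        steps.append(", ".join(parts) + suffix)
    return steps


def auto(devices, flags=None):
    """'auto' is just plan followed by prepare in one invocation."""
    return prepare(plan(devices), flags)
```

Because the plan is plain JSON, a config management system could generate 
or edit it by hand and feed it to the second step without ever running the 
first.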

sage