Re: ceph-volume and automatic OSD provisioning

On Wed, Jun 20, 2018 at 8:36 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 19 Jun 2018, Alfredo Deza wrote:
>> One of the top questions for ceph-volume has been "why doesn't this create
>> partitions like ceph-disk does?". Although we have initially focused on LVM,
>> the same question applies (just with LVs instead of partitions). Now that
>> ceph-volume is stabilizing, we can expand to a more user-friendly approach.
>>
>> We are planning on creating an interface to size devices automatically based on
>> some simple criteria. There are three distinct use cases that we are going to
>> support, ranging from easy OSD provisioning with defaults to more esoteric
>> use cases driven by third-party systems (like rook, ceph-ansible, seasalt,
>> etc...)
>>
>> This is being implemented as a separate sub-command to avoid piling up
>> complexity on the existing `lvm` one, and to reflect the automation behind it.
>>
>> Here are some examples of how the API is being designed, for fully automatic
>> configuration, semi-automatic (allows input), and manual via a config
>> management system:
>>
>> Automatic (no configuration or options required):
>> -------------------------------------------------
>>
>> Single device type:
>>
>>     $ ceph-volume auto
>>      Use --yes to run
>>      Detected devices:
>>        [rotational] /dev/sda    1TB
>>        [rotational] /dev/sdb    1TB
>>        [rotational] /dev/sdc    1TB
>>
>>      Expected Bluestore OSDs:
>>
>>       data: /dev/sda (100%)
>>       data: /dev/sdb (100%)
>>       data: /dev/sdc (100%)
>>
>> This scenario will detect a single type of unused device (rotational), so a
>> bluestore OSD will be created on each device without a block.db or block.wal.
>>
>>
>> Mixed devices:
>>
>>     $ ceph-volume auto
>>      Use --yes to run
>>      Detected devices:
>>        [rotational] /dev/sda    1TB
>>        [rotational] /dev/sdb    1TB
>>        [rotational] /dev/sdc    1TB
>>        [solid     ] /dev/sdd    500GB
>>
>>      Expected Bluestore OSDs:
>>
>>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>>
>> This scenario will detect the unused devices in the system, understand that
>> there is a mix of solid-state and rotational devices, place block (data) on the
>> rotational ones, and split the SSD into as many pieces as there are rotational
>> devices (3 in this case).
>>
>>
>> Semi configurable outcome, with input:
>> --------------------------------------
>> A user might not want to consume the devices that were automatically detected
>> in the system as free, so the interface will allow passing the desired devices
>> directly as input.
>>
>>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc
>>      Device information:
>>        [rotational] /dev/sda    1TB
>>        [rotational] /dev/sdb    1TB
>>        [rotational] /dev/sdc    1TB
>>
>>      Expected Bluestore OSDs:
>>
>>       data: /dev/sda (100%)
>>       data: /dev/sdb (100%)
>>       data: /dev/sdc (100%)
>>
>>     Please hit Enter to continue, or Ctrl-C to cancel
>>
>> Similarly, for mixed devices:
>>
>>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc /dev/sdd
>>      Use --yes to run
>>      Device information:
>>        [rotational] /dev/sda    1TB
>>        [rotational] /dev/sdb    1TB
>>        [rotational] /dev/sdc    1TB
>>        [solid     ] /dev/sdd    500GB
>>
>>      Expected Bluestore OSDs:
>>
>>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>>
>>     Please hit Enter to continue, or Ctrl-C to cancel
>
> I think these two scenarios are the most important because there is
> ambiguity in what the tool should do, and the user needs to provide some
> (high-level) guidance: do we want distinct pools of devices by type
> (HDD OSDs and SSD OSDs), or do we want to combine devices for "hybrid"
> OSDs (each OSD uses an HDD and part of an SSD)?
>
> I have two alternative proposals for framing this:
>
> 1) Drop the full 'auto' mode at the top and *only* provide this mode,
> where a list of devices is provided, because I'm not sure we can have an
> opinion about how to combine (or not combine) the devices.  In contrast,
> if we are told to provision sd{a,b,c,d} as a batch, then we *can* have an
> opinion about how to best combine those devices.  (Today, that is a
> trivial opinion: carve sdd into 4 parts; tomorrow, it might be more
> nuanced).

I think these two modes are opinionated for sure, and when we aren't
doing quite what is desired there are two options:

* manually (!) create your LVs and pass them in as usual
* use a higher-level tool (config management system) to specify the outcome

Even in the mode that you prefer, the combination (or lack thereof) can
still happen, and the logic to deal with it would be the same.
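
To make this concrete, here is a rough sketch of the kind of decision the
tool would make (Python-ish and illustrative only, not the actual
implementation; the device dicts and their fields are made up):

    # Illustrative only: how the tool could decide to combine devices.
    def plan_osds(devices):
        hdds = [d for d in devices if d['rotational']]
        ssds = [d for d in devices if not d['rotational']]
        if not ssds:
            # single device type: one standalone bluestore OSD per device
            return [{'data': d['path']} for d in hdds]
        # mixed devices (simplified here to a single SSD): split the SSD
        # into as many block.db pieces as there are rotational devices
        db_pct = 100 // len(hdds)
        return [{'data': hdd['path'],
                 'block.db': (ssds[0]['path'], '%d%%' % db_pct)}
                for hdd in hdds]

With the mixed example above (3 rotational devices and 1 SSD) that gives the
33% split per block.db, regardless of whether the devices were detected
automatically or passed in explicitly.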

Is your concern that we would get it terribly wrong, and users would
try to default to the fully-automatic mode instead of the semi-automatic
one where input can be provided?

>
> The command set could instead be something like
>
>  $ ceph-volume discover-unused-devices
>  { 'sda': {'rotational': 1, ...},
>    'sdb': ...
>  }
>
> This command would codify checks for existing file systems, multipath
> workarounds, and all the other weird issues that the ceph-ansible
> folks have learned about avoiding in-use devices.

Yes, we already have some checks in place, and Sebastien mentioned a
couple more that should be added.
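
For reference, the kind of checks we are talking about look roughly like
this (a simplified sketch, not the real ceph-volume code):

    # Simplified sketch of the "is this device unused?" detection.
    import json
    import subprocess

    def unused_devices():
        out = subprocess.check_output(
            ['lsblk', '--json', '-o', 'NAME,TYPE,ROTA,FSTYPE,MOUNTPOINT'])
        usable = {}
        for dev in json.loads(out)['blockdevices']:
            if dev['type'] != 'disk':
                continue
            # skip anything with an existing filesystem, partitions, or a mount
            if dev.get('fstype') or dev.get('mountpoint') or dev.get('children'):
                continue
            usable['/dev/%s' % dev['name']] = {'rotational': int(dev['rota'])}
        return usable

The real implementation also has to deal with things like multipath members
and existing LVM or Ceph metadata, which is exactly the knowledge we want to
codify in one place.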

>
> Then there would be a second command that tells the user what it would do,
>
>  $ ceph-volume plan-batch <device list>
>  ...
>
> And finally the command that does it,
>
>  $ ceph-volume prepare-batch <device list>
>  $ ceph-volume prepare-batch <device list 2>  # if there are 2 classes of osd

I think this is very similar, though: all of the examples I provided
show what the outcome for a given list of devices would be (except for
the JSON one). So in theory, yes, you could tinker with the tool, passing
a bunch of different disks each time and looking at what the outcome
would be.

Something of value for external systems may be what you mentioned a
while ago: a dry-run mode that can output user-friendly information as
well as JSON, so tools like the dashboard can consume it to report back
to a user. All of this is *aside* from what I showed in my examples.
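
As a strawman (the format and field names are completely made up at this
point), the JSON side of that dry-run report could be something like:

    {
      "osds": [
        {"data": {"path": "/dev/sda", "percentage": 100},
         "block.db": {"path": "/dev/sdd", "percentage": 33}},
        {"data": {"path": "/dev/sdb", "percentage": 100},
         "block.db": {"path": "/dev/sdd", "percentage": 33}},
        {"data": {"path": "/dev/sdc", "percentage": 100},
         "block.db": {"path": "/dev/sdd", "percentage": 33}}
      ]
    }

That is the same information as the human-readable output in my examples,
just in a form the dashboard or other tooling can consume.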

>
> One nice thing about this approach is that the user (either a human or
> ansible or some other tool) is in the middle making the call about how to
> group devices, which means that in the mixed HDD/SSD case they are making
> the choice about whether to make two kinds of OSDs or hybrid OSDs.
>
> 2) Instead of running the tool twice, run it once and pass a flag
> indicating that multiple classes of OSD should be created.  Maybe
> something like
>
>  $ ceph-volume prepare-batch --uniform <device list>
>  $ ceph-volume prepare-batch --multi-class <device list>

I see, you are thinking that a user may want to have a "uniform"
strategy even with different device types (rotational and solid)?

I would prefer to be opinionated :) But I can see how just a few options
would help here.

>
> The one scenario that comes to mind that option 1 *doesn't* cover is a bit
> uncommon, but might be worth thinking about: a host where we have an NVMe
> device and want to use part of it for journals/db partitions and part of it
> as a standalone SSD.  For example,
>
>        data: /dev/sda (100%), block.db: /dev/sdd (20%)
>        data: /dev/sdb (100%), block.db: /dev/sdd (20%)
>        data: /dev/sdc (100%), block.db: /dev/sdd (20%)
>        data: /dev/sdd (40%)
>
> Letting the tool do this batching in some wonky way (with options) might
> let us do something like the above in an easy way.  Maybe an argument
> would give the tool some guidance for how much of the SSD-only class is
> needed.
>
> I'm not really convinced it is a good idea to go this path, but it might
> give us more flexibility to do more later.  I'm having a hard time
> imagining how we can make good decisions here without lots of hints from
> the user, like "this will be an archival workload," and it seems like that
> type of guidance might be better enshrined in a tool or command set
> layered on top of this one.
>
> Thoughts?

The burden for uncommon configurations has to go to higher-level
tooling, I think. These approaches aren't always going to get it right,
and that is why we are providing a far more configurable input mode for
other systems to use.

A problem with uncommon setups like the one you described is that it
opens the possibility for users to ask "why is this (odd) setup
supported but not this other one that I need?"


>
>> Fully Manual (config management systems):
>> -----------------------------------------
>> A JSON file or a blob as a positional argument would allow fine-tuning other
>> specifics, like using 2 OSDs per NVMe device, or determining an exact size for
>> a block.db or even a block.wal LV.
>>
>>     $ ceph-volume auto /etc/ceph/custom_osd_provisioning.json
>>
>> Or:
>>
>>     $ ceph-volume auto "{ ... }"
>>
>>
>> Here the API is still undefined as of now, but the idea is to support more
>> complex setups that can be better managed by configuration management systems.
>
> Is the idea here that the input would be something like the percentages
> you have above, and maybe some flags?  That seems reasonably general to me
> and I'm not sure what else we might need.  Flags might be something like
> "use dmcrypt" or "use VDO" or whatever.

There are tons of variations here that would be hard/awkward to support
with the other modes (some of them already supported by 'create'); a sketch
of what such a blob could look like follows the list:

* how many OSDs per device
* an optional block.wal or block.db, on what device and at what size
* what device to use for block (data)
* enable dmcrypt
* skip systemd unit creation
* bluestore or filestore
* crush device class
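
Nothing about this blob format is settled and every key below is
hypothetical, but roughly:

    {
      "objectstore": "bluestore",
      "osds_per_device": 2,
      "dmcrypt": true,
      "no_systemd": false,
      "crush_device_class": "nvme",
      "devices": [
        {"data": "/dev/nvme0n1",
         "block.db": {"device": "/dev/nvme1n1", "size": "30G"},
         "block.wal": {"device": "/dev/nvme1n1", "size": "2G"}}
      ]
    }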


>
> If we do implement this, what if the output of the "plan" command in 1 is
> the input for this command?  (And any "auto" command just strings the two
> of them together in one invocation?)

Are you thinking of this as user-facing? I'm inclined to say it is not
straightforward enough for users, but higher-level tooling can definitely
put this together (programmatically) with its understanding at the time
of provisioning.
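
For example, something along these lines (the flags are made up, none of
this exists yet):

    $ ceph-volume plan-batch --format json /dev/sda /dev/sdb /dev/sdc /dev/sdd > plan.json
    # tooling inspects or adjusts plan.json here
    $ ceph-volume auto plan.json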

Sebastien (and other tooling people like rook/seasalt/dashboard):
maybe you can give us a hint here on what makes more sense to you?

>
> sage


