Re: ceph-volume and automatic OSD provisioning

On Thu, Jun 21, 2018 at 2:34 PM Erwan Velu <evelu@xxxxxxxxxx> wrote:
>
> The idea of making an automatic configuration is tied to the concept of being opinionated about which kinds of devices should be associated.

Yes and no.  The higher-level orchestrator (rook/ceph-ansible/deepsea)
should make the decision about which types of devices to group
together.  In most cases this is trivial, because typical systems just
have one type of HDD and one type of SSD.  In more heterogeneous cases,
the orchestrator would either apply its own policy or take user input
to say which devices should go together.  From ceph-volume's point of
view, it just gets a list of devices either way.

The following points become simpler to answer once we make this design
choice: ceph-volume's auto mode is not responsible for deciding *which*
devices to use, or which SSDs go with which HDDs; it just takes an
explicit list of devices and lays out OSDs across them.
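
For illustration, a minimal sketch (in Python, with made-up names; this
is not the real ceph-volume code) of what that contract looks like: auto
mode is handed an explicit device list and only decides the layout:

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Device:
        path: str          # e.g. "/dev/sda"
        size_gb: int
        rotational: bool   # True for HDD, False for SSD/NVMe

    def layout_osds(devices: List[Device]) -> List[Dict[str, str]]:
        hdds = [d for d in devices if d.rotational]
        ssds = [d for d in devices if not d.rotational]
        if not hdds or not ssds:
            # Single device type: one plain bluestore OSD per device.
            return [{"data": d.path} for d in devices]
        # Mixed devices: data on the HDDs, block.db carved out of the SSD.
        # Simplest case only here: one SSD split evenly across all HDDs.
        share = 100 // len(hdds)
        return [{"data": d.path, "block.db": f"{ssds[0].path} ({share}%)"}
                for d in hdds]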

> You spoke about using a ratio between SSDs and HDDs to get a good setup.
> What should be the behavior of the tool if the ratio:
> - cannot be reached (not enough HDDs for 1 SSD)?

That's the easy case: create an extra OSD in the empty space.  If the
user didn't want that, then deleting an OSD later is easy.

> - is exceeded (if we have 1 more HDD than expected, shall it be included or left alone)?

If the HDD:SSD ratio exceeds our arbitrary limit, we should not try
to create any DB/WAL on the SSD, and instead just use the SSD as an
OSD.  The users that have a ratio like this but really want to use the
SSDs as DB/WAL would form part of "the 10%" of people for whom the
autoselection doesn't work well.

This is obviously a completely arbitrary behaviour, but that's OK,
it's just a default -- the user gets a preview of what is going to
happen, and can intervene if this is not what they want.
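
To illustrate, a minimal sketch of that default decision; the cap value
and names are made up for the example, not anything ceph-volume actually
implements:

    MAX_HDDS_PER_SSD = 5   # arbitrary cap, purely illustrative

    def wants_shared_db(num_hdds: int, num_ssds: int) -> bool:
        """Decide whether the SSDs should be sliced up for block.db at all."""
        if num_ssds == 0:
            return False
        return (num_hdds / num_ssds) <= MAX_HDDS_PER_SSD

    # e.g. 3 HDDs, 1 SSD  -> True:  the SSD is sliced into block.db pieces.
    #      12 HDDs, 1 SSD -> False: the SSD becomes a plain OSD and the
    #                               HDDs get no block.db/block.wal.
    # If the ratio is under-filled instead (fewer HDDs than one SSD can
    # serve), the leftover SSD space becomes an extra OSD, as above.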

> If we have several SSDs free, which one should be used?

Pick an arbitrary order, like sorting by their by-path name.  Then
round-robin assign db/wal partitions to the drives.

As above, any leftover space (including SSDs assigned no db/wal
partitions) gets used for pure SSD OSDs.
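
As a rough sketch of that round-robin assignment (illustrative names
only, not the real ceph-volume code):

    from typing import Dict, List

    def assign_db_devices(hdd_paths: List[str],
                          ssd_bypath: Dict[str, str]) -> Dict[str, str]:
        """Map each HDD to the SSD that will hold its block.db.

        ssd_bypath maps an SSD's /dev name to its /dev/disk/by-path name,
        which gives a stable (if arbitrary) sort order.
        """
        ssds = sorted(ssd_bypath, key=lambda dev: ssd_bypath[dev])
        assignment = {}
        for i, hdd in enumerate(hdd_paths):
            assignment[hdd] = ssds[i % len(ssds)]   # round-robin
        # Any SSD that received no assignment is leftover and would be
        # provisioned as a plain SSD OSD instead.
        return assignment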

Again, when the user sees the preview of this they'll have a chance to
realise if they wanted to do something different.

> If we have multiple HDD types (10K/15K/7.2K RPM), how can we be sure they are used in the same 'auto' setup?

This is up to the orchestrator to decide before passing the list of
devices into ceph-volume -- the orchestrator would either put all HDDs
together in one ceph-volume invocation, or it would take
responsibility for subdividing them.  However, I believe in the vast
majority of cases people are not intentionally using a mixture of HDD
speeds with Ceph -- these days, fast storage means solid state, and
all HDDs are for bulk storage.

> If we have 1 SSD and 1 NVMe, which one is preferred?

This is up to the orchestrator: it should apply a policy before
passing the device sets into ceph-volume.

The sensible policy in this case would probably be to create two
groups for ceph-volume: the first one with the HDDs and the NVMe (i.e.
the NVMe gets sliced up as db/wal), and the second group contains just
the SSDs (they are used as individual OSDs).  This is a sane default,
and the UI/orchestrator would be responsible for providing the user
with any choices about deviating from that.
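
A minimal sketch of that grouping policy on the orchestrator side (the
function name and shape are hypothetical, not any orchestrator's API):

    from typing import List

    def group_for_ceph_volume(hdds: List[str], ssds: List[str],
                              nvmes: List[str]) -> List[List[str]]:
        """Split devices into per-invocation groups for `ceph-volume auto`."""
        groups = []
        if hdds:
            groups.append(hdds + nvmes)   # HDD data + NVMe block.db/wal
        if ssds:
            groups.append(ssds)           # standalone SSD OSDs
        return groups

    # group_for_ceph_volume(["/dev/sda", "/dev/sdb"], ["/dev/sdc"], ["/dev/nvme0n1"])
    # -> [['/dev/sda', '/dev/sdb', '/dev/nvme0n1'], ['/dev/sdc']]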

> What if there are some devices that should not be used by ceph-volume? Does that imply using the manual mode?

The orchestrator is responsible for this.  It can take input from the
end user about any devices to blacklist.

> When the user is providing a list of devices, do we agree that they have to be checked against the "rejection" rules to avoid using a wrong device?

Probably any rejection/blacklisting can happen before the list of
devices is passed into ceph-volume?  But I'm not sure what kind of
rejection rules you mean in particular.

John

>
> ----- Original Message -----
> From: "Alfredo Deza" <adeza@xxxxxxxxxx>
> To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Tuesday, June 19, 2018 21:35:02
> Subject: ceph-volume and automatic OSD provisioning
>
> One of the top questions for ceph-volume has been "why doesn't this create
> partitions like ceph-disk does?". Although we have initially focused on LVM,
> the same question applies (except for LVs instead of partitions). Now that
> ceph-volume is stabilizing, we can expand on a more user-friendly approach.
>
> We are planning on creating an interface to size devices automatically based on
> some simple criteria. There are three distinct use cases that we are going to
> support, ranging from easy OSD provisioning with defaults to more
> esoteric use cases with third-party systems (like rook, ceph-ansible, seasalt,
> etc...)
>
> This is being implemented as a separate sub-command to avoid piling up the
> complexity on the existing `lvm` one, and to reflect the automation behind it.
>
> Here are some examples on how the API is being designed, for fully automatic
> configuration, semi-automatic (allows input), and manual via a config
> management system:
>
> Automatic (no configuration or options required):
> -------------------------------------------------
>
> Single device type:
>
>     $ ceph-volume auto
>      Use --yes to run
>      Detected devices:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>
>      Expected Bluestore OSDs:
>
>       data: /dev/sda (100%)
>       data: /dev/sdb (100%)
>       data: /dev/sdc (100%)
>
> This scenario will detect a single type of unused device (rotational),
> so a bluestore OSD will be created on each without a block.db or
> block.wal.
>
>
> Mixed devices:
>
>     $ ceph-volume auto
>      Use --yes to run
>      Detected devices:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>        [solid     ] /dev/sdd    500GB
>
>      Expected Bluestore OSDs:
>
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>
> This scenario will detect the unused devices in the system, understand that
> there is a mix of solid-state and rotational devices, place the data (block) on
> the rotational ones, and split the SSD across as many rotational devices as
> were found (3 in this case).
>
>
> Semi configurable outcome, with input:
> --------------------------------------
> A user might not want to consume the devices that were automatically detected
> in the system as free, so the interface will allow passing these devices
> directly as input.
>
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc
>      Device information:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>
>      Expected Bluestore OSDs:
>
>       data: /dev/sda (100%)
>       data: /dev/sdb (100%)
>       data: /dev/sdc (100%)
>
>     Please hit Enter to continue, or Ctrl-C to cancel
>
> Similarly, for mixed devices:
>
>     $ ceph-volume auto /dev/sda /dev/sdb /dev/sdc /dev/sdd
>      Use --yes to run
>      Device information:
>        [rotational] /dev/sda    1TB
>        [rotational] /dev/sdb    1TB
>        [rotational] /dev/sdc    1TB
>        [solid     ] /dev/sdd    500GB
>
>      Expected Bluestore OSDs:
>
>       data: /dev/sda (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdb (100%), block.db: /dev/sdd (33%)
>       data: /dev/sdc (100%), block.db: /dev/sdd (33%)
>
>     Please hit Enter to continue, or Ctrl-C to cancel
>
>
> Fully Manual (config management systems):
> -----------------------------------------
> A JSON file or a blob as a positional argument would allow fine-tuning other
> specifics, like using 2 OSDs per NVMe device, or determining an exact size for
> a block.db or even a block.wal LV.
>
>     $ ceph-volume auto /etc/ceph/custom_osd_provisioning.json
>
> Or:
>
>     $ ceph-volume auto "{ ... }"
>
>
> Here the API is still undefined, but the idea is to expand on more
> complex setups that can be better managed by configuration management systems.


