Re: ceph-volume and automatic OSD provisioning

On Thu, Jun 21, 2018 at 10:42 AM, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> A couple thoughts wrt what I've seen so far:
>
> o Would this require that the metadata devices be empty?

Yes, all devices to be considered must be empty/clean/unused.
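If a device has been used before it needs to be wiped first; something
like the following would do it (the device path here is just an
example):

  # wipe a previously used device so it shows up as clean/unused;
  # --destroy also removes any partitions or LVM metadata on it
  ceph-volume lvm zap --destroy /dev/sdb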

>
> o If an OSD drive bites the dust, how does one identify the metadata device/partition it was using so that it can be wiped, re-used, etc?

`ceph-volume lvm list` will give you all the output you need, per OSD,
to identify which LV or partition is part of it.
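
For example (the device path is only illustrative):

  # report every OSD ceph-volume knows about, grouped per OSD,
  # including the data/db/wal devices backing it
  ceph-volume lvm list

  # restrict the report to a single device
  ceph-volume lvm list /dev/sdb

  # machine-readable output for scripting
  ceph-volume lvm list --format json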

>
> o How does this fit into the ongoing OSD lifecycle?  Ie., when an OSD dies, is removed completely, and is redeployed, does the code reuse the same metadata partition(s), or does it attempt to create new on an available device?  If the latter, it's going to run out sooner or later.

This wouldn't be covered as a use case. One would have to create the
LV(s) and use `ceph-volume lvm create [...]`, or go with a higher-level
tool.
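
As a rough sketch of doing it by hand (the VG/LV names and sizes are
made up for illustration):

  # carve a db LV out of the shared NVMe volume group
  lvcreate -L 30G -n db-osd0 nvme-vg

  # create the OSD on a fresh data device, reusing that LV
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db nvme-vg/db-osd0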

>
> o The above, but with a *destroyed* OSD? Or if an OSD is repaved for whatever reason -- differing parameters, Filestore <--> Bluestore <--> whatever?  What happens if one changes the size of metadata partition required after initial deployment?
>

Unsupported

> o I've been curious how people so far have managed the OSD:journal/metadata partition mapping.  In the past we had a wrapper around ceph-deploy with a rigid mapping of OSD drive slot to partition number.  It required the single NVMe device to be pre-partitioned and was kind of ugly and error-prone.  The drive slot was used instead of the sdX name, given the Linux kernel's fondness for changing the mapping as a result of various drive failure / replacement scenarios.

ceph-volume doesn't have any of these issues because it relies on LVM
and/or blkid with PARTUUID, so devices are identified correctly
regardless of name changes.
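
For instance, the identifiers it stores survive any /dev/sdX
reshuffling (the paths shown are illustrative):

  # partitions are tracked by PARTUUID rather than by kernel name
  blkid -o value -s PARTUUID /dev/sdb1

  # LVs carry ceph-volume's own tags (osd id, osd fsid, and so on)
  lvs -o lv_name,lv_tags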

>
> o Some sites with multiple HBAs, NICs, metadata devices etc. go to great lengths to pin resources on common CPU cores, PCIe slots, and especially NUMA nodes; chances are good that such a deployment couldn't use this.
>
> I totally understand and support the idea of auto-selecting a metadata device/partition (managing them can be a bear), but I humbly submit that attention needs to be paid to the needs of OSD lifecycle events and the various dynamics that can happen to a production cluster over the years.
>
> Notably it would be really really nice to have the ability to configure mapping rules, or even a simple hardcoded EID:SLOT -> device/partition # mapping.
>
> Apologies if any of these were already covered or are out of scope.
>
> -- Anthony