On Thu, Jun 21, 2018 at 3:42 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> A couple of thoughts wrt what I've seen so far:
>
> o Would this require that the metadata devices be empty?
>
> o If an OSD drive bites the dust, how does one identify the metadata device/partition it was using so that it can be wiped, re-used, etc.?
>
> o How does this fit into the ongoing OSD lifecycle? I.e., when an OSD dies, is removed completely, and is redeployed, does the code reuse the same metadata partition(s), or does it attempt to create new ones on an available device? If the latter, it's going to run out sooner or later.
>
> o The above, but with a *destroyed* OSD? Or if an OSD is repaved for whatever reason -- differing parameters, Filestore <--> Bluestore <--> whatever? What happens if one changes the size of the metadata partition required after initial deployment?

My thought on the OSD lifecycle stuff is that it belongs at higher levels (ceph-volume is not the whole story). At some higher level we would have a persistent record of which devices had previously been used as OSDs, in order to recreate them on failure. The orchestrator (rook, ceph-ansible, deepsea) would contain an opinionated policy about how to treat the configuration through replacements: whether selecting a device means literally just that device, or that device plus any empty replacement that subsequently shows up in the same slot.

I think we need to have a tight scope around what ceph-volume's device selection does: it's there to pick a default (something reasonable but not necessarily optimal), to work okay on most systems (but not necessarily all), and to make a device selection at installation time (not to handle the OSD lifecycle overall).

John

> o I've been curious how people so far have managed the OSD:journal/metadata partition mapping. In the past we had a wrapper around ceph-deploy with a rigid mapping of OSD drive slot to partition number. It required the single NVMe device to be pre-partitioned and was kind of ugly and error-prone. The drive slot was used instead of the sdX name given the Linux kernel's fondness for changing that mapping as a result of various drive failure / replacement scenarios.
>
> o Some sites with multiple HBAs, NICs, metadata devices, etc. go to great lengths to pin resources to common CPU cores, PCIe slots, and especially NUMA nodes; chances are good that such a deployment couldn't use this.
>
> I totally understand and support the idea of auto-selecting a metadata device/partition -- managing them can be a bear -- but I humbly submit that attention needs to be paid to the needs of OSD lifecycle events and the various dynamics that can happen to a production cluster over the years.
>
> Notably, it would be really, really nice to have the ability to configure mapping rules, or even a simple hardcoded EID:SLOT -> device/partition # mapping.
>
> Apologies if any of these were already covered or are out of scope.
>
> -- Anthony
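
[Editorial note: to make the hardcoded EID:SLOT -> partition mapping Anthony describes concrete, a minimal sketch follows. Everything in it is hypothetical -- the dictionary contents, the by-path glob, and the helper name are invented for illustration and are not part of ceph-volume or any orchestrator.]

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a hardcoded EID:SLOT -> metadata-partition map.

Illustrative only: partition paths and the by-path pattern are made up;
a real deployment would adjust both to its own hardware.
"""

import glob
import os

# Each enclosure:slot pair is pinned to a fixed partition on the shared
# NVMe metadata device, so a drive replaced in the same physical slot
# reuses the same metadata partition.
SLOT_TO_METADATA_PARTITION = {
    "5:0": "/dev/nvme0n1p1",
    "5:1": "/dev/nvme0n1p2",
    "5:2": "/dev/nvme0n1p3",
}


def data_device_for_slot(eid_slot):
    """Resolve the block device currently sitting in a given enclosure slot.

    Uses /dev/disk/by-path rather than the sdX name so the mapping survives
    kernel renames after drive failure/replacement. The exact by-path layout
    is platform-dependent; this glob is only an example.
    """
    phy = eid_slot.split(":")[1]
    matches = glob.glob("/dev/disk/by-path/*-sas-exp*-phy{}-lun-0".format(phy))
    return os.path.realpath(matches[0]) if matches else None


if __name__ == "__main__":
    for slot, md_part in SLOT_TO_METADATA_PARTITION.items():
        print("slot {}: data={} metadata={}".format(
            slot, data_device_for_slot(slot), md_part))
```

Whether this kind of pinning lives in ceph-volume or in the orchestrator is exactly the scoping question John raises above; the sketch only shows how small the mapping itself can be.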