Re: handling osd removal with ceph-volume?

On Fri, Oct 26, 2018 at 11:06:49AM -0400, Alfredo Deza wrote:
On Fri, Oct 26, 2018 at 11:00 AM Jan Fajerski <jfajerski@xxxxxxxx> wrote:

On Fri, Oct 26, 2018 at 08:06:34AM -0400, Alfredo Deza wrote:
>On Fri, Oct 26, 2018 at 7:11 AM John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> On Thu, Oct 25, 2018 at 11:08 PM Noah Watkins <nwatkins@xxxxxxxxxx> wrote:
>> >
>> > After speaking with Alfredo and the orchestrator team, it seems there
>> > are some open questions (well, maybe just questions whose answers need
>> > to be written down) about OSD removal with ceph-volume.
>> >
>> > Feel free to expand the scope of this thread to the many different
>> > destruction / deactivation scenarios, but we have been driven
>> > initially by the conversion of one ceph-ansible playbook that removes
>> > a specific OSD from the cluster that boils down to:
>> >
>> >   1. ceph-disk deactivate --deactivate-by-id ID --mark-out
>> >   2. ceph-disk destroy --destroy-by-id ID --zap
>> >   3. < manually destroy partitions from `ceph-disk list` >
>> >
>> > To accomplish the equivalent without ceph-disk we are doing the following:
>> >
>> >   1. ceph osd out ID
>> >   2. systemctl disable ceph-osd@ID
>> >   3. systemctl stop ceph-osd@ID
>> >   4. something equivalent to (runnable, assuming jq is available):
>> >     | ceph-volume lvm list --format json \
>> >     |   | jq -r --arg id "$ID" '.[$id][].path' \
>> >     |   | while read -r path; do ceph-volume lvm zap "$path"; done
>> >   5. ceph osd purge ID
>> >
>> > This list seems to be complete after examining ceph docs and
>> > ceph-volume itself. Is there anything missing? Similar questions here:
>> > http://tracker.ceph.com/issues/22287
>> >
>> > Of these steps, the primary question that has popped up is how to
>> > maintain, outside of ceph-volume, the inverse of the systemd unit
>> > management that ceph-volume takes care of during OSD creation (e.g.
>> > the ceph-osd and ceph-volume units), and whether that inverse
>> > operation should be part of ceph-volume itself.
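>> > (For concreteness: `lvm activate` enables a per-OSD instance of
>> > the ceph-volume systemd template, so I'd expect the inverse to be
>> > roughly
>> >   systemctl disable ceph-volume@lvm-ID-FSID
>> > with FSID being the OSD's fsid -- assuming I have the instance
>> > naming right.)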
>>
>> My suggestion would be to have a separation of the three aspects of
>> creating/destroying OSDs:
>>  A) The drive/volume manipulation part (ceph-volume)
>>  B) Enabling/disabling execution of the ceph-osd process (systemd,
>> containers, something else...)
>>  C) The updates to Ceph cluster maps (ceph osd purge, ceph osd destroy etc)
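>>
>> For example, on a plain systemd host the destroy path might map to
>> something like (flags assumed from the current CLIs):
>>  A) ceph-volume lvm zap --destroy /dev/sdX
>>  B) systemctl disable --now ceph-osd@ID
>>  C) ceph osd purge ID --yes-i-really-mean-it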
>>
>> The thing that ties all three together would live up at the ceph-mgr
>> layer, where a high level UI (the dashboard and new CLI bits) would
>> tie it all together.
>
>This proposed separation is at odds with what ceph-volume does today.
>All three happen when provisioning an OSD. Not having some counterpart
>for deactivation would cause confusion similar to today's: why does
>enabling happen in ceph-volume while disabling/deactivation does not?
>
>>
>> That isn't to exclude having functionality in ceph-volume where it's a
>> useful convenience (e.g. systemd), but in general ceph-volume can't be
>> expected to know how to start OSD services in e.g. Kubernetes
>> environments.
>
>The same could be said of provisioning. How does ceph-volume know how
>to provision an OSD in Kubernetes? It doesn't. What we do there is
>enable certain functionality that containers can make use of, for
>example doing all the activation but skipping the systemd enabling.
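>For example, containers can run
>  ceph-volume lvm activate --no-systemd ID FSID
>which does the full activation but skips the systemd enable/start.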
>
>There are a couple of reasons why 'deactivate' hasn't made it into
>ceph-volume. One of them is that it wasn't clear (to me) whether
>deactivation meant full removal/purging of the OSD, or leaving it in
>a state where it wouldn't start (e.g. disabling the systemd units).
>
>My guess is that there is a need for both, and for a few more use
>cases, like disabling the systemd unit so that the same OSD can be
>re-provisioned. So far we've concentrated on the creation of OSDs,
>surpassing ceph-disk's features, but I think we can start exploring
>the complexity of deactivation now.
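>To sketch it: that could mean a hypothetical `ceph-volume lvm
>deactivate ID` that unmounts things and disables the units but keeps
>the data, versus the existing `zap` path for full destruction.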
Yeah, that would be great. I was wondering about LVM management as it
might relate to this. AFAIU (and please correct me if I'm wrong), c-v
does some basic LVM management when a block device is passed as --data,
but to use an LV as a wal/db device it must be created beforehand.
Would it make sense to add a dedicated LVM management layer to c-v, or
was this ruled out long ago?
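For reference, today that pre-creation looks something like this (VG/LV
names made up):
  lvcreate -L 30G -n osd0-db vg-nvme
  ceph-volume lvm create --data /dev/sdb --block.db vg-nvme/osd0-db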

We have! It is now part of the `ceph-volume lvm batch` sub-command,
which will create everything for you given an input of devices.
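For example (device names made up), given two spinners and an NVMe
device:
  ceph-volume lvm batch /dev/sda /dev/sdb /dev/nvme0n1
will create the data LVs on the spinners and place the block.db LVs on
the NVMe device.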
Ahh lovely. Thanks, and sorry for being a bit slow :/
So once the Drive Group stuff is hammered down, it could feed into
batch for more flexibility, I guess.

http://docs.ceph.com/docs/master/ceph-volume/lvm/batch/

I think this could also have benefits for other operations regarding
LVs, like renaming and growing an LV (I believe Igor was looking into
growing a wal/db LV and then growing BlueFS after that).
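(E.g. grow the LV with `lvextend -L +20G vg-nvme/osd0-db` and then, if
I have the tool right, expand BlueFS with `ceph-bluestore-tool
bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-ID` while the OSD is
stopped.)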

Best,
Jan
>
>
>>
>> John
>>
>> > My understanding of the systemd process for Ceph is that the
>> > ceph-volume unit itself activates the corresponding OSD using the
>> > ceph-osd systemd template -- so there aren't any OSD-specific unit
>> > files to clean up when an OSD is removed. That still leaves the
>> > question of how to properly remove the ceph-volume units, if that
>> > is indeed the process that needs to occur. Glancing over the zap
>> > code, it doesn't look like zap handles that task. Related tracker
>> > here:
>> > http://tracker.ceph.com/issues/25029
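>> > (The enabled instances are at least discoverable by hand, e.g.
>> >   ls /etc/systemd/system/multi-user.target.wants/ceph-volume@*
>> > assuming the default target, so zap could plausibly remove them as
>> > part of tearing an OSD down.)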
>> >
>> > The Ceph docs seem to indicate only that the OSD needs to be
>> > stopped; presumably there are other final clean-up steps?
>> >
>> >
>> > - Noah
>


--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


