What I have in mind is the case where the command is already in the shell history. A wrong history reference can execute a command with "--yes-i-really-mean-it" even though you really don't mean it. Been there. For an OSD this is maybe tolerable, but for an entire cluster ... not really.

Some things need to be hard, to limit the blast radius of a typo (or an attacker). For example, when issuing such a command the first time, the cluster could print a nonce that needs to be included in the command to make it happen, and which is valid only once and only for this exact command, so one actually needs to type something new every time to destroy stuff. An exception could be if a "safe-to-destroy" query for the daemon (pool etc.) in question returns true.

I would still not allow an entire cluster to be wiped with a single command. In a single step, only allow destroying what could be recovered in some way (there has to be some form of undo). And there should be notifications to all admins about what is going on, to be able to catch malicious execution of destructive commands.
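Just to make the idea concrete, here is a rough sketch of such a one-time confirmation token. Everything in it is invented for illustration (the guard class, the token flow, the "--confirm-token" flag); nothing like this exists in the Ceph CLI today:

# Rough sketch only: a one-time confirmation token for destructive commands.
# The class, the flow and the "--confirm-token" flag are made up for
# illustration; this is not part of the real Ceph CLI.
import hashlib
import hmac
import os
import time

class DestructiveCommandGuard:
    TTL = 120  # seconds the token stays valid

    def __init__(self):
        self._pending = {}  # exact command string -> (token, issued_at)

    def request(self, command: str) -> str:
        """First invocation: refuse to run, hand back a token bound to this exact command."""
        token = hashlib.sha256(os.urandom(16)).hexdigest()[:12]
        self._pending[command] = (token, time.time())
        return token

    def confirm(self, command: str, token: str) -> bool:
        """Second invocation: run only if the token matches, is fresh, and has not been used."""
        entry = self._pending.pop(command, None)   # single use: always consumed
        if entry is None:
            return False
        expected, issued_at = entry
        if time.time() - issued_at > self.TTL:
            return False
        return hmac.compare_digest(expected, token)

# What an admin would see:
guard = DestructiveCommandGuard()
cmd = "osd destroy 12"
token = guard.request(cmd)            # cluster replies: "re-run with --confirm-token <token>"
print(guard.confirm(cmd, token))      # True  -> the command proceeds
print(guard.confirm(cmd, token))      # False -> the token was already consumed

The important property is that the token is single-use and bound to the exact command string, so pulling an old command out of the shell history can never be enough on its own.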
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Nico Schottelius <nico.schottelius@xxxxxxxxxxx>
Sent: Tuesday, May 30, 2023 10:51 AM
To: Frank Schilder
Cc: Nico Schottelius; Redouane Kachach; ceph-users@xxxxxxx
Subject: Re: Re: Seeking feedback on Improving cephadm bootstrap process

Hey Frank,

in regards to destroying a cluster, I'd suggest reusing the old
--yes-i-really-mean-it parameter, as it is already in use by ceph osd
destroy [0]. Then it doesn't matter whether it's prod or not, if you
really mean it ... ;-)

Best regards,

Nico

[0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Frank Schilder <frans@xxxxxx> writes:

> Hi, I would like to second Nico's comment. What happened to the idea that a deployment tool should be idempotent? The most natural option would be:
>
> 1) start install -> something fails
> 2) fix the problem
> 3) repeat the exact same deploy command -> the deployment picks up at the current state (including cleaning up failed state markers) and tries to continue until the next issue (go to 2)
>
> I'm not sure (meaning: it's a terrible idea) whether it's a good idea to
> provide a single command to wipe a cluster. Just because of the fat-finger
> syndrome. This seems safe only if it were possible to mark a
> cluster as production somehow (it must be sticky, that is, it cannot be
> unset), which prevents a cluster-destroy command (or any too-dangerous
> command) from executing. I understand the test case in the tracker,
> but having such test-case utils that can run on a production cluster
> and destroy everything seems a bit dangerous.
>
> I think destroying a cluster should be a manual and tedious process,
> and figuring out how to do it should be part of the learning
> experience. So my answer to "how do I start over" would be "go figure
> it out, it's an important lesson".
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nico Schottelius <nico.schottelius@xxxxxxxxxxx>
> Sent: Friday, May 26, 2023 10:40 PM
> To: Redouane Kachach
> Cc: ceph-users@xxxxxxx
> Subject: Re: Seeking feedback on Improving cephadm bootstrap process
>
> Hello Redouane,
>
> much appreciated kick-off for improving cephadm. I was wondering why
> cephadm does not use a similar approach to rook, in the sense of "repeat
> until it is fixed"?
>
> For background, rook uses a controller that checks the state of the
> cluster, the state of the monitors, whether there are disks to be added,
> etc. It periodically re-runs the checks and, when needed, shifts
> monitors, creates OSDs, etc.
>
> My question is: why not have a daemon or checker subcommand of cephadm
> that a) checks what the current cluster status is (i.e. cephadm
> verify-cluster) and b) fixes the situation (i.e. cephadm verify-and-fix-cluster)?
>
> I think that option would be much more beneficial than the other two
> suggested ones.
>
> Best regards,
>
> Nico

--
Sustainable and modern Infrastructures by ungleich.ch
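To make the "repeat until it is fixed" idea from Nico's earlier mail concrete, here is a minimal sketch of a reconcile loop behind hypothetical cephadm verify-cluster / verify-and-fix-cluster subcommands. The names mirror the proposal above, but neither subcommand exists today, and the checks are placeholders:

# Hypothetical sketch of a "repeat until it is fixed" reconcile loop, in the
# spirit of rook's controller. The function names mirror the proposed cephadm
# subcommands but are invented here; the checks are placeholders.
import time
from typing import Callable, List, Tuple

# Each entry pairs a name, a read-only check (True = healthy) and an idempotent fix.
Check = Tuple[str, Callable[[], bool], Callable[[], None]]

def verify_cluster(checks: List[Check]) -> List[str]:
    """Read-only pass: report which checks currently fail ("cephadm verify-cluster")."""
    return [name for name, healthy, _fix in checks if not healthy()]

def verify_and_fix_cluster(checks: List[Check],
                           poll_interval: float = 60.0,
                           max_rounds: int = 10) -> bool:
    """Keep applying fixes until every check passes ("cephadm verify-and-fix-cluster")."""
    for _ in range(max_rounds):
        failing = verify_cluster(checks)
        if not failing:
            return True                # converged: cluster matches the desired state
        for name, _healthy, fix in checks:
            if name in failing:
                fix()                  # fixes are idempotent, so re-running is safe
        time.sleep(poll_interval)
    return False                       # still failing: surface the problem to the operator

# Toy usage: a "mon count" check whose fix brings the count back up to the target.
if __name__ == "__main__":
    state = {"mons": 1}
    checks: List[Check] = [
        ("enough-mons",
         lambda: state["mons"] >= 3,
         lambda: state.update(mons=state["mons"] + 1)),
    ]
    print(verify_and_fix_cluster(checks, poll_interval=0.1))   # True after a few rounds

The key design point is that each fix has to be idempotent, which is also what would make re-running the same bootstrap command after a failure safe.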