Hey Frank,

regarding destroying a cluster, I'd suggest reusing the old
--yes-i-really-mean-it parameter, as it is already in use by
ceph osd destroy [0]. Then it doesn't matter whether it's prod or not,
if you really mean it ... ;-)

Best regards,

Nico

[0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Frank Schilder <frans@xxxxxx> writes:

> Hi, I would like to second Nico's comment. What happened to the idea
> that a deployment tool should be idempotent? The most natural option
> would be:
>
> 1) start install -> something fails
> 2) fix the problem
> 3) repeat the exact same deploy command -> deployment picks up at the
>    current state (including cleaning up failed state markers) and
>    tries to continue until the next issue (go to 2)
>
> I'm not sure (meaning: it's a terrible idea) whether it's a good idea
> to provide a single command to wipe a cluster, if only because of
> fat-finger syndrome. This seems safe only if it were possible to mark
> a cluster as production somehow (the flag must be sticky, that is, it
> cannot be unset), which would prevent a cluster-destroy command (or
> any overly dangerous command) from executing. I understand the test
> case in the tracker, but having such test-case utilities that can run
> on a production cluster and destroy everything seems a bit dangerous.
>
> I think destroying a cluster should be a manual and tedious process,
> and figuring out how to do it should be part of the learning
> experience. So my answer to "how do I start over" would be "go figure
> it out, it's an important lesson".
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nico Schottelius <nico.schottelius@xxxxxxxxxxx>
> Sent: Friday, May 26, 2023 10:40 PM
> To: Redouane Kachach
> Cc: ceph-users@xxxxxxx
> Subject: Re: Seeking feedback on Improving cephadm bootstrap process
>
>
> Hello Redouane,
>
> much appreciated kick-off for improving cephadm. I was wondering why
> cephadm does not use an approach similar to rook's, in the sense of
> "repeat until it is fixed"?
>
> For background, rook uses a controller that checks the state of the
> cluster, the state of the monitors, whether there are disks to be
> added, etc. It periodically re-runs the checks and, when needed,
> shifts monitors, creates OSDs, and so on.
>
> My question is: why not have a daemon or checker subcommand of
> cephadm that a) checks what the current cluster status is (i.e.
> cephadm verify-cluster) and b) fixes the situation (i.e. cephadm
> verify-and-fix-cluster)?
>
> I think that option would be much more beneficial than the other two
> suggested ones.
>
> Best regards,
>
> Nico

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
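
A minimal sketch of the "repeat until it is fixed" controller pattern
Nico describes, assuming hypothetical check_*/fix_* helpers and the
proposed verify-cluster / verify-and-fix-cluster names; this is not
real cephadm or rook code, only an illustration of the idea:

    #!/usr/bin/env python3
    # Illustration only: a rook-style reconcile loop as it might look
    # behind a hypothetical `cephadm verify-and-fix-cluster`.
    import time

    def check_monitors():
        """Placeholder: return a list of problems found with the monitors."""
        return []            # e.g. ["mon.b is down"]

    def check_osds():
        """Placeholder: return a list of missing or failed OSDs."""
        return []            # e.g. ["/dev/sdc on host3 has no OSD"]

    def fix(problem):
        """Placeholder: try to repair a single reported problem."""
        print("repairing:", problem)

    def reconcile_once():
        """One pass: gather problems, fix them, return how many were found.
        Roughly what a `cephadm verify-cluster` check could report."""
        problems = check_monitors() + check_osds()
        for p in problems:
            fix(p)
        return len(problems)

    def reconcile_loop(interval=60):
        """Keep re-checking until the cluster converges to the desired
        state; a daemon or `verify-and-fix-cluster` subcommand would run
        something like this."""
        while reconcile_once() > 0:
            time.sleep(interval)

    if __name__ == "__main__":
        reconcile_loop()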