Re: Keynote: What's Planned for Ceph Octopus - Sage Weil -> Feedback on Ceph's Usability

Hey Owen,

thanks for your detailed feedback!

On 25.05.19 at 04:13, Owen Synge wrote:
> Dear Ceph team,
> 
> I have been watching Sage's 5 themes for Octopus, and I love the themes
> and all of Sage's talk.
> 
> Sage's talk mentioned cluster usability.
> 
> On 'the orchestrate API' slide, Sage's slides talk about a "Partial
> consensus to focus efforts on":
> 
> (Option) Rook (which I don't know, but depends on Kubernetes)
> 
> (Option) ssh (or maybe some rpc mechanism).
> 
> I was sad not to see the option
> 
> (Option) Support a common declarative module for the most popular
> configuration management tools (puppet/chef/cfengine).

It also depends on the amount of upstream contributions we're getting.

> 
> I think option (ssh) exists only because work has been invested in
> complex Salt and Ansible implementations that never seem to reduce
> in complexity. I propose we chalk it up to mistakes we made, gain
> some wisdom about why option (ssh) took much more effort than expected,
> and learn from option (Rook).
> 
> I think option (Rook) is a very good idea, as it works on sound ideas I
> have seen work before.
> 
> I understand that Ceph should not depend *only* on something as complex
> as Kubernetes as a deployment dependency, even if it is the best
> solution. I may not want to run something as complex as Kubernetes just
> to run Ceph.

Yep, that's one idea behind the SSH orchestrator. We have ceph-deploy
prominently advertised in the documentation, because we don't have a
replacement for this use case yet.

> 
> I would have liked to see on the slides:
> 
> (Option) Look how to get Rook's benefits without Kubernetes

The reality is: every external orchestrator (like Rook, DeepSea,
ceph-ansible) has its very own idea of how things should be defined:

Rook uses a bunch of CustomResources that define the desired state of
the cluster.

DeepSea uses the policy.cfg to define the desired state.

The SSH orchestrator uses the module's persistent key-value store to
remember the list of managed hosts.
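
To sketch what that looks like (simplified and hypothetical -- the key
name and JSON layout below are invented for illustration, not the actual
SSH orchestrator code), an MGR module can keep such a list in its
persistent key-value store:

import json

from mgr_module import MgrModule


class HostInventory(MgrModule):
    # Hypothetical key name; the real module stores its inventory
    # under different keys.
    KEY = 'ssh_orchestrator/managed_hosts'

    def add_host(self, hostname):
        hosts = set(json.loads(self.get_store(self.KEY) or '[]'))
        hosts.add(hostname)
        # set_store() persists the value server-side, so the list
        # survives MGR restarts and failover.
        self.set_store(self.KEY, json.dumps(sorted(hosts)))

    def list_hosts(self):
        return json.loads(self.get_store(self.KEY) or '[]')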

There is simply no need to invent a new source of truth within the MGR.

Secondly, I simply don't want to maintain a translation function between
the orchestratemap and every external orchestrator's configuration.
Maintaining a set of state changes (orchestrator.py) is enough.

Thus, I'd stick with the source of truth we currently already have.
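
To make "a set of state changes" concrete, the shape of the interface is
roughly this (illustrative only; the real method names live in
orchestrator.py and differ):

from abc import ABC, abstractmethod


class OrchestratorBackend(ABC):
    """Each backend implements a small set of imperative operations;
    the MGR keeps no separate declarative map of its own."""

    @abstractmethod
    def add_host(self, hostname):
        """Start managing a host (ssh) / rely on node labels (Rook)."""

    @abstractmethod
    def create_osds(self, drive_group):
        """Turn the given drives into OSDs."""

    @abstractmethod
    def describe_services(self):
        """Report the services the backend currently knows about."""

The Rook orchestrator maps these operations onto CustomResources,
DeepSea onto Salt, and the SSH orchestrator onto remote commands.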

> 
> I believe Rook's dependency Kubernetes, provides an architecture based
> on a declarative configuration and shared service state makes managing
> clusters easier.

Yep. Rook's CustomResources like CephCluster are a great way to maintain
the state of the cluster.
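
For readers who have not seen one, here is an abbreviated CephCluster
resource, written as the Python dict you would hand to the Kubernetes
API (the concrete values are just examples):

# Abbreviated CephCluster custom resource expressed as a Python dict.
# Image tag, mon count etc. are example values only.
ceph_cluster = {
    'apiVersion': 'ceph.rook.io/v1',
    'kind': 'CephCluster',
    'metadata': {'name': 'rook-ceph', 'namespace': 'rook-ceph'},
    'spec': {
        'cephVersion': {'image': 'ceph/ceph:v14.2'},
        'dataDirHostPath': '/var/lib/rook',
        'mon': {'count': 3, 'allowMultiplePerNode': False},
    },
}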

> In other words, Kubernetes is like a service-level version of
> Ceph's crushmap, which describes how data is distributed in Ceph.
> 
> To implement (names can be changed and are purely for illustration):
> 
> 'orchestratemapfile' -> desired deployment config file
>    'orchestratemap' -> the orchestratemapfile compiled with local state
>    'liborchestrate' -> shares and executes the orchestratemap
> 
> So any Ceph developer can understand: just like the crushmap is
> declarative and drives data, the "orchestratemap" should be declarative
> and drive the deployment. The crushmap is shared state across the
> cluster; the orchestratemap would be shared state across the cluster.
> A crushmap is a compiled crushmapfile with state about the cluster; an
> orchestratemap is compiled from an orchestratemapfile with state about
> the cluster.
> 
> Just like librados can read a crushmap and speak to a mon to get cluster
> status, and drive data flow, liborchestrate

Yes, exactly! But instead of liborchestrate, we have a set of commands
defined to read and write the state of the Ceph cluster, e.g.

ceph orchestrator service ls
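
Under the hood, that is just an MGR module exposing a command and
forwarding it to the active backend; schematically (this is not the
actual orchestrator_cli code, and _backend() is a made-up helper):

import errno

from mgr_module import MgrModule


class OrchestratorCli(MgrModule):
    COMMANDS = [
        {
            'cmd': 'orchestrator service ls',
            'desc': 'List services known to the orchestrator backend',
            'perm': 'r',
        },
    ]

    def handle_command(self, inbuf, command):
        if command['prefix'] == 'orchestrator service ls':
            # _backend() is hypothetical: it would return whichever
            # orchestrator module (rook, ssh, deepsea) is enabled.
            services = self._backend().describe_services()
            return 0, '\n'.join(str(s) for s in services), ''
        return -errno.EINVAL, '', 'unknown command'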


> can read an orchestratemap
> and drive the stages of Ceph deployment. An MVP* would function with
> minor degradation even without shared cluster state (i.e. no
> orchestratemap).

Do you have the code available? Would be great to have a look at it.
Which operations did you define? Which parameters? Which data structures?
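
To make the question concrete: is it roughly something like this that
you have in mind? (Purely my guess at the shape; names and fields below
are invented.)

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OrchestrateMapFile:
    """Desired deployment written by the admin (like a crushmapfile)."""
    mons: List[str] = field(default_factory=list)   # hosts that should run a mon
    mgrs: List[str] = field(default_factory=list)
    osd_drives: Dict[str, List[str]] = field(default_factory=dict)  # host -> devices


@dataclass
class OrchestrateMap:
    """The map file compiled together with observed cluster state."""
    desired: OrchestrateMapFile
    running: Dict[str, List[str]]   # host -> daemons currently running there

    def pending_operations(self, host):
        """Operations a host can derive on its own from the shared map."""
        ops = []
        if host in self.desired.mons and 'mon' not in self.running.get(host, []):
            ops.append(('start_mon', host))
        for device in self.desired.osd_drives.get(host, []):
            ops.append(('create_osd', host, device))
        return ops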

> 
> A good starting point for the orchestratemapfile would be the Kubernetes
> config for Rook, as this is essentially a desired state for the cluster.
> 
> If you add the current state locally into the orchestratemap when
> compiling the orchestratemapfile, all possible desired operations can be
> calculated by each node independently, using just the orchestratemap and
> the current local state. All the operations that must be delayed due
> to dependencies on other operations can also be calculated for each
> node. This avoids retries and timeouts, instantly reduces error handling,
> and allows Ceph to potentially save the user from knowing that more
> than one daemon is running to provide Ceph, do staged upgrades, practice
> self healing at the service level, guide the user's deployment with more
> helpful error messages, and many other potential enhancements.

The Rook orchestrator is indeed much simpler, as it just needs to update
the CustomResources in Kubernetes. The rest is done by K8s and the Rook
operator (delayed operations, retries, timeouts, error handling).
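
In code, the core of such a backend boils down to patching the custom
resource and letting Kubernetes plus the Rook operator reconcile towards
it (a sketch only; the real module does more validation, and the
namespace/name values are just the Rook defaults):

from kubernetes import client, config


def set_mon_count(count):
    # Assumes we run inside the cluster (e.g. in the MGR pod);
    # otherwise use config.load_kube_config().
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object(
        group='ceph.rook.io',
        version='v1',
        namespace='rook-ceph',
        plural='cephclusters',
        name='rook-ceph',
        body={'spec': {'mon': {'count': count}}},
    )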

> 
> It may be argued that option (ssh) is simpler than implementing an
> "orchestratemap" and a liborchestrate that reads it. I argue option
> (ssh) is simpler for a test-grade MVP, but for a production-grade
> solution I suspect implementing an "orchestratemap" and liborchestrate
> is simpler, due to simpler synchronization, planning and error handling
> for management of Ceph, just like the crushmap simplifies
> synchronization, planning and error handling for data in Ceph.

The idea of the SSH orchestrator is to be simpler than Rook +
Kubernetes: meaning we should not re-implement Kubernetes and Rook
within the SSH orchestrator.

> 
> Good luck and have fun,

Thanks again for your ideas!

Best,
Sebastian

> 
> Owen Synge
> 
> 
> * I once nearly finished an orchestratemapfile-to-Ceph-configuration
> tool (no shared cluster state), and the bulk of the work was
> understanding how each Ceph daemon interacts with the cluster during
> boot, and the commands to manage the daemon. Only the state
> serialization, comparison and propagation were never completed.
> 
> 

-- 
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)


