Dear Ceph team,
I have been watching Sage's five themes for Octopus, and I love the themes
and all of Sage's talk.
Sage's talk mentioned cluster usability.
On 'the orchestrate API' slide, Sage's slides mention a "Partial
consensus to focus efforts on":
(Option) Rook (which I don't know, but which depends on Kubernetes)
(Option) ssh (or maybe some RPC mechanism).
I was sad not to see the option
(Option) Support a common module for the most popular declarative
tools (puppet/chef/cfengine).
I think option (ssh) exists only because work has already been invested
in complex salt and ansible implementations, which never seem to get any
simpler. I propose we chalk that up to lessons learned, gain some wisdom
about why option (ssh) took much more effort than expected, and learn
from option (Rook).
I think option (Rook) is a very good idea, as it is built on sound ideas
that I have seen work before.
I understand that Ceph should not depend *only* on something as complex
as Kubernetes for deployment, even if it is the best solution. I may not
want to run something as complex as Kubernetes just to run Ceph.
I would have liked to see on the slides:
(Option) Look at how to get Rook's benefits without Kubernetes
The rest of this email explains how I think Ceph could best be
configured without Kubernetes, in a Ceph-like way.
I believe Rook's dependency, Kubernetes, provides an architecture based
on declarative configuration and shared service state, which makes
managing clusters easier. In other words, Kubernetes is like a
service-level version of Ceph's crushmap, which describes how data is
distributed in Ceph.
To implement this (names can be changed and are purely for illustration):
"orchestratemapfile" -> the desired deployment config file
"orchestratemap"     -> the orchestratemapfile compiled together with local state
"liborchestrate"     -> shares and executes the orchestratemap
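To make those three names concrete, here is a minimal sketch in Python of
what an orchestratemapfile might contain. Every field name below is
hypothetical and purely illustrative; it is not an existing Ceph or Rook
structure, only the rough shape of a declarative desired state.

    # Hypothetical orchestratemapfile: the desired deployment, purely
    # declarative, with no information about what is currently running.
    orchestratemapfile = {
        "ceph_release": "octopus",
        "mon": {"count": 3, "hosts": ["node1", "node2", "node3"]},
        "mgr": {"count": 2},
        "osd": {"hosts": ["node1", "node2", "node3"],
                "devices": "all-unused-disks"},
    }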
So that any Ceph developer can understand it: just as the crushmap is
declarative and drives data placement, the "orchestratemap" should be
declarative and drive the deployment. The crushmap is shared state
across the cluster; the orchestratemap would likewise be shared state
across the cluster. A crushmap is compiled from a crushmapfile plus
state about the cluster; an orchestratemap would be compiled from an
orchestratemapfile plus state about the cluster.
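Continuing the sketch, and again with purely hypothetical names, the
compile step could be as small as combining the desired state with
whatever is known about the cluster at compile time:

    # Hypothetical compile step: orchestratemapfile + known cluster state
    # -> orchestratemap, analogous to crushmapfile + cluster state -> crushmap.
    def compile_orchestratemap(orchestratemapfile, cluster_state):
        return {
            "desired": orchestratemapfile,   # what the operator asked for
            "observed": cluster_state,       # e.g. which daemons each host runs now
        }

The sketches below assume an orchestratemap of that {"desired",
"observed"} shape.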
Just as librados can read a crushmap and speak to a mon to get cluster
status, and drive data flow, liborchestrate could read an orchestratemap
and drive the stages of Ceph deployment. An MVP* would function with
only minor degradation even without shared cluster state (i.e. no
orchestratemap).
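As an illustration of that behaviour (all names hypothetical, with
run_stage() standing in for the code that would actually start, stop or
upgrade daemons), liborchestrate could simply walk the deployment stages
in order, reading the orchestratemap the same way librados reads the
crushmap:

    # Hypothetical liborchestrate driver: read an orchestratemap (shared or
    # locally compiled) and walk the deployment stages in a fixed order.
    STAGES = ["bootstrap-mons", "deploy-mgrs", "deploy-osds", "deploy-gateways"]

    def run_stage(stage, orchestratemap):
        # Placeholder for the code that reconciles this stage of the deployment.
        print("stage %s: reconcile towards %s" % (stage, orchestratemap["desired"]))

    def orchestrate(orchestratemap):
        for stage in STAGES:
            run_stage(stage, orchestratemap)

    # Example: drive the stages from a trivially compiled orchestratemap.
    orchestrate({"desired": {"mon": {"count": 3}}, "observed": {}})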
A good starting point for the orchestratemapfile would be the Kubernetes
config for Rook, as this is essentially a desired state for the cluster.
If you add the current local state into the orchestratemap when
compiling the orchestratemapfile, every desired operation can be
calculated by each node independently, using just the orchestratemap and
its current local state. The operations that must be delayed because
they depend on other operations can also be calculated for each node.
This avoids retries and timeouts, immediately reduces the error handling
needed, and potentially allows Ceph to:
 - save the user from having to know that more than one daemon is
   running to provide Ceph,
 - do staged upgrades,
 - practice self-healing at the service level,
 - guide the user's deployment with more helpful error messages,
 - and many other potential enhancements.
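To make that calculation concrete, here is a sketch of a per-node
planner, using the same hypothetical {"desired", "observed"}
orchestratemap shape as above. The dependency rule (no OSD creation
before a mon quorum) is only an example of the kind of rule I mean, not
a complete list:

    # Hypothetical per-node planner: each node computes, independently, the
    # operations it can run now and the operations blocked on a dependency
    # elsewhere in the cluster.
    def plan_node(hostname, orchestratemap, local_state):
        desired = orchestratemap["desired"]
        observed = orchestratemap["observed"]
        runnable, blocked = [], []

        # This host should run a mon but is not running one yet.
        if hostname in desired["mon"]["hosts"] and local_state.get("mon", 0) == 0:
            runnable.append(("start", "mon"))

        # This host should provide OSDs but has none yet; creating them must
        # wait until enough mons exist to form a quorum.
        if hostname in desired["osd"]["hosts"] and local_state.get("osd", 0) == 0:
            mons_up = sum(host.get("mon", 0) for host in observed.values())
            op = ("create", "osd", desired["osd"]["devices"])
            if mons_up > desired["mon"]["count"] // 2:
                runnable.append(op)
            else:
                blocked.append(op)

        return runnable, blocked

    # Example: node2 runs nothing yet, and only one of the desired three mons
    # is up elsewhere, so starting a mon is runnable but creating OSDs is not.
    runnable, blocked = plan_node(
        "node2",
        {"desired": {"mon": {"count": 3, "hosts": ["node1", "node2", "node3"]},
                     "osd": {"hosts": ["node2"], "devices": "all-unused-disks"}},
         "observed": {"node1": {"mon": 1}}},
        {},
    )
    # runnable == [("start", "mon")]
    # blocked  == [("create", "osd", "all-unused-disks")]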
It may be argued that option (ssh) is simpler than implementing an
"orchestratemap" and a liborchestrate that reads it. I agree that option
(ssh) is simpler for a test-grade MVP, but for a production-grade
solution I suspect that implementing an "orchestratemap" and
liborchestrate is simpler, because it gives simpler synchronization,
planning and error handling for the management of Ceph, just as the
crushmap simplifies synchronization, planning and error handling for
data in Ceph.
Good luck and have fun,
Owen Synge
* I once nearly finished an orchestratemapfile-to-Ceph-configuration
implementation (with no shared cluster state), and the bulk of the work
was understanding how each Ceph daemon interacts with the cluster during
boot, and the commands to manage each daemon. Only the state
serialization, comparison and propagation were never completed.