Re: [ceph-users] Re: RFC: (deep-)scrub manager module

On 7/26/22 10:23, Frank Schilder wrote:
Hi all,

I would also like to chip in. There seem to be two approaches here: a centralised deep-scrub manager module or a distributed algorithm. My main point, at a general level, is that:

     Whenever I read "manager module/daemon", what I understand is "does not scale".

For a scale-out system like ceph, the basic idea of having a central instance orchestrating globally does not scale.
In conclusion, such algorithms/orchestrators should never be implemented.

That does not need to be true. An (IMHO) interesting paper on this is Meta's "Owl", see [1] for the pdf. There they combine a central control plane with a distributed data plane. This way, system-wide (cluster-wide) decisions can be made while the data plane remains distributed. In Ceph the manager could fulfill the same role: make decisions based on a holistic (cluster-wide) view and delegate tasks where it makes sense. Ideally the manager could distribute tasks across all available managers so as to scale better and make better use of available resources.
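
To make that split concrete, here is a minimal sketch in plain Python (this is not the actual mgr module API; OsdInfo, plan_scrub_budgets and pick_pgs_to_scrub are made-up names for illustration). The control plane only decides how much each OSD may deep-scrub; each OSD decides locally which PGs to scrub:

from dataclasses import dataclass

@dataclass
class OsdInfo:
    osd_id: int
    pg_count: int       # PGs for which this OSD is primary
    utilization: float  # recent client load, 0.0 .. 1.0

def plan_scrub_budgets(osds, cluster_scrub_slots):
    """Control plane: decide how MUCH each OSD may deep-scrub, cluster-wide.
    Picking WHICH PG to scrub, and when, stays on the OSDs (the data plane)."""
    # Favour lightly loaded OSDs so scrubbing does not pile onto hot spots.
    weights = {o.osd_id: (1.0 - o.utilization) * o.pg_count for o in osds}
    total = sum(weights.values()) or 1.0
    return {osd_id: max(1, int(cluster_scrub_slots * w / total))
            for osd_id, w in weights.items()}

def pick_pgs_to_scrub(local_pgs, budget):
    """Data plane (runs on each OSD): spend the budget on the PGs whose
    last deep scrub is oldest."""
    return sorted(local_pgs, key=lambda pg: pg["last_deep_scrub"])[:budget]

The point of the split is that the decision which needs a holistic view (the budget) is small and cheap to compute, while the expensive per-PG bookkeeping never leaves the OSDs.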

As with cephalopods, it makes sense to distribute / delegate certain tasks and decision making, and centralize others, depending on the goals one wants to achieve (there are always trade-offs). I'm not sure whether the improvements Ronen is making are the result of a discussion within the Ceph (developer) community with this in mind, or whether changes / improvements in general follow this thought process.


I already have serious questions about why modules can only run on one manager instead of being distributed over several MGRs. However, on a deeper level, the whole concept fails when it comes to scaling. The question to keep in mind for any choice of method/algorithm should be

     How does this work on an infinitely large cluster?

You can also start with a galaxy-sized cluster if infinite is a bit too large. This raises some really interesting follow-up questions that people with large enough clusters are already starting to see become relevant today:

- What does a quorum mean (it's impossible to have complete information at a single point in space and time)?
- How can one operate a storage system with incomplete information?
- What is required for declaring a part of a cluster healthy (there is always an infinite amount of hardware down)?
- How can upgrades be performed (the cluster will be multi-version by nature)?
- How are upgrades even distributed?
- How do new ideas spread through the cluster without breaking inter-operability?
- What would a suitable nearest-neighbour network look like (neighbours are peers, peer MONs etc.)?
- How could networking on the neighbour graph be defined?
- How does one build OSD maps and CRUSH maps (there is always an infinite number of changes pending)?

This may sound silly, but thinking about such questions helps a lot in guiding development in a direction that will produce manageable, well-performing scale-out clusters. A first question to investigate would be: what are the minimum conditions for such a system to make sense (like a set of axioms)? Is it possible to formulate a set of conditions such that non-trivial infinite clusters exist (a trivial infinite cluster is simply the union of infinitely many finite and independent clusters, a purely formal construct with no consequence for any member cluster)? For example: every pool can only be finite. A finite number of monitors (a quorum) can only manage a finite number of pools. The range of responsibility of a quorum of monitors can only overlap with a finite number of other quorums. And so on. There is a whole theory to be developed. A very important conclusion for algorithms that we can already draw from this thought experiment is:

     Instead of central manager daemons, prefer cellular automata with desired emerging behaviour at scale.

Therefore, I would advocate trying to distribute and localise as much as possible instead of adding one manager module after the other. It's just piling up bottlenecks. Instead of a central manager, think about a cellular-automaton algorithm that produces a deep-scrub schedule which is not necessarily optimal, but which scales and still guarantees reasonably fast scrub cycles. Likewise with anything else.
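
To make this less abstract, here is a toy sketch (plain Python, all names hypothetical, not Ceph internals) of what such a local rule could look like: every OSD runs the same small step function, using only the ages of its own PGs and the number of directly neighbouring OSDs (peers it shares PGs with) that are currently scrubbing. A reasonable global scrub cycle then has to emerge from the local rule rather than being computed centrally:

import random

def local_scrub_step(my_pgs, peer_scrubbing, now,
                     target_interval=7 * 24 * 3600, max_peer_scrubs=2):
    """Return the PG this OSD should start deep-scrubbing this tick, or None.

    my_pgs:         list of dicts {"pgid": str, "last_deep_scrub": float}
    peer_scrubbing: number of direct peers currently scrubbing
    """
    # Back off if the local neighbourhood is already busy scrubbing.
    if peer_scrubbing >= max_peer_scrubs:
        return None
    overdue = [pg for pg in my_pgs
               if now - pg["last_deep_scrub"] > target_interval]
    if not overdue:
        return None
    # A small random back-off keeps all OSDs from firing in lock-step.
    if random.random() < 0.5:
        return None
    return max(overdue, key=lambda pg: now - pg["last_deep_scrub"])

Because the rule only ever consults a bounded neighbourhood, it costs the same on 10 OSDs as on 10 million OSDs; whether the emergent cycle time is acceptable is exactly the kind of property one would then have to analyse or simulate.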

For example, current clusters are starting to reach a size where HEALTH_OK is the exception rather than the rule. If MONs continue to hoard PG log history on HEALTH_WARN, without a concept of partial health on a subset of pools and without trimming history information on healthy subsets, they will overrun their disks. I think a number of tasks that MONs are doing today could be delegated to OSD threads, for example pool health. I'm afraid there is already a lot in ceph (specifically, in the MONs and MGRs) that does not scale, and I would avoid adding to that list at all cost. The downsides of this are likely to become important over the next 10 years on extremely large clusters.
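
As a toy illustration of the partial-health idea (simplified stand-ins, not the actual MON/OSD data structures): health could be evaluated per pool from its PG states, and extended history retained only for the pools that actually need it.

def pool_health(pg_states):
    """pg_states: PG state strings of one pool, e.g. ["active+clean", "active+degraded"]."""
    if all(s == "active+clean" for s in pg_states):
        return "HEALTH_OK"
    if any("down" in s or "incomplete" in s for s in pg_states):
        return "HEALTH_ERR"
    return "HEALTH_WARN"

def pools_safe_to_trim(pools):
    """pools: dict pool_name -> list of PG state strings.
    Healthy pools can have their history trimmed; unhealthy pools keep
    the extended history they may still need for recovery."""
    return [name for name, states in pools.items()
            if pool_health(states) == "HEALTH_OK"]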

Your concept of "meta clusters" is interesting. It might be possible to implement this with the centralized control plane and a distributed dataplane concept I mentioned before. You would basically have Ceph clusters that scale to a certain extent, and a central (distributed) control plane that handles all the meta clusters. Something like this might be possible. But I'm not sure if "infinite scalability" is a goal of the Ceph project.


As a guideline for devs, it is OK to provide APIs for manager modules so that users with finite clusters can hook into them and write and share their own bottlenecks. However, core development time should exclusively be spent on distributed algorithms that scale and would work on an infinitely large cluster.

Looking forward to the first inter-planetary ceph cluster :)

Gr. Stefan

[1]: https://research.facebook.com/file/761835991935894/Owl--Scale-and-Flexibility-in-Distribution-of-Hot-Content.pdf


