Hi Frank,
While I generally agree with your approach, I beg to differ in our specific use case. In some of these examples (deep scrub, scrub, capacity balancing) the cost of management (building a plan) is tiny compared to the cost of implementation (doing the scrub, moving data between OSDs). In these cases, if the algorithm that builds the plan is fast enough (i.e. not exponential in cluster size) and can run in slices (so it does not block any manager module), it is not that bad if it is executed centrally and sees the entire picture. Building an optimized plan that sees the entire picture in a distributed manner is not a simple task either, and eventually, my assumption is that we will want to implement this feature.
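Just to make the "run in slices" point concrete, here is a rough sketch (all names are hypothetical, not the actual mgr module API) of a plan builder written as a generator that hands control back after each bounded chunk of work, so a central planner never holds up the manager for long:

# Hypothetical sketch only -- these names are not part of the actual
# Ceph mgr module API; they just illustrate "fast enough and sliceable".

def build_scrub_plan(pg_stats, slice_size=1000):
    """Order PGs by last deep-scrub time, yielding control back to the
    caller after every `slice_size` PGs so other work can be interleaved."""
    ordered = sorted(pg_stats.items(), key=lambda kv: kv[1]['last_deep_scrub'])
    plan = []
    for i, (pgid, _stats) in enumerate(ordered):
        plan.append(pgid)            # most-overdue PGs first
        if (i + 1) % slice_size == 0:
            yield None               # one slice done, let the caller breathe
    yield plan                       # final yield carries the complete plan

def drive(planner):
    """Pull one slice at a time; the manager could serve other requests
    between iterations instead of sitting in one long blocking call."""
    plan = None
    for result in planner:
        if result is not None:
            plan = result
    return plan

The point is only that the cost of this loop is roughly linear in the number of PGs and can be interrupted at any slice boundary; the expensive part (actually scrubbing, moving data) stays fully distributed.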
I know this is not a good response from a computer science perspective, but I am sure we will have many other problems before we deploy the first infinitely large Ceph cluster. And I believe the approach we describe here is good enough for very large finite-sized clusters (under the assumption that the plan building algorithm is fast enough and sliceable...)
Overall, the management part is a tiny part of the effort/energy/work the cluster invests in scrubbing (which itself is performed in a distributed manner).
just my $.02
Regards,
Josh
On Tue, Jul 26, 2022 at 11:23 AM Frank Schilder <frans@xxxxxx> wrote:
Hi all,
I would also like to chip in. There seem to be two approaches here: a centralised deep-scrub manager module or a distributed algorithm. My main point, on a general level, is this:
Whenever I read "manager module/daemon", what I understand is "does not scale".
For a scale-out system like ceph, the basic idea of having a central instance orchestrating globally does not scale. In conclusion, such algorithms/orchestrators should never be implemented. I already have serious questions about why modules can only run on one manager instead of being distributed over several MGRs. However, on a deeper level, the whole concept fails when it comes to scaling. The question to keep in mind for any choice of method/algorithm should be
How does this work on an infinitely large cluster?
You can also start with a galaxy-sized cluster if infinite is a bit too large. This raises some really interesting follow-up questions that people with large enough clusters are starting to see becoming relevant already today:
- What does a quorum mean (it's impossible to have complete information at a single point in space and time)?
- How can one operate a storage system with incomplete information?
- What is required for declaring a part of a cluster healthy (there is always an infinite amount of hardware down)?
- How can upgrades be performed (the cluster will be multi-version by nature)?
- How are upgrades even distributed?
- How do new ideas spread through the cluster without breaking inter-operability?
- What would a suitable next-neighbour network look like (neighbours are peers, peer MONs, etc.)?
- How could networking on the neighbour graph be defined?
- How does one build OSD maps and crush maps (there is always an infinite amount of changes pending)?
This may sound silly, but thinking about such questions helps a lot in guiding development in a direction that will produce manageable, well-performing scale-out clusters. A first question to investigate would be what the minimum conditions are for such a system to make sense (like a set of axioms). Is it possible to formulate a set of conditions such that non-trivial infinite clusters exist (a trivial infinite cluster is simply the union of infinitely many finite and independent clusters, a purely formal construct with no consequence on any member cluster)? For example, every pool can only be finite. A finite number of monitors (a quorum) can only manage a finite number of pools. The range of responsibility of a quorum of monitors can only overlap with a finite number of other quorums. And so on. There is a whole theory to be developed. A very important conclusion for algorithms we can already draw from this thought experiment is:
Instead of central manager daemons, prefer cellular automata with desired emerging behaviour at scale.
Therefore, I would advocate trying to distribute and localise as much as possible instead of adding one manager module after the other. That is just piling up bottlenecks. Instead of a central manager, think about a cellular-automaton algorithm that produces a deep-scrub schedule which scales and guarantees fast scrub cycles; it need not produce optimal scrub cycles, but it must work at scale. Likewise with anything else.
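To make that a bit more tangible, here is a purely illustrative sketch (all names are hypothetical, nothing here is an existing Ceph interface) of a cellular-automaton-style rule: each OSD decides locally whether to start a deep scrub, using only its own PG state and what its direct peers report, and the cluster-wide scrub cycle emerges from those local decisions rather than from a central planner.

# Purely illustrative sketch of a local, cellular-automaton-style scrub rule.
# Each OSD looks only at its own PGs and at what its direct peers report;
# none of these names are part of any existing Ceph interface.

import time
import random

SCRUB_INTERVAL = 7 * 24 * 3600   # target: deep scrub every PG within a week
MAX_LOCAL_SCRUBS = 1             # per-OSD concurrency limit

def want_deep_scrub(pg, now):
    """Local rule: a PG becomes a scrub candidate once it is overdue."""
    return now - pg['last_deep_scrub'] > SCRUB_INTERVAL

def local_step(my_pgs, peer_busy_count, running, now=None):
    """One automaton step on a single OSD.

    my_pgs:          {pgid: {'last_deep_scrub': ts}} for PGs this OSD leads
    peer_busy_count: how many direct peers report an active scrub
    running:         number of scrubs this OSD is currently running
    Returns the pgid to start scrubbing, or None.
    """
    now = time.time() if now is None else now
    if running >= MAX_LOCAL_SCRUBS:
        return None
    # Back off probabilistically when neighbours are busy, so scrub load
    # spreads out in time without anyone seeing the whole cluster.
    if peer_busy_count and random.random() < peer_busy_count / (peer_busy_count + 1):
        return None
    candidates = [pgid for pgid, pg in my_pgs.items() if want_deep_scrub(pg, now)]
    if not candidates:
        return None
    # Most-overdue PG first; ties are irrelevant at this granularity.
    return min(candidates, key=lambda pgid: my_pgs[pgid]['last_deep_scrub'])

No node ever needs the full cluster state; whether such local rules give acceptable scrub-cycle guarantees is exactly the kind of emergent behaviour that would have to be analysed.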
For example, current clusters are starting to reach a size where HEALTH_OK is the exception rather than the rule. If MONs continue to hoard PG log history on HEALTH_WARN without understanding a concept of partial health on a subset of pools and without trimming history information on healthy subsets, they will overrun their disks. I think a number of tasks that MONs are doing today can be delegated to OSD threads, for example, pool health. I'm afraid there is already a lot in ceph (specifically, in the MONs and MGRs) that does not scale, and I would avoid adding to that list at all costs. The downsides of this are likely to become important over the next 10 years on extremely large clusters.
As a guideline for devs: it is OK to provide APIs for manager modules so that users with finite clusters can hook into them and write and share their own bottlenecks. However, core development time should exclusively be spent on distributed algorithms that scale and would work on an infinitely large cluster.
Looking forward to the first inter-planetary ceph cluster :)
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14