Re: Why you might want packages not containers for Ceph deployments

> The real point here:  From what I'm reading in this mailing list it
> appears that most non-developers are currently afraid to risk an upgrade to
> Octopus or Pacific.  If this is an accurate perception then THIS IS THE
> ONLY PROBLEM.

No, it is not. The problem is the erosion of code quality caused by a constant, short release cycle while complexity is increasing fast. It's a simple, unavoidable consequence of software development. A brilliant example was given by Janne Johannson.

It is a contradiction to want all three of these properties at the same time:

- constant increase of features (complexity)
- constant level of code quality
- constant release interval

You have to sacrifice one to get the other two, and with Nautilus, but at the latest with Octopus, we seem to have reached the point where the quite reasonable prediction is that future releases will reach EOL long before they are "rock solid".

In this context, I find it quite disturbing that nobody is even willing to discuss lengthening the release cycle from, say, 2 to 4 years. What is so important about pumping out one version after the other that the real issues caused by this speed are ignored?

The current perception is that newer releases are piling up more and more unresolved quality problems (such as an MDS cache performance bottleneck reported last week, caused by a loop over the entire cache that is executed for every operation when snapshots are present) as well as actual bugs like the one Janne reported earlier, because the devs don't have sufficient time to work on them.

Why is it so difficult to slow down? I don't get it.

> ... so your problem will be gone as soon as you A) contribute code
> yourself or B) pay someone to contribute code.

Yes, I would like to. However, this too is made difficult by the fast release cadence. Firstly, it is not only developers who matter for code improvement. The people running the software under real conditions and reporting back their findings are equally important. Dismissing a user because "they don't contribute code back" ridicules the value of the people who actually run the stuff and spend their time finding out what is causing problems. You cannot have one group without the other, and neither should look down on the other.

You need both for real success, and it is one of the strengths of ceph that a huge group of users exists who do this work, contributing immensely to the faith in ceph that allows businesses to flourish on top of it.

For example, I still run a Mimic cluster on which I recently ran into a probably extremely rare permanent degradation. I will never see HEALTH_OK again, which is more or less a prerequisite for upgrades. I'm more than willing to look into it, but I can't get the assistance I would need because the release is "out of support", even though everyone knows it doesn't matter in which version a bug is discovered: it stays in the code base until it is fixed.
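(For context, the pre-upgrade health check I mean is nothing exotic; it is simply something like

  ceph health detail   # should report HEALTH_OK with no warnings before an upgrade
  ceph -s              # overall cluster and PG status

and on this cluster it will never come back clean again.)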

Similarly, I was helping to track down an OSD memory leak that is almost certainly still present today. Our users suffered 2 months of slow ops caused by the one OSD on which I had enabled heap debugging (you need to do this on production set-ups; see the next point), and I collected a huge amount of data that probably exposes the leak. However, after I reported the results of that collection, I never heard anything back.
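(For readers who have not done this themselves: per-OSD heap debugging means the tcmalloc heap profiler, which is typically driven with commands roughly like the following - osd.12 is just a placeholder ID here, see the Ceph memory profiling documentation for the exact procedure on your release:

  ceph tell osd.12 heap start_profiler   # begin collecting allocation profiles
  ceph tell osd.12 heap stats            # print a summary of current heap usage
  ceph tell osd.12 heap dump             # write a profile dump for offline analysis
  ceph tell osd.12 heap stop_profiler    # stop profiling once enough data is collected

Profiling adds noticeable overhead, which is why you would typically enable it on only a single OSD in production.)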

I honestly do perceive a change in attitude since RH merged with IBM. It is not the good kind of change though. I'm afraid the devs are under pressure to be profitable and IBM itself doesn't care, or worse.

> Why RCs? Because our environments are so diverse, complete test 
> coverage will always be a challenge.

I personally don't really see the point of this (maybe others do). We have production clusters with real workloads where problems are detected. It is illusory to assume that rare issues, which only show up after a long run time or which require scale and real workloads, can be reproduced on a test cluster. Our set-up is a 500-node HPC cluster attached to an 11 PB ceph fs with a huge spectrum of workloads. What test cluster (including the client side) could conceivably simulate this?

I think it would make a lot more sense if the observations and discoveries made on production clusters - even, or in particular, those made after a long run time in the field - were incorporated for releases going back much further than 4 years. For a system like ceph, 4 years is nothing.

But - and here I come back to my main point - this would require a very scarce resource: time. That is what it is really all about. A slower release cadence would provide the time to look into long-term issues and hard challenges, for example with cache algorithms.

It would also, to add a secondly to the firstly above, give more people a chance to look at the code before it is EOL, at which point getting help becomes close to impossible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



