Re: CEPH Version choice

Hi,


I am aware of the problems Frank has raised. However, it should also be mentioned that many critical bugs have been fixed in the newer major versions; we work on such fixes ourselves.

Over the last 10 years, we and others have written a lot of in-house tooling to improve our migration, update, and upgrade paths and strategies.

We also test each new version for up to 6 months before putting it into production.

Our goal is always to run Ceph versions that still receive backports, and at the same time to use only the features we really need. Our developers also aim to get bug fixes upstream and into the supported releases.

Regarding performance, I recommend the Cephalocon presentations by Adam and Mark; they show what effort goes into improving Ceph performance in current and future versions.

Regards, Joachim


___________________________________
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 at 12:11, Frank Schilder wrote:
What are the main reasons for not upgrading to the latest and greatest?
Because more often than not it isn't.

I guess when you write "latest and greatest" you are talking about features. When we admins talk about "latest and greatest", we talk about stability. The days when one could move a production system onto a "stable" .2 release are long gone. Anyone who becomes an early adopter is increasingly likely to experience serious issues. Which leads to more admins waiting with upgrades. Which in turn leads to more bugs being discovered only in late point releases. Which again makes more admins postpone an upgrade. A vicious cycle.

A long time ago there was a discussion about exactly this problem, and the admins were pretty much in favor of stretching the release cycle to at least 4 years, if not longer. It's simply too many releases with too many serious bugs left unfixed, lately not even during their official lifetime. Octopus still has serious bugs but is EOL.

I'm not surprised that admins give up on upgrading entirely and stay on a version until their system dies.

To give one example from my own experience: upgrading from latest mimic to latest octopus. This experience almost certainly applies to every upgrade that involves an OSD format change (the infamous "quick fix" that could take several days per OSD and crash entire clusters).

There is an OSD conversion involved in this upgrade, and we found out that of the 2 possible upgrade paths, one leads to a heavily performance-degraded cluster with no way to recover other than redeploying all OSDs step by step. Funnily enough, the problematic procedure is the one described in the documentation - it has not been updated to this day, despite users still getting caught in this trap.
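
To make the pitfall concrete: the conversion is triggered on OSD start by a config switch, so one way to keep it under control is to disable the automatic conversion and run it per OSD with the daemon stopped. A minimal sketch, assuming a classic non-containerized deployment (verify the option and command names against the release notes of your target version; this only controls when the conversion runs, it is not by itself a fix for the degradation):

    # stop OSDs from running the omap conversion automatically on restart
    ceph config set osd bluestore_fsck_quick_fix_on_mount false

    # then convert one OSD at a time, offline, with the daemon stopped
    systemctl stop ceph-osd@$ID
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$ID --command quick-fix
    systemctl start ceph-osd@$ID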

To give you an idea of the amount of work now involved in trying to avoid such pitfalls, here is the path we took:

We set up a test cluster with a script producing a realistic workload and started testing the upgrade under load. It took about a month (repeating the upgrade every time with a cluster freshly deployed on mimic and populated from scratch) to confirm that we had found a robust path around a number of pitfalls - mainly the serious performance degradation due to the OSD conversion, but also an issue with stray entries, plus noise. A month! Once we were convinced it would work - meaning we had run it a couple of times without discovering further issues - we started upgrading our production cluster.
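
For reference, the load half of such a test does not need to be fancy. A minimal sketch using the stock rados bench tool (the pool name and durations are made-up examples; our actual script modeled our real workload):

    #!/bin/bash
    # keep a steady write/read load on a test pool while the upgrade runs
    while true; do
        rados bench -p testpool 60 write --no-cleanup
        rados bench -p testpool 60 seq
        rados -p testpool cleanup
    done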

It went smoothly until we started the OSD conversion of our FS metadata OSDs. They had a special performance-optimized deployment resulting in a large number of 100G OSDs at about 30-40% utilization. These OSDs started crashing with some weird corruption. It turns out - thanks Igor! - that while spill-over from the fast to the slow device was handled, the other direction was not. Our OSDs crashed because octopus apparently required substantially more space on the slow device and could not use the plentiful fast space that was actually available.
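
For anyone wanting to check how close they are to the same cliff: BlueFS usage of the fast and slow devices shows up in the OSD perf counters, so something like the following gives an early warning before OSDs start crashing (a sketch; counter and warning names may differ between releases):

    # per OSD, via the admin socket: compare db_used_bytes vs. slow_used_bytes
    ceph daemon osd.$ID perf dump bluefs

    # cluster-wide, spill-over also surfaces as a health warning
    ceph health detail | grep -i spillover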

The whole thing ended in 3 days of complete downtime and me working 12-hour days over the weekend. We managed to recover only because we had a larger hardware delivery already on-site and I could scavenge parts from it.

So the story is that even after 1 month of testing we still ran into 3 days of downtime, because there was yet another unannounced change that broke a config that had worked fine for years on mimic.

To say the same thing in different words: major version upgrades have become very disruptive and require a lot of effort to get halfway right. And I'm not even talking about the deployment system here.

Add to this the still-open cases discussed on this list - MDS dentry corruption, snapshots disappearing or getting corrupted combined with a lack of good built-in tools for detection and repair, performance degradation, etc. - none of them addressed even in pacific. In this state the devs are pushing for pacific to become EOL, while at the same time the admins become ever more reluctant to upgrade.
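
For completeness, the built-in detection that does exist today is the CephFS forward scrub, along the lines of the sketch below; it catches some, but by no means all, of the corruptions discussed on the list:

    # ask the active MDS to scrub the whole tree and repair what it can
    ceph tell mds.<name> scrub start / recursive,repair
    ceph tell mds.<name> scrub status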

In my specific case, I planned to upgrade at least to pacific this year, but my time budget simply does not allow for verifying the procedure and checking that all bugs relevant to us have been addressed. I gave up. Maybe next year. Maybe by then it's even a bit closer to rock solid.

So, to get back to my starting point: we admins actually value rock solid over features. I know that this is boring for devs, but nothing is worse than nobody using your latest and greatest - which probably was the motivation for your question. If the upgrade paths were more solid, and questions like "why does an OSD conversion not lead to an OSD identical to one deployed freshly" or "where does the performance go" were actually tracked down, we would be much less reluctant to upgrade.
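
On the "converted vs. freshly deployed OSD" question: one way to at least quantify part of the difference is the free-space fragmentation score that ceph-bluestore-tool can report, compared between a converted and a redeployed OSD (a sketch; run offline with the daemon stopped):

    # fragmentation of the BlueStore free-space allocator,
    # roughly 0 = unfragmented .. 1 = heavily fragmented
    systemctl stop ceph-osd@$ID
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$ID --command free-score
    systemctl start ceph-osd@$ID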

And then, but only then, would the latest and greatest features be of interest.

I will bring it up here again: with the complexity the code base has now reached, the 2-year release cadence is way too fast; releases do not mature quickly enough for admins to upgrade at the same pace. More and more admins will be several cycles behind, and we are reaching the point where major bugs in so-called EOL versions are discovered only after EOL, before large clusters have even reached that version. Which might become a fundamental blocker to upgrades entirely.

An alternative to stretching the release cycle would be to keep more releases in the supported lifetime instead of only the last 2 major ones. 4 years really is nothing when it comes to storage.

I hope this is helpful and sheds some light on the mystery of why admins don't want to move.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Konstantin Shalygin <k0ste@xxxxxxxx>
Sent: Monday, May 15, 2023 10:43 AM
To: Tino Todino
Cc: ceph-users@xxxxxxx
Subject:  Re: CEPH Version choice

Hi,

On 15 May 2023, at 11:37, Tino Todino <tinot@xxxxxxxxxxxxxxxxx> wrote:

What are the main reasons for not upgrading to the latest and greatest?
One of the main reasons is "we just can't", because your Ceph-based products will get worse real-world (not benchmark) performance; see [1]


[1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2E67NW6BEAVITL4WTAAU3DFLW7LJX477/


k
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



