Re: CEPH Version choice

Hi all, to avoid leaving a potentially wrong impression I would like to add a few words.

Slightly out of order:

> By the way, regarding performance, I recommend the Cephalocon
> presentations by Adam and Mark. There you can learn what efforts are
> being made to improve Ceph performance for current and future versions.

For me personally, the Ceph performance degradation due to removing the WAL re-use is not a problem, as it is predictable and the reasons for removing it are solid. A bit more worrying is the degradation over time; I know that work is being spent on it and that it is expensive to debug, because collecting data takes so long. I mentioned performance mainly because at least one other user explicitly called it a show-stopper.

I appreciate the effort put into this and am not complaining about a lack of effort. What I am complaining about is that this effort is under unnecessary pressure due to the short release cadence. Two years is not much time for a system like Ceph to mature, and starting the count at the .2 release seems a bit premature given recent experience.

> I know the problems that Frank has raised. However, it should also be
> mentioned that many critical bugs have been fixed in the major versions.
> We are working on the fixes ourselves.

Again, I know all this and I very much appreciate it. I still consider the volunteer Ceph support better than the support we got for enterprise systems. I got a lot of invaluable help from Igor during my upgrade experience, and I got some important stuff fixed by Xiubo recently. Just to repeat it, though: I am convinced one could reach much higher maturity with less time pressure - and maybe the one critical PR that causes so much trouble for some users would be forgotten less often.

> However, our goal is always to use Ceph versions that still get
> backports and, at the same time, to use only the features we really
> need. Our developers also always aim to bring bug fixes upstream and
> into the supported versions.

I would love to, but the version numbers are counting up too fast, and the problems with trying to keep up are a bit too much for a one-man army. After last time, I now have a hard time convincing users to take another possible hit.

If there were a bit more time to take a breath after the last upgrade, I would probably be able to do it. However, given my recent experience, I expect not to be able to catch up for the time being. We might even fall further behind, which is a shame, because we operate a large installation and tend to discover relevant bugs that don't show up in smaller systems with less load.

My wish, and my hypothesis, is simply that if the major release cycle were slowed down, many more operators would be able to follow, and the releases and upgrade procedures would become significantly more stable simply because more clusters would continuously be closer to the latest version.

For example: don't start counting the lifetime of a major version at its .2 release. Start when 50% of the top-10 (or 25, or 50) largest clusters in telemetry are on or above that version, and declare the version EOL once 90% of these clusters are at a newer major release. This would give a much better indication of how operators of significant installations perceive the maturity of major releases. It would also create an incentive to submit data to telemetry, as well as to help the last few percent over the upgrade hurdle so that a version can be declared EOL.

It would be the kind of community where the ones who fall behind get an extra helping hand so everyone can move on.
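
To make the criterion concrete, here is the rule in code form - a rough sketch only; the data model is made up for illustration, telemetry does not expose exactly these fields today:

    # Python sketch of the proposed lifecycle rule (hypothetical data model).

    def lifetime_started(version, top_clusters):
        # The clock starts once >= 50% of the largest telemetry
        # clusters run this major version or newer.
        on_or_above = sum(1 for v in top_clusters if v >= version)
        return on_or_above >= 0.5 * len(top_clusters)

    def is_eol(version, top_clusters):
        # EOL once >= 90% of the largest telemetry clusters have
        # moved on to a newer major release.
        above = sum(1 for v in top_clusters if v > version)
        return above >= 0.9 * len(top_clusters)

    # Example: major versions reported by the 10 largest clusters.
    top10 = [17, 16, 16, 17, 15, 16, 17, 16, 14, 16]
    print(lifetime_started(16, top10))  # True: 8/10 on Pacific or newer
    print(is_eol(15, top10))            # False: only 8/10 past Octopus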

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Joachim Kraftmayer - ceph ambassador <joachim.kraftmayer@xxxxxxxxx>
Sent: Monday, May 15, 2023 4:34 PM
To: Frank Schilder; Tino Todino
Cc: ceph-users@xxxxxxx
Subject: Re: Re: CEPH Version choice

Hi,


I know the problems that Frank has raised. However, it should also be
mentioned that many critical bugs have been fixed in the major versions.
We are working on the fixes ourselves.

Over the last 10 years, we and others have written a lot of tools for
ourselves to improve migration/update and upgrade paths and strategies.

From version to version, we also test for up to 6 months before putting
a release into production.

However, our goal is always to use Ceph versions that still get
backports and, at the same time, to use only the features we really
need. Our developers also always aim to bring bug fixes upstream and
into the supported versions.

By the way, regarding performance, I recommend the Cephalocon
presentations by Adam and Mark. There you can learn what efforts are
being made to improve Ceph performance for current and future versions.

Regards, Joachim


___________________________________
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 12:11, Frank Schilder wrote:
>> What are the main reasons for not upgrading to the latest and greatest?
> Because more often than not it isn't.
>
> I guess when you write "latest and greatest" you talk about features. When we admins talk about "latest and greatest", we talk about stability. The times when one could jump with a production system onto a "stable" release ending in .2 are long gone. Anyone who becomes an early adopter is more and more likely to experience serious issues. Which leads to more admins waiting with upgrades. Which in turn leads to more bugs being discovered only in late releases. Which again makes more admins postpone an upgrade. A vicious cycle.
>
> A long time ago there was a discussion about exactly this problem, and the admins were pretty much in favor of lengthening the release cycle to at least 4 years, if not longer. It's simply too many releases with too many serious bugs left unfixed, lately not even during their official lifetime. Octopus still has serious bugs but is EOL.
>
> I'm not surprised that admins give up on upgrading entirely and stay on a version until their system dies.
>
> To give you one example from my own experience: upgrading from latest Mimic to latest Octopus. This experience almost certainly applies to every upgrade that involves an OSD format change (the infamous "quick fix" that could take several days per OSD and crash entire clusters).
>
> There is an OSD conversion involved in this upgrade, and we found out that of the two possible upgrade paths, one leads to a heavily performance-degraded cluster with no way to recover other than redeploying all OSDs step by step. Funnily enough, the problematic procedure is the one described in the documentation - it still hasn't been corrected to this day, despite users still getting caught in this trap.
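>
> For readers who want the shape of the problem: the knob at the center of this is bluestore_fsck_quick_fix_on_mount, which controls whether the omap conversion runs implicitly the moment an upgraded OSD boots. If I recall the tooling correctly, keeping the conversion explicit and per-OSD looks roughly like this (an illustrative sketch only - the OSD id and path are placeholders, and this is not the verified procedure):
>
>     # keep the implicit omap conversion from running at OSD start
>     ceph config set osd bluestore_fsck_quick_fix_on_mount false
>     # later, per OSD and offline: convert, compact, restart
>     systemctl stop ceph-osd@0
>     ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-0
>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
>     systemctl start ceph-osd@0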
>
> To give you an idea of the amount of work now involved in trying to avoid such pitfalls, here is our path:
>
> We set up a test cluster with a script producing a realistic workload and started testing the upgrade under load. It took about a month (repeating the upgrade every time with a cluster freshly deployed and populated on Mimic) to confirm that we had found a robust path avoiding a number of pitfalls along the way - mainly the serious performance degradation due to the OSD conversion, but also an issue with stray entries, plus noise. A month! Once we were convinced that it would work - meaning we ran it a couple of times without discovering any further issues - we started upgrading our production cluster.
>
> It went smoothly until we started the OSD conversion of our FS metadata OSDs. They had a special performance-optimized deployment resulting in a large number of 100G OSDs at about 30-40% utilization. These OSDs started crashing with some weird corruption. It turns out - thanks, Igor! - that while spill-over from the fast to the slow drive was handled, the other direction was not. Our OSDs crashed because Octopus apparently required substantially more space on the slow device and couldn't use the plentiful fast space that was actually available.
>
> The whole thing ended in 3 days of complete downtime and me working 12-hour days on the weekend. We managed to recover from this only because we had a larger delivery of hardware already on-site and I could scavenge parts from it.
>
> So, the story was that, after a month of testing, we still ran into 3 days of downtime, because there was another unannounced change that broke a config that had been working fine for years on Mimic.
>
> To say the same thing in different words: major version upgrades have become very disruptive and require a lot of effort to get halfway right. And I'm not talking about the deployment system here.
>
> Add to this the still-open cases discussed on this list - MDS dentry corruption, snapshots disappearing or corrupting combined with a lack of good built-in tools for detection and repair, performance degradation, and so on - all not even addressed in Pacific. In this state, the devs are pushing for Pacific to become EOL while at the same time the admins become ever more reluctant to upgrade.
>
> In my specific case, I had planned to upgrade at least to Pacific this year, but my time budget simply doesn't allow for verifying the procedure and checking that all bugs relevant to us have been addressed. I gave up. Maybe next year. Maybe by then it's even a bit closer to rock solid.
>
> So, to get back to my starting point: we admins actually value rock-solid over features. I know that this is boring for devs, but nothing is worse than nobody using your latest and greatest - which probably was the motivation for your question. If the upgrade paths were more solid, and questions like "why does an OSD conversion not lead to an OSD identical to a freshly deployed one" or "where does the performance go" were actually tracked down, we would be much less reluctant to upgrade.
>
> And then, but only then, would the latest and greatest features be of interest.
>
> I will bring it up here again: with the complexity the code base has reached by now, the 2-year release cadence is way too fast; it doesn't provide sufficient maturity for upgrading quickly either. More and more admins will be several cycles behind, and we are reaching the point where major bugs in so-called EOL versions are discovered before large clusters have even reached that version. This might become a fundamental blocker to upgrades entirely.
>
> An alternative to lengthening the release cycle would be to keep more releases in the supported-lifetime loop instead of only the last 2 major releases. 4 years really is nothing when it comes to storage.
>
> Hope this is helpful and sheds some light on the mystery of why admins don't want to move.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Konstantin Shalygin <k0ste@xxxxxxxx>
> Sent: Monday, May 15, 2023 10:43 AM
> To: Tino Todino
> Cc: ceph-users@xxxxxxx
> Subject: Re: CEPH Version choice
>
> Hi,
>
>> On 15 May 2023, at 11:37, Tino Todino <tinot@xxxxxxxxxxxxxxxxx> wrote:
>>
>> What are the main reasons for not upgrading to the latest and greatest?
> One of the main reasons is "just can't": your Ceph-based products will get worse at real (not benchmark) performance; see [1].
>
>
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2E67NW6BEAVITL4WTAAU3DFLW7LJX477/
>
>
> k
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux