Re: Ceph release cadence

Brady Deetz <bdeetz@xxxxxxxxx> · Sat, 23 Sep 2017 00:39:02 -0500

I'll be the first to admit that most of my comments are anecdotal. But I suspect that, when it comes to storage, many of us don't need much to get scared back into our dark corners. In short, it seems that the dev team should get better at selecting features and delivering on the existing scheduled cadence before shortening it. To me, the odd releases represent feature previews for the next even release. If that's a fair way to look at them, they could play a very important role in the stability of the even release.

On Sep 22, 2017 8:59 PM, "Sage Weil" <sage@xxxxxxxxxxxx> wrote:
On Fri, 22 Sep 2017, Gregory Farnum wrote:

> On Fri, Sep 22, 2017 at 3:28 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:

> > Here is a concrete proposal for everyone to summarily shoot down (or

> > heartily endorse, depending on how your friday is going):

> >

> > - 9 month cycle

> > - enforce a predictable release schedule with a freeze date and

> >   a release date.  (The actual .0 release of course depends on no blocker

> >   bugs being open; not sure how zealous 'train' style projects do

> >   this.)

>

> Train projects basically commit to a feature freeze enough in advance

> of the expected release date that it's feasible, and don't let people

> fake it by rushing in stuff they "finished" the day before. I'm not

> sure if every-9-month LTSes will be more conducive to that or not — if

> we do scheduled releases, we still fundamentally need to be able to

> say "nope, that feature we've been saying for 9 months we hope to have

> out in this LTS won't make it until the next one". And we seem pretty

> bad at that.

I'll be the first to say I'm no small part of the "we" there.  But I'm

also suggesting that's not a reason not to try to do better.  As I

said I think this will be easier than in the past because we don't

have as many headline features we're trying to wedge in.

That's excellent as long as it actually happens. Otherwise the collective you may end up pushing worse code on a 9mo cycle than on the current theoretical 12mo cycle, which is delayed when necessary. We all know that software development never happens on time or on budget.

In any case, is there an alternative way to get to the much-desired

regular cadence?

> > - no more even/odd pattern; all stable releases are created equal.

> > - support upgrades from up to 3 releases back.

> >

> > This shortens the cycle a bit to relieve the "this feature must go in"

> > stress, without making it so short as to make the release pointless (e.g.,

> > infernalis, kraken).  (I also think that the feature pressure is much

> > lower now than it has been in the past.)

> >

> > This creates more work for the developers because there are more upgrade

> > paths to consider: we no longer have strict "choke points" (like all

> > upgrades must go through luminous).  We could reserve the option to pick

> > specific choke point releases in the future, perhaps taking care to make

> > sure these are the releases that go into downstream distros.  We'll need

> > to be more systematic about the upgrade testing.

>

> This sounds generally good to me — we did multiple-release upgrades

> for a long time, and stuff is probably more complicated now but I

> don't think it will actually be that big a deal.

>

> 3 releases back might be a bit much though — that's 27 months! (For

> luminous, the beginning of 2015. Hammer.)

I'm *much* happier with 2 :) so no complaint from me.  I just heard a lot

of "2 years" and 2 releases (18 months) doesn't quite cover it.  Maybe

it's best to start with that, though?  It's still an improvement over the

current ~12 months.
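
The window figures quoted above (27 months for 3 releases back, 18 months for 2) follow directly from the proposed 9-month cycle. A minimal sketch of that arithmetic, assuming every release lands exactly on schedule; the function name is hypothetical and not part of any Ceph tooling:

```python
# Hypothetical helper: only the 9-month cycle and the 2/3-release
# figures come from the thread; everything else is illustrative.
def upgrade_window_months(cycle_months: int, releases_back: int) -> int:
    """Age in months of the oldest release that can still upgrade directly,
    assuming each release ships exactly on schedule."""
    return cycle_months * releases_back


if __name__ == "__main__":
    for releases_back in (2, 3):
        months = upgrade_window_months(9, releases_back)
        print(f"{releases_back} releases back on a 9-month cycle: {months} months")
```

Any slip in a release date widens the real window, so these are lower bounds rather than guarantees.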

A lot of vulnerabilities and bugs can come out in one year. As such, I upgrade everything in my environment at least once a year. The "if it ain't broke don't fix it" mentality is usually more dangerous than an upgrade between minor releases. But I will say that as my Ceph environment grows, upgrades become increasingly difficult to manage, and my anxiety increases with every node I add to my growing 2PB cluster.

> > Somewhat separately, several people expressed concern about having stable

> > releases to develop against.  This is somewhat orthogonal to what users

> > need.  To that end, we can do a dev checkpoint every 1 or 2 months

> > (preferences?), where we fork a 'next' branch and stabilize all of the

> > tests before moving on.  This is good practice anyway to avoid

> > accumulating low-frequency failures in the test suite that have to be

> > squashed at the end.

>

> So this sounds like a fine idea to me, but how do we distinguish this

> from the intermediate stable releases?

>

> By which I mean, are we *really* going to do a stabilization branch

> that will never get seen by users? What kind of testing and bug fixing

> are we going to commit to doing against it, and how do we balance that

> effort with feature work?

>

> It seems like the same conflict we have now, only since the dev

> checkpoints are less important they'll lose more often. Then we'll end

> up having 9 months of scheduled work to debug for a user release

> instead of 5 months that slipped to 7 or 8...

What if we frame this stabilization period in terms of stability of the

test suite?  That gives us something concrete to aim for, lets us move on

when we reach some threshold, and aligns perfectly with the thing that

makes it hard to safely land new code (noisy test results)...
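
One way to picture "move on when we reach some threshold" is a pass-rate gate over recent test-suite runs. This is only a sketch under stated assumptions: the function name, the 10-run window, and the 95% threshold are all hypothetical, and nothing here reflects how the Ceph test infrastructure actually reports results.

```python
from typing import Sequence


def checkpoint_is_stable(
    run_passed: Sequence[bool],
    window: int = 10,         # hypothetical: number of most recent runs to consider
    threshold: float = 0.95,  # hypothetical: minimum pass rate to call it stable
) -> bool:
    """Return True when the pass rate over the last `window` suite runs
    is at least `threshold`. Too little history counts as not stable."""
    recent = list(run_passed)[-window:]
    if len(recent) < window:
        return False
    return sum(recent) / len(recent) >= threshold


if __name__ == "__main__":
    history = [True] * 9 + [False]
    print(checkpoint_is_stable(history))                 # one failure in ten runs
    print(checkpoint_is_stable(history, threshold=0.9))  # looser gate
```

The point of a gate like this is that the exit condition for a dev checkpoint is measurable, rather than tied to a calendar date.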

Everything below can be summarized as: a 9mo release cycle is reasonable as long as delivering rock-solid code is treated as just as important as meeting an arbitrary deadline. The fact that uncertainty plays a significant role in people holding back on upgrades from hammer should raise a flag.

Time for a story from a cephfs jewel operator:
About a year ago, I watched a webcast you did demonstrating the dev process for Ceph, where you patched a bug, ran the test suite, and packaged the patch live. Having just deployed a little over 1PB of cephfs, I was relieved to see that the Ceph process was organized and not a complete wild west.

Not long after that webcast, I very easily triggered a bug in Jewel that crashed both of my MDS servers (active/standby): I deleted a pool that was defined in the ceph.dir.layout xattr of a cephfs directory. A stupid mistake, yes, but a very easy one to make. John Spray managed to track down the bug, but it still resulted in a 12-hour outage, and there was absolutely nothing I could have done to resolve the issue other than reverse time or patch the MDS code myself (unlikely to happen quickly). I'm forever in debt to John for his work that day.

It's precisely that experience that makes me, emotionally, want to wait before upgrading to luminous. I desperately want Bluestore, but I also like my data and don't feel like restoring my entire cluster from tape because of a bug some other poor soul could trigger before me.

Features are excellent; I'd do many things to have cephfs ec-coded direct writes arrive on my cluster tomorrow. But I'm willing to wait in the name of stability and confidence. As such, bumping the release schedule from 12 to 9 months seems like it will just rush devs to "finalize" code for the major releases.

sage
_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
