Re: testing downgrades

John Spray <jspray@xxxxxxxxxx> · Tue, 4 Sep 2018 21:15:46 +0100

On Tue, Sep 4, 2018 at 8:31 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> There is no plan or expectation (on my part at least) to support
> downgrades across major versions (mimic -> luminous).  (IMO the complexity
> that would need to be introduced to allow this is not worth the
> investment.)
>
> However, there *is* an interest in testing (and being able to support)
> downgrades between minor versions.  For example, if you're running 13.2.2,
> and start rolling out 13.2.3 but things misbehave, it would be nice to be
> able to roll back to 13.2.2.

Related thought: if we start supporting+testing downgrades as
suggested (which sounds wise to me), then there'll be a higher review
burden to check that the proposed patches are downgrade-safe -- that
will be a lot easier if reviewers are only having to look at
essential/minimal bug fixes.  Currently we wave through backports of
code that worked well on master, but for downgrade safety we'll have
to look more closely.

So... perhaps we should also consider locking down the stable
branches more strictly? Currently we're pretty liberal about backports
that are not quite essential bug fixes.  I've certainly been one of
the people guilty of sneaking things into stable branches because I
wanted them there, but now that we have the 9-month major release
cadence, the urge to do that is subsiding.  We currently have the
dashboard backport to luminous in flight, so perhaps a new policy could
come into place when we enter the next major release cycle.

We could still have certain caveats for cases that can't affect
downgrades: the main one that springs to mind is to support non-bugfix
backports as long as they're to components that do no persistence.
That would enable things like continuously improving modules such as
prometheus.

John

>
> So.. what do we need to do to allow this?
>
> 1. Create a test suite that captures the downgrade cases.  We could start
> with a teuthology facet for the initial version and have another facet for
> the target version.  Teuthology can't enforce a strict ordering (i.e.,
> always a downgrade) but it's probably just as valuable to test the
> upgrade cases too.  The main challenge I see here is that we are regularly
> fixing bugs in the stable releases; since the tests are against older
> releases, problems we uncover are often things that we can't "fix" since
> it's existing code.
>
> It will probably (?) be the case that in the end we have known issues with
> downgrades with specific versions.
>
> What kind of workloads should we run?
>
> Should we repurpose the p2p suite to do this?  Right now it steps through
> every stable release in sequence.  Instead, we could add N facets that
> upgrade (or downgrade) between versions, and make sure that N >= the total
> number of point releases.  This would be more like an upgrade/downgrade
> "thrasher" in that case...
>
> 2. Consider downgrades when backporting any changes to stable releases.
> If we are adding fields to data structures, they need to work both in the
> upgrade case (which we're already good at) and the downgrade case.
> Usually the types of changes that make some behavior change happen in the
> major releases, but occasionally we make these changes in stable releases
> too.
>
> I can't actually think of a stable branch change that would be problematic
> right now... hopefully that's a good sign!
>
> Other thoughts?
> sage
>
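
To illustrate point 2 in the quoted text (a field added in a point release
must still decode after a downgrade), here is a minimal Python sketch.  It
assumes a versioned encoding where each structure carries its version, the
oldest decoder version that can still read it, and its payload length,
loosely in the spirit of Ceph's ENCODE_START/DECODE_START macros.  The names
encode_v2 and decode_v1, the fields, and the header layout are hypothetical
and chosen only for illustration; this is not Ceph's actual wire format.

```python
import struct

# Hypothetical header: struct version, oldest compatible decoder version,
# payload length in bytes (all little-endian).
HEADER = struct.Struct("<BBI")


def encode_v2(name: str, weight: int) -> bytes:
    """Newer point release: appends a `weight` field after the v1 fields."""
    raw = name.encode("utf-8")
    payload = struct.pack("<I", len(raw)) + raw + struct.pack("<I", weight)
    # compat stays at 1 because v1 decoders can still read the leading fields.
    return HEADER.pack(2, 1, len(payload)) + payload


def decode_v1(buf: bytes) -> str:
    """Older point release: knows only `name`, ignores anything appended later."""
    struct_v, compat_v, length = HEADER.unpack_from(buf, 0)
    if compat_v > 1:
        raise ValueError("encoding requires a newer decoder")
    # The length field bounds the payload, so trailing unknown fields are skipped.
    payload = buf[HEADER.size:HEADER.size + length]
    (name_len,) = struct.unpack_from("<I", payload, 0)
    return payload[4:4 + name_len].decode("utf-8")
```

With this layout, decode_v1(encode_v2("osd.0", 7)) returns "osd.0": the
older decoder simply ignores the appended weight.  A backport that instead
raised the compat version to 2 would make the older decoder refuse the data,
which is the kind of change a downgrade-safety review would need to catch.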