Re: testing downgrades

Sage Weil <sweil@xxxxxxxxxx> · Wed, 5 Sep 2018 20:30:20 +0000 (UTC)

On Tue, 4 Sep 2018, Vasu Kulkarni wrote:
> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > There is no plan or expectation (on my part at least) to support
> > downgrades across major versions (mimic -> luminous).  (IMO the complexity
> > that would need to be introduced to allow this is not worth the
> > investment.)
> >
> > However, there *is* an interest in testing (and being able to support)
> > downgrades between minor versions.  For example, if you're running 13.2.2,
> > and start rolling out 13.2.3 but things misbehave, it would be nice to be
> > able to roll back to 13.2.2.
> 
> I am personally -1 on downgrading the software for minor versions too.
> If for some reason, say, 13.2.3 is not working on a specific system, I
> think the ideal thing would be to stop the rolling upgrade at that stage
> and revert the node to its original state with minimal impact (a couple
> of OSDs in down state until a fresh install of 13.2.2 restores them).

I don't think it is realistic to expect users to upgrade in a way that 
lets them catch any issues with new versions before they go beyond a 
single failure domain (e.g., one host or rack).  Some issues might 
not even manifest until they do.  And even if they did do a single failure 
domain and notice the issue, they would be stuck running with a degraded 
system (or have to do a big rebalance) until a proper fix is available.

FWIW we already get flak from customers because we don't test and support 
this... I think this is a when, not an if.

> I think adding more tests to cover upgrade scenarios rather than 
> downgrade cases will be more helpful.

We should do that too.  Any feedback on what types of upgrade scenarios we 
should cover that we currently don't would be helpful...

Thanks!
sage

> 
> >
> > So.. what do we need to do to allow this?
> >
> > 1. Create a test suite that captures the downgrade cases.  We could start
> > with a teuthology facet for the initial version and have another facet for
> > the target version.  Teuthology can't enforce a strict ordering (i.e.,
> > always a downgrade) but it's probably just as valuable to test the
> > upgrade cases too.  The main challenge I see here is that we are regularly
> > fixing bugs in the stable releases; since the tests are against older
> > releases, problems we uncover are often things that we can't "fix" since
> > it's existing code.
> >
> > It will probably (?) be the case that in the end we have known issues with
> > downgrades with specific versions.
> >
> > What kind of workloads should we run?
> >
> > Should we repurpose the p2p suite to do this?  Right now it steps through
> > every stable release in sequence.  Instead, we could add N facets that
> > upgrade (or downgrade) between versions, and make sure that N >= the total
> > number of point releases.  This would be more like an upgrade/downgrade
> > "thrasher" in that case...
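
To make the "thrasher" idea concrete, here is a rough sketch of how a random upgrade/downgrade sequence could be generated.  This is not teuthology code: the function names, the hop format, and the version list in the usage note are all made up for illustration, and the only rule it enforces is that every release in the list shares one major version.

```python
import random


def version_key(version):
    """Turn '13.2.3' into (13, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def thrash_schedule(releases, steps, seed=None):
    """Return a list of (from_version, to_version, direction) hops.

    Hypothetical helper: each hop moves to a different point release
    picked at random, so the sequence mixes upgrades and downgrades.
    """
    if len(releases) < 2:
        raise ValueError("need at least two point releases")
    majors = {version_key(r)[0] for r in releases}
    if len(majors) != 1:
        raise ValueError("refusing to cross major versions")

    rng = random.Random(seed)
    current = rng.choice(releases)
    hops = []
    for _ in range(steps):
        # Never "move" to the version we are already on.
        target = rng.choice([r for r in releases if r != current])
        if version_key(target) > version_key(current):
            direction = "upgrade"
        else:
            direction = "downgrade"
        hops.append((current, target, direction))
        current = target
    return hops
```

For example, thrash_schedule(["13.2.0", "13.2.1", "13.2.2"], steps=3, seed=1) yields three hops, which satisfies N >= the number of point releases in that (illustrative) list.  Passing a seed keeps a failing sequence reproducible.
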
> >
> > 2. Consider downgrades when backporting any changes to stable releases.
> > If we are adding fields to data structures, they need to work both in the
> > upgrade case (which we're already good at) and the downgrade case.
> > Usually the changes that alter behavior land in the major releases,
> > but occasionally we make these changes in stable releases
> > too.
> >
> > I can't actually think of a stable branch change that would be problematic
> > right now... hopefully that's a good sign!
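
To illustrate what "work in both directions" means for a data structure that gains a field, here is a small sketch of a versioned encoding with a (version, compat version, length) header.  It is a generic technique written for illustration, not Ceph's actual encoding code; the record layout, the field names "name" and "weight", and the function names are all assumptions.

```python
import struct

# Header: struct version, oldest decoder version that can still read
# this encoding, and payload length in bytes (all little-endian).
HEADER = struct.Struct("<BBI")


def encode_v1(name):
    """Encoding written by the older release: just a name."""
    data = name.encode("utf-8")
    payload = struct.pack("<I", len(data)) + data
    return HEADER.pack(1, 1, len(payload)) + payload


def encode_v2(name, weight):
    """Encoding written by the newer release: appends a weight field."""
    data = name.encode("utf-8")
    payload = struct.pack("<I", len(data)) + data + struct.pack("<I", weight)
    # compat stays at 1: the new field is appended, so v1 readers still work.
    return HEADER.pack(2, 1, len(payload)) + payload


def decode(buf, my_version):
    """Decode as a reader that understands fields up to my_version."""
    struct_v, compat_v, length = HEADER.unpack_from(buf, 0)
    if compat_v > my_version:
        raise ValueError("encoding is too new for this decoder")
    payload = buf[HEADER.size:HEADER.size + length]

    (name_len,) = struct.unpack_from("<I", payload, 0)
    offset = 4
    name = payload[offset:offset + name_len].decode("utf-8")
    offset += name_len

    record = {"name": name, "weight": 1}  # default if the field is absent
    if struct_v >= 2 and my_version >= 2:
        (record["weight"],) = struct.unpack_from("<I", payload, offset)
    # Any bytes past the fields we understand are ignored; the length in
    # the header tells us where the record ends.
    return record
```

The upgrade case is decode(encode_v1("osd.0"), my_version=2), where the new reader fills in a default.  The downgrade case is decode(encode_v2("osd.0", 5), my_version=1), where the old reader skips the field it does not know.  Note that in this sketch the old reader drops the new field, so if it re-encodes the record the value is lost; that is exactly the kind of thing a backport review would need to consider.
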
> >
> > Other thoughts?
> > sage