On Tue, Sep 4, 2018 at 3:41 PM, Vasu Kulkarni <vakulkar@xxxxxxxxxx> wrote:
> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> There is no plan or expectation (on my part at least) to support
>> downgrades across major versions (mimic -> luminous). (IMO the complexity
>> that would need to be introduced to allow this is not worth the
>> investment.)
>>
>> However, there *is* an interest in testing (and being able to support)
>> downgrades between minor versions. For example, if you're running 13.2.2,
>> and start rolling out 13.2.3 but things misbehave, it would be nice to be
>> able to roll back to 13.2.2.
>
> I am personally -1 on downgrading the software for minor versions too.
> If for some reason, say, 13.2.3 is not working on a specific system, I think
> the ideal thing would be to stop the rolling upgrade at that stage and revert
> the node back to its original state with minimal impact (a couple of OSDs in
> the down state until a fresh install of 13.2.2 restores them).

That assumes you are able to detect a problem while upgrading. What if
you actually need to start using the cluster to detect a problem that
would benefit from a downgrade?

> I think adding more tests to cover upgrade scenarios rather than
> downgrade cases would be more helpful.

That always sounds like a great idea.

>> So.. what do we need to do to allow this?
>>
>> 1. Create a test suite that captures the downgrade cases. We could start
>> with a teuthology facet for the initial version and have another facet for
>> the target version. Teuthology can't enforce a strict ordering (i.e.,
>> always a downgrade), but it's probably just as valuable to also test the
>> upgrade cases too. The main challenge I see here is that we are regularly
>> fixing bugs in the stable releases; since the tests are against older
>> releases, problems we uncover are often things that we can't "fix" since
>> it's existing code.
>>
>> It will probably (?) be the case that in the end we have known issues with
>> downgrades for specific versions.
>>
>> What kind of workloads should we run?
>>
>> Should we repurpose the p2p suite to do this? Right now it steps through
>> every stable release in sequence. Instead, we could add N facets that
>> upgrade (or downgrade) between versions, and make sure that N >= the total
>> number of point releases. This would be more like an upgrade/downgrade
>> "thrasher" in that case...
>>
>> 2. Consider downgrades when backporting any changes to stable releases.
>> If we are adding fields to data structures, they need to work both in the
>> upgrade case (which we're already good at) and the downgrade case.
>> Usually the types of changes that alter behavior happen in the
>> major releases, but occasionally we make these changes in stable releases
>> too.
>>
>> I can't actually think of a stable branch change that would be problematic
>> right now... hopefully that's a good sign!
>>
>> Other thoughts?
>> sage
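
To make point 2 above concrete, here is a minimal, self-contained sketch of the
versioned-encoding idea (illustrative only; it does not use the real Ceph
ENCODE_START/DECODE_START macros, and all names and version numbers are made
up for the example). The point it shows: if a point release adds a field but
leaves the compat version alone, the older decoder can still read the blob by
skipping the trailing bytes it doesn't understand, so a downgrade keeps working;
bumping the compat version is what would break it.

```cpp
// Sketch of length-prefixed, versioned encoding (not the actual Ceph macros).
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>
#include <iostream>

// Append a POD value to a byte buffer.
template <typename T>
void put(std::vector<uint8_t>& buf, const T& v) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
  buf.insert(buf.end(), p, p + sizeof(T));
}

// Read a POD value at offset, advancing the offset.
template <typename T>
T get(const std::vector<uint8_t>& buf, size_t& off) {
  T v;
  std::memcpy(&v, buf.data() + off, sizeof(T));
  off += sizeof(T);
  return v;
}

// Hypothetical "13.2.3" encoder: struct_v bumped to 2 for the new field,
// but compat_v stays at 1 so older decoders are still allowed to read it.
std::vector<uint8_t> encode_v2(uint32_t old_field, uint32_t new_field) {
  std::vector<uint8_t> buf;
  put<uint8_t>(buf, 2);                 // struct_v
  put<uint8_t>(buf, 1);                 // compat_v: oldest decoder that can cope
  std::vector<uint8_t> body;
  put<uint32_t>(body, old_field);
  put<uint32_t>(body, new_field);       // field added in the point release
  put<uint32_t>(buf, static_cast<uint32_t>(body.size()));  // length prefix
  buf.insert(buf.end(), body.begin(), body.end());
  return buf;
}

// Hypothetical "13.2.2" decoder: only understands struct_v 1, but tolerates
// newer encodings as long as compat_v <= 1, skipping the bytes it can't parse.
uint32_t decode_v1(const std::vector<uint8_t>& buf) {
  size_t off = 0;
  uint8_t struct_v = get<uint8_t>(buf, off);
  uint8_t compat_v = get<uint8_t>(buf, off);
  if (compat_v > 1)
    throw std::runtime_error("encoding too new");  // this is what breaks downgrades
  uint32_t len = get<uint32_t>(buf, off);
  size_t end = off + len;
  uint32_t old_field = get<uint32_t>(buf, off);
  (void)struct_v;   // struct_v > 1 means extra fields follow; we just skip them
  off = end;
  return old_field;
}

int main() {
  auto blob = encode_v2(42, 7);          // written by the newer point release
  std::cout << decode_v1(blob) << "\n";  // still readable after rolling back: 42
}
```

So the backport review question is less "did we add a field?" and more "did we
raise the compat requirement, or change the meaning of existing fields?" --
those are the changes that would need to show up as known issues in a
downgrade suite.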