Re: testing downgrades


On Tue, Sep 4, 2018 at 1:01 PM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> On Tue, Sep 4, 2018 at 3:41 PM, Vasu Kulkarni <vakulkar@xxxxxxxxxx> wrote:
>> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> There is no plan or expectation (on my part at least) to support
>>> downgrades across major versions (mimic -> luminous).  (IMO the complexity
>>> that would need to be introduced to allow this is not worth the
>>> investment.)
>>>
>>> However, there *is* an interest in testing (and being able to support)
>>> downgrades between minor versions.  For example, if you're running 13.2.2,
>>> and start rolling out 13.2.3 but things misbehave, it would be nice to be
>>> able to roll back to 13.2.2.
>>
>> I am personally -1 on downgrading the software for minor versions too.
>> If, for some reason, 13.2.3 is not working on a specific system, I think
>> the ideal thing would be to stop the rolling upgrade at that stage and
>> revert the node back to its original state with minimal impact (a couple
>> of OSDs stay down until a fresh install of 13.2.2 restores them).
>
> That is assuming you are able to detect a problem while upgrading.
> What if you actually need to start using the cluster to detect a
> problem that would benefit from a downgrade?
In high-availability configurations (which is mostly the case) you are always
using the cluster, even during upgrades. In the case of minor upgrades it is
much easier, since OSD compatibility is not a big issue the way it is for
jewel -> luminous, etc.

The work will be for the orchestration layer to ensure the rolling upgrade is
working properly.

>
>>
>> I think adding more tests to cover upgrade scenarios, rather than
>> downgrade cases, will be more helpful.
>
> That always sounds like a great idea
>>
>>>
>>> So.. what do we need to do to allow this?
>>>
>>> 1. Create a test suite that captures the downgrade cases.  We could start
>>> with a teuthology facet for the initial version and another facet for
>>> the target version.  Teuthology can't enforce a strict ordering (i.e.,
>>> always a downgrade), but it's probably just as valuable to test the
>>> upgrade cases too.  The main challenge I see here is that we are regularly
>>> fixing bugs in the stable releases; since the tests run against older
>>> releases, the problems we uncover are often things we can't "fix" because
>>> the code has already shipped.
>>>
>>> It will probably (?) be the case that in the end we have known issues with
>>> downgrades with specific versions.
>>>
>>> What kind of workloads should we run?
>>>
>>> Should we repurpose the p2p suite to do this?  Right now it steps through
>>> every stable release in sequence.  Instead, we could add N facets that
>>> upgrade (or downgrade) between versions, and make sure that N >= the total
>>> number of point releases.  This would be more like an upgrade/downgrade
>>> "thrasher" in that case...
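[To make the facet idea above concrete, here is a minimal sketch in plain
Python -- hypothetical version lists, not actual teuthology suite code.
Crossing an "initial version" facet with a "target version" facet yields
every ordered pair, so both upgrade and downgrade directions get scheduled,
and a random walk across the point releases gives the "thrasher" behaviour.]

    # Minimal sketch, not teuthology code: cross an "initial version" facet
    # with a "target version" facet, then build a random upgrade/downgrade walk.
    import random
    from itertools import product

    point_releases = ["13.2.0", "13.2.1", "13.2.2", "13.2.3"]  # hypothetical list

    def as_tuple(v):
        # "13.2.2" -> (13, 2, 2), so versions compare numerically
        return tuple(int(x) for x in v.split("."))

    # Facet cross: every ordered (initial, target) pair covers both directions.
    for initial, target in product(point_releases, repeat=2):
        if initial == target:
            continue
        step = "upgrade" if as_tuple(initial) < as_tuple(target) else "downgrade"
        print(f"install {initial}, move to {target}  ({step})")

    # "Thrasher" variant: a random walk of N steps across the point releases.
    walk = [random.choice(point_releases)]
    for _ in range(len(point_releases)):
        walk.append(random.choice([v for v in point_releases if v != walk[-1]]))
    print(" -> ".join(walk))

[Each printed pair would correspond to one facet combination in the suite;
the random-walk form is closer to the thrasher idea, with N at least the
number of point releases as noted above.]
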
>>>
>>> 2. Consider downgrades when backporting any changes to stable releases.
>>> If we are adding fields to data structures, they need to work both in the
>>> upgrade case (which we're already good at) and in the downgrade case.
>>> Usually the types of changes that alter behavior happen in the major
>>> releases, but occasionally we make these changes in stable releases
>>> too.
>>>
>>> I can't actually think of a stable branch change that would be problematic
>>> right now... hopefully that's a good sign!
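[On the added-fields point above, the downgrade concern is that a structure
encoded by the newer point release must still decode correctly on the older
one.  A conceptual sketch in plain Python, with made-up field names -- not
Ceph's actual C++ versioned-encoding macros: bound the payload with a version
and a length so a decoder that knows fewer fields can skip the bytes it does
not understand.]

    # Conceptual sketch of length-prefixed, versioned encoding: an "old" decoder
    # reads only the fields it knows and skips any trailing bytes a newer
    # encoder appended.  Field names and layout are made up for illustration.
    import struct

    def encode(fields, version):
        payload = b"".join(struct.pack("<I", len(f)) + f for f in fields)
        return struct.pack("<II", version, len(payload)) + payload  # version, length header

    def decode(buf, known_fields):
        version, length = struct.unpack_from("<II", buf, 0)
        off = 8
        out = []
        for _ in range(known_fields):            # only what this release understands
            (flen,) = struct.unpack_from("<I", buf, off)
            out.append(buf[off + 4:off + 4 + flen])
            off += 4 + flen
        return out                               # bytes up to 8 + length are skipped

    new_blob = encode([b"fsid", b"epoch", b"field_added_later"], version=2)
    print(decode(new_blob, known_fields=2))      # old code still sees [b'fsid', b'epoch']

[The upgrade direction is the part already handled well; the downgrade case
additionally requires that the older decoder tolerates, or the newer encoder
avoids, fields it cannot interpret, which is worth checking when backporting.]
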
>>>
>>> Other thoughts?
>>> sage
>>>


