Re: testing downgrades

On Wed, Sep 5, 2018 at 1:30 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Tue, 4 Sep 2018, Vasu Kulkarni wrote:
>> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > There is no plan or expectation (on my part at least) to support
>> > downgrades across major versions (mimic -> luminous).  (IMO the complexity
>> > that would need to be introduced to allow this is not worth the
>> > investment.)
>> >
>> > However, there *is* an interest in testing (and being able to support)
>> > downgrades between minor versions.  For example, if you're running 13.2.2,
>> > and start rolling out 13.2.3 but things misbehave, it would be nice to be
>> > able to roll back to 13.2.2.
>>
>> I am personally -1 on downgrading the software for minor versions too.
>> If, for some reason, 13.2.3 is not working on a specific system, I
>> think the ideal thing would be to stop the rolling upgrade at that stage
>> and revert the node back to its original state with minimal impact (a
>> couple of OSDs in the down state until a fresh install of 13.2.2
>> restores them).
>
> I don't think it is realistic to expect users to upgrade in a way that
> lets them catch any issues with new versions before they go beyond a
> single failure domain (e.g., one host or rack).  Some issues might
> not even manifest until they do.  And even if they did do a single failure
> domain and notice the issue, they would be stuck running with a degraded
> system (or have to do a big rebalance) until a proper fix is available.
For a single failure domain, I actually meant the same downgrade here,
but done in a different way: uninstall and reinstall the old version,
restore the mon db, and bring the OSDs back quickly (like an OS
reinstall case).

If the problem can only be detected after all nodes are upgraded, then
it's a different case.
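
To make that concrete, here is a rough, untested sketch of the per-node
rollback I mean; the version number, backup path, and yum-based
packaging are assumptions, not a recommended procedure:

#!/usr/bin/env python3
# Sketch: stop the Ceph daemons on one node, reinstall the previous
# point release, restore the mon store from a pre-upgrade backup, and
# bring the daemons back up.
import subprocess

OLD_VERSION = "13.2.2"                            # assumed previous release
MON_STORE = "/var/lib/ceph/mon/ceph-a/store.db"   # illustrative mon id
MON_BACKUP = "/root/mon-store-backup/store.db"    # hypothetical backup

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop the daemons on this node (a couple of OSDs go down, as noted).
run(["systemctl", "stop", "ceph-mon.target", "ceph-osd.target"])

# 2. Reinstall the old version (yum shown; apt would pin ceph=13.2.2-...).
run(["yum", "-y", "downgrade", "ceph-" + OLD_VERSION])

# 3. Restore the mon db from the backup taken before the upgrade.
run(["rm", "-rf", MON_STORE])
run(["cp", "-a", MON_BACKUP, MON_STORE])

# 4. Bring everything back; the cluster recovers the down OSDs.
run(["systemctl", "start", "ceph-mon.target", "ceph-osd.target"])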

>
> FWIW we already get flak from customers because we don't test and support
> this... I think this is a when, not an if.
>
>> I think adding more tests to cover upgrade scenarios rather than
>> downgrade cases will be more helpful.
>
> We should do that too.  Any feedback on what types of upgrade scenarios we
> should cover that we currently don't would be helpful...
I will look into the current cases, but mostly a mix of
filestore/bluestore/ec for rgw/rbd/fs workloads in continuous online
mode.  If some of them are outside the upgrade suites, we can probably
bring them into the same suite.  With downgrades in the picture, the
number of test cases will be 2x.
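
Roughly, the combinatorics look like this (the facet values here are
illustrative, not the actual suite contents):

# Back-of-the-envelope sketch of how the job matrix grows once
# downgrades are added: every (object store, workload) combination now
# needs both directions, hence the 2x.
import itertools

stores = ["filestore", "bluestore", "bluestore+ec"]
workloads = ["rgw", "rbd", "fs"]
directions = ["upgrade", "downgrade"]   # the 2x factor

matrix = list(itertools.product(stores, workloads, directions))
for store, workload, direction in matrix:
    print("%s: %s / %s (continuous online I/O)" % (direction, store, workload))
print("total jobs:", len(matrix))       # 3 * 3 * 2 = 18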

>
> Thanks!
> sage
>
>
>>
>> >
>> > So.. what do we need to do to allow this?
>> >
>> > 1. Create a test suite that captures the downgrade cases.  We could start
>> > with a teuthology facet for the initial version and have another facet for
>> > the target version.  Teuthology can't enforce a strict ordering (i.e.,
>> > always a downgrade) but it's probably just as valuable to also test the
>> > upgrade cases too.  The main challenge I see here is that we are regularly
>> > fixing bugs in the stable releases; since the tests are against older
>> > releases, problems we uncover are often things that we can't "fix" since
>> > it's existing code.
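For illustration, a minimal sketch of the two-facet idea: teuthology
would take the cross product of the facets, so both orderings appear,
and the pairs where the target is older are the downgrade cases we get
for free.  The version list here is an assumption:

# Sketch: cross product of an initial-version facet and a
# target-version facet.  Lexicographic comparison is fine here only
# because the illustrative versions differ in a single digit.
import itertools

initial_facet = ["13.2.0", "13.2.1", "13.2.2", "13.2.3"]
target_facet = ["13.2.0", "13.2.1", "13.2.2", "13.2.3"]

for a, b in itertools.product(initial_facet, target_facet):
    if a == b:
        continue
    kind = "upgrade" if a < b else "downgrade"
    print("%s -> %s (%s)" % (a, b, kind))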
>> >
>> > It will probably (?) be the case that in the end we have known issues with
>> > downgrades on specific versions.
>> >
>> > What kind of workloads should we run?
>> >
>> > Should we repurpose the p2p suite to do this?  Right now it steps through
>> > every stable release in sequence.  Instead, we could add N facets that
>> > upgrade (or downgrade) between versions, and make sure that N >= the total
>> > number of point releases.  This would be more like an upgrade/downgrade
>> > "thrasher" in that case...
>> >
>> > 2. Consider downgrades when backporting any changes to stable releases.
>> > If we are adding fields to data structures, they need to work both in the
>> > upgrade case (which we're already good at) and the downgrade case.
>> > Usually the types of changes that alter behavior happen in the major
>> > releases, but occasionally we make such changes in stable releases
>> > too.
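As an illustration of what "work both ways" means at the encoding
level, here is a toy Python sketch modeled loosely on the
ENCODE_START/DECODE_START convention: a struct version, a compat
version, and a byte length let an older decoder check that it still
understands the struct and skip any trailing fields a newer point
release appended (the field layout here is made up):

# Toy length-prefixed, versioned encoding; not Ceph's actual wire format.
import struct, io

def encode(version, compat, payload):
    return struct.pack("<BBI", version, compat, len(payload)) + payload

def decode(buf, my_version):
    version, compat, length = struct.unpack("<BBI", buf.read(6))
    if compat > my_version:
        raise ValueError("struct too new; downgrade not possible")
    body = buf.read(length)
    return version, body   # caller parses the fields it knows, skips the rest

# A 13.2.3 encoder appends a new field but keeps compat at 1, so a
# downgraded 13.2.2 decoder (my_version=1) still succeeds and ignores it.
blob = encode(version=2, compat=1, payload=b"old-fields|new-field")
v, body = decode(io.BytesIO(blob), my_version=1)
print(v, body)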
>> >
>> > I can't actually think of a stable branch change that would be problematic
>> > right now... hopefully that's a good sign!
>> >
>> > Other thoughts?
>> > sage
>> >
>>
>>


