Re: testing downgrades

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 6 Sep 2018 15:01:07 -0700

On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> There is no plan or expectation (on my part at least) to support
> downgrades across major versions (mimic -> luminous).  (IMO the complexity
> that would need to be introduced to allow this is not worth the
> investment.)
>
> However, there *is* an interest in testing (and being able to support)
> downgrades between minor versions.  For example, if you're running 13.2.2,
> and start rolling out 13.2.3 but things misbehave, it would be nice to be
> able to roll back to 13.2.2.
>
> So.. what do we need to do to allow this?
>
> 1. Create a test suite that captures the downgrade cases.  We could start
> with a teuthology facet for the initial version and have another facet for
> the target version.  Teuthology can't enforce a strict ordering (i.e.,
> always a downgrade) but it's probably just as valuable to also test the
> upgrade cases too.  The main challenge I see here is that we are regularly
> fixing bugs in the stable releases; since the tests are against older
> releases, problems we uncover are often things that we can't "fix" since
> it's existing code.
>
> It will probably (?) be the case that in the end we have known issues with
> downgrades with specific versions.
>
> What kind of workloads should we run?
>
> Should we repurpose the p2p suite to do this?  Right now it steps through
> every stable release in sequence.  Instead, we could add N facets that
> upgrade (or downgrade) between versions, and make sure that N >= the total
> number of point releases.  This would be more like an upgrade/downgrade
> "thrasher" in that case...

It may just be bias from what I've been thinking about recently, but it
seems like the most valuable thing to do here is to generate the encoded
structures from each rc and make sure the previous version of the code
can read them. In other words, generate a ceph-object-corpus for each
existing release and proposed rc (which I've been planning to start
pushing on, but haven't yet, so I don't even remember how we generate
them!), then run the encode and decode checks with the *old* software.
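
Roughly what I have in mind, as a sketch and not a tested tool. It
assumes the corpus is laid out as objects/<type>/<file>, that the old
release's ceph-dencoder accepts "type T import FILE decode dump_json",
and that it exits non-zero when a decode fails. The function names are
made up for illustration.

```python
import subprocess
from pathlib import Path


def dencoder_cmd(binary, type_name, obj_path):
    # Assumed invocation: <binary> type T import FILE decode dump_json
    return [str(binary), "type", type_name, "import", str(obj_path),
            "decode", "dump_json"]


def decode_with_binary(binary):
    """Build a decode(type_name, obj_path) -> bool check around an old binary."""
    def decode(type_name, obj_path):
        result = subprocess.run(
            dencoder_cmd(binary, type_name, obj_path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Assumption: a non-zero exit status means the decode failed.
        return result.returncode == 0
    return decode


def find_undecodable(objects_dir, decode):
    """Return sorted (type_name, file_name) pairs that decode() rejects."""
    failures = []
    for type_dir in sorted(Path(objects_dir).iterdir()):
        if not type_dir.is_dir():
            continue
        for obj in sorted(type_dir.iterdir()):
            if obj.is_file() and not decode(type_dir.name, obj):
                failures.append((type_dir.name, obj.name))
    return failures
```

You would point find_undecodable() at the objects directory generated
by the newer rc and pass decode_with_binary() the path to the older
release's ceph-dencoder; anything it returns is a structure the old
code can't read.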

Running downgrade workloads may also catch a few other categories of
bugs, mostly wire-protocol sorts of things, but I think the existing
upgrade tests are actually not super likely to trigger these issues if
we simply turn them around? And whereas issues with wire protocols or
other ephemeral state can be worked around by turning the cluster off,
missing something in the on-disk state means you can't downgrade at
all.
-Greg

>
> 2. Consider downgrades when backporting any changes to stable releases.
> If we are adding fields to data structures, they need to work both in the
> upgrade case (which we're already good at) and the downgrade case.
> Usually the types of changes that make some behavior change happen in the
> major releases, but occasionally we make these changes in stable releases
> too.
>
> I can't actually think of a stable branch change that would be problematic
> right now... hopefully that's a good sign!
>
> Other thoughts?
> sage
>