Re: testing downgrades

Sage Weil <sweil@xxxxxxxxxx> · Thu, 6 Sep 2018 22:15:02 +0000 (UTC)

On Thu, 6 Sep 2018, Gregory Farnum wrote:
> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > There is no plan or expectation (on my part at least) to support
> > downgrades across major versions (mimic -> luminous).  (IMO the complexity
> > that would need to be introduced to allow this is not worth the
> > investment.)
> >
> > However, there *is* an interest in testing (and being able to support)
> > downgrades between minor versions.  For example, if you're running 13.2.2,
> > and start rolling out 13.2.3 but things misbehave, it would be nice to be
> > able to roll back to 13.2.2.
> >
> > So.. what do we need to do to allow this?
> >
> > 1. Create a test suite that captures the downgrade cases.  We could start
> > with a teuthology facet for the initial version and have another facet for
> > the target version.  Teuthology can't enforce a strict ordering (i.e.,
> > always a downgrade) but it's probably just as valuable to also test the
> > upgrade cases too.  The main challenge I see here is that we are regularly
> > fixing bugs in the stable releases; since the tests are against older
> > releases, problems we uncover are often things that we can't "fix" since
> > it's existing code.
> >
> > It will probably (?) be the case that in the end we have known issues with
> > downgrades with specific versions.
> >
> > What kind of workloads should we run?
> >
> > Should we repurpose the p2p suite to do this?  Right now it steps through
> > every stable release in sequence.  Instead, we could add N facets that
> > upgrade (or downgrade) between versions, and make sure that N >= the total
> > number of point releases.  This would be more like an upgrade/downgrade
> > "thrasher" in that case...
> 
> It may just be bias in my recent thoughts, but it seems like the most
> valuable thing to do here is to generate the encoded structures from
> each rc and make sure the previous version of the code can read them.
> In other words, generate a ceph-object-corpus for each existing
> release and proposed rc (which I've been planning to start pushing on,
> but haven't yet, so I don't even remember how we generate them!), then
> run the encode and decode with the *old* software.

Yes!!  I like this because it also kills a couple birds with one stone: if 
we automate the process of generating corpus objects then we can also make 
sure we're covering all the usual upgrade cases too.

The challenge is that generating corpus objects means a custom 
build and then running a sufficiently broad set of workloads to 
instantiate as many different and interesting object instances as 
possible.  The

//#define ENCODE_DUMP_PATH /tmp/something

needs to be defined, everything built, and then the system run with a 
bunch of workloads.  This fills /tmp/something with a bazillion object 
instances.  There are then some scripts that dedup and try to pick out 
a sample with varying sizes etc.
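
For illustration only, the dedup-and-sample step could look roughly like the 
sketch below.  This is not the existing corpus tooling: the function name, 
the per_type parameter, and the directory layout (one subdirectory per type 
under the dump path) are all assumptions.

```python
import hashlib
from pathlib import Path


def dedup_and_sample(dump_dir, per_type=3):
    """Drop byte-identical dumps, then keep a size-spread sample per type.

    Assumed (hypothetical) layout: <dump_dir>/<type_name>/<instance file>.
    Returns {type_name: [Path, ...]} ordered from smallest to largest.
    """
    if per_type < 2:
        raise ValueError("per_type must be at least 2")

    by_type = {}
    for path in sorted(Path(dump_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        # Keep only the first file seen for each distinct content hash.
        by_type.setdefault(path.parent.name, {}).setdefault(digest, path)

    sample = {}
    for type_name, unique in by_type.items():
        ordered = sorted(unique.values(),
                         key=lambda p: (p.stat().st_size, p.name))
        if len(ordered) <= per_type:
            sample[type_name] = ordered
            continue
        # Evenly spaced indices, so the smallest and largest always survive.
        step = (len(ordered) - 1) / (per_type - 1)
        picks = sorted({round(i * step) for i in range(per_type)})
        sample[type_name] = [ordered[i] for i in picks]
    return sample
```

Picking by size spread is only one heuristic; it says nothing about which 
optional fields an instance exercises, so a real selection would probably 
want more than this.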

I don't have any bright ideas on how to do that easily and in an automated 
way, though.. we presumably want to do it right before release to make 
sure everything is kosher (to ensure compat with everything in the 
corpus), and also generate objects on actual releases (or builds of the 
same sha1 + the above #define) to populate the corpus with that release.
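
As a rough sketch of what the "is everything in the corpus still readable" 
check might look like once objects exist: walk the corpus and hand each 
object to a decoder from the build under test.  The layout (.../<type>/<object>) 
and both function names are assumptions, and the ceph-dencoder argument 
sequence should be checked against the installed version before relying on it.

```python
import subprocess
from pathlib import Path


def check_corpus(corpus_dir, decode):
    """Call decode(type_name, path) for every object; collect failures.

    Assumed (hypothetical) layout: <corpus_dir>/.../<type_name>/<object>.
    Returns a list of (path, error message) for objects that did not decode.
    """
    failures = []
    for path in sorted(Path(corpus_dir).rglob("*")):
        if not path.is_file():
            continue
        type_name = path.parent.name
        try:
            decode(type_name, path)
        except Exception as exc:  # any decoder error counts as a failure
            failures.append((path, str(exc)))
    return failures


def dencoder_decode(type_name, path, binary="ceph-dencoder"):
    """Hypothetical decoder: shell out to the ceph-dencoder of the build
    being checked.  Raises CalledProcessError if it exits non-zero."""
    subprocess.run(
        [binary, "type", type_name, "import", str(path),
         "decode", "dump_json"],
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
```

For the downgrade case, the idea would be to run 
check_corpus(new_release_objects, dencoder_decode) with the *old* release's 
binary, and fail the run if the returned list is non-empty.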

:/

> There may be a few other categories of bugs we can find and detect
> with wire protocol sorts of things by running downgrade workloads, but
> I think the existing upgrade tests are actually not super likely to
> trigger these issues if we simply turn them around? Whereas issues
> with wire protocols or other ephemeral things can be worked around by
> turning the cluster off, missing something with the disk state means
> you can't downgrade at all.
> -Greg
> 
> >
> > 2. Consider downgrades when backporting any changes to stable releases.
> > If we are adding fields to data structures, they need to work both in the
> > upgrade case (which we're already good at) and the downgrade case.
> > Usually the types of changes that make some behavior change happen in the
> > major releases, but occasionally we make these changes in stable releases
> > too.
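
To make the two cases concrete, here is a toy model of a versioned encoding 
with a (version, compat version, payload length) header.  It is an 
illustration under stated assumptions, not Ceph's actual wire format: the 
header layout, field names, and functions are all made up.  The point is that 
an added field is downgrade-safe only if the compat version is left alone and 
the old decoder can skip bytes it does not understand.

```python
import struct

HEADER = "<BBI"  # toy header: version, compat version, payload length
HEADER_LEN = struct.calcsize(HEADER)


def encode_old(a):
    """Old release: one field, version 1."""
    payload = struct.pack("<I", a)
    return struct.pack(HEADER, 1, 1, len(payload)) + payload


def encode_new(a, b, compat=1):
    """New release: adds field b as version 2; compat=1 keeps old readers working."""
    payload = struct.pack("<II", a, b)
    return struct.pack(HEADER, 2, compat, len(payload)) + payload


def decode_old(buf):
    """Old release's decoder: understands version 1 only."""
    version, compat, length = struct.unpack_from(HEADER, buf, 0)
    if compat > 1:
        raise ValueError("encoding requires a newer decoder")
    payload = buf[HEADER_LEN:HEADER_LEN + length]
    (a,) = struct.unpack_from("<I", payload, 0)
    # Trailing bytes inside the payload (the new field) are ignored.
    return {"a": a}


def decode_new(buf):
    """New release's decoder: defaults b when reading old data."""
    version, compat, length = struct.unpack_from(HEADER, buf, 0)
    if compat > 2:
        raise ValueError("encoding requires a newer decoder")
    payload = buf[HEADER_LEN:HEADER_LEN + length]
    (a,) = struct.unpack_from("<I", payload, 0)
    b = struct.unpack_from("<I", payload, 4)[0] if version >= 2 else 0
    return {"a": a, "b": b}
```

Here decode_new(encode_old(7)) is the upgrade case and 
decode_old(encode_new(7, 9)) is the downgrade case; the latter works only 
because compat stays at 1.  Bumping it (encode_new(7, 9, compat=2)) makes 
the old decoder refuse the data, which is the kind of backport that would 
break a rollback.
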
> >
> > I can't actually think of a stable branch change that would be problematic
> > right now... hopefully that's a good sign!
> >
> > Other thoughts?
> > sage
> >
> 
>