On Thu, Sep 6, 2018 at 3:15 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 6 Sep 2018, Gregory Farnum wrote:
>> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > There is no plan or expectation (on my part at least) to support
>> > downgrades across major versions (mimic -> luminous). (IMO the
>> > complexity that would need to be introduced to allow this is not
>> > worth the investment.)
>> >
>> > However, there *is* an interest in testing (and being able to
>> > support) downgrades between minor versions. For example, if you're
>> > running 13.2.2, and start rolling out 13.2.3 but things misbehave,
>> > it would be nice to be able to roll back to 13.2.2.
>> >
>> > So.. what do we need to do to allow this?
>> >
>> > 1. Create a test suite that captures the downgrade cases. We could
>> > start with a teuthology facet for the initial version and have
>> > another facet for the target version. Teuthology can't enforce a
>> > strict ordering (i.e., always a downgrade) but it's probably just as
>> > valuable to also test the upgrade cases too. The main challenge I
>> > see here is that we are regularly fixing bugs in the stable
>> > releases; since the tests are against older releases, problems we
>> > uncover are often things that we can't "fix" since it's existing
>> > code.
>> >
>> > It will probably (?) be the case that in the end we have known
>> > issues with downgrades with specific versions.
>> >
>> > What kind of workloads should we run?
>> >
>> > Should we repurpose the p2p suite to do this? Right now it steps
>> > through every stable release in sequence. Instead, we could add N
>> > facets that upgrade (or downgrade) between versions, and make sure
>> > that N >= the total number of point releases. This would be more
>> > like an upgrade/downgrade "thrasher" in that case...
>>
>> It may just be bias in my recent thoughts, but it seems like the most
>> valuable thing to do here is to generate the encoded structures from
>> each rc and make sure the previous version of the code can read them.
>> In other words, generate a ceph-object-corpus for each existing
>> release and proposed rc (which I've been planning to start pushing on,
>> but haven't yet, so I don't even remember how we generate them!), then
>> run the encode and decode with the *old* software.
>
> Yes!! I like this because it also kills a couple birds with one stone:
> if we automate the process of generating corpus objects then we can
> also make sure we're covering all the usual upgrade cases too.
>
> The challenge is that generating corpus objects means a custom build
> and then running a sufficiently broad set of workloads to instantiate
> as many different and interesting object instances as possible. The
>
> //#define ENCODE_DUMP_PATH /tmp/something
>
> needs to be defined, everything built, and then the system run with a
> bunch of workloads. This fills /tmp/something with a bazillion object
> instances. There are then some scripts that dedup and try to pick out
> a sample with varying sizes etc.
>
> I don't have any bright ideas on how to do that easily and in an
> automated way, though.. we presumably want to do it right before
> release to make sure everything is kosher (to ensure compat with
> everything in the corpus), and also generate objects on actual
> releases (or builds of the same sha1 + the above #define) to populate
> the corpus with that release.
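
(To make that last step concrete, here is a minimal sketch of the kind of
check being described: decode every corpus object with the *older*
release's ceph-dencoder. The corpus layout and the old binary's path are
assumptions for illustration, not the actual test scripts.)

  #!/usr/bin/env python3
  """Sketch only: check that an *older* release's ceph-dencoder can still
  decode every object in a corpus produced by a newer build.  The corpus
  layout (archive/<version>/objects/<type>/<file>) is modelled on
  ceph-object-corpus; the old binary's path is a stand-in."""
  import subprocess
  import sys
  from pathlib import Path

  CORPUS = Path("ceph-object-corpus/archive")
  OLD_DENCODER = "/opt/ceph-13.2.2/bin/ceph-dencoder"  # hypothetical install of the previous point release

  # Types the old binary has never heard of can't be expected to decode.
  known_types = set(
      subprocess.run([OLD_DENCODER, "list_types"], capture_output=True,
                     text=True, check=True).stdout.split())

  failures = []
  for obj in sorted(CORPUS.glob("*/objects/*/*")):
      typename = obj.parent.name
      if typename not in known_types:
          continue
      # "type <T> import <file> decode dump_json" is the usual
      # ceph-dencoder way to round-trip a stored encoding.
      result = subprocess.run(
          [OLD_DENCODER, "type", typename, "import", str(obj),
           "decode", "dump_json"],
          capture_output=True, text=True)
      if result.returncode != 0:
          failures.append(f"{typename}/{obj.name}: {result.stderr.strip()}")

  print("\n".join(failures) if failures else "all corpus objects decoded")
  sys.exit(1 if failures else 0)

Run over each (corpus release, older dencoder) pair, this is roughly the
downgrade-compat matrix for the decode side, with no cluster involved.
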
So...why do we actually need to run a cluster to generate these? Is it
in fact infeasible to automatically generate the data? Was it just too
annoying at the time ceph-object-corpus was set up?

I haven't examined it in detail, but we already include simple ones in
the generate_test_instances stuff for everything in
src/tools/ceph-dencoder/types.h, though I can certainly imagine that
these might not be good enough since a lot of them are just
default-initialized. (Do we catch examples from anything that *isn't*
specified in that file?)

Something like running a cluster through the final set of teuthology
suites set up this way might be the best solution, but I wonder if this
was investigated and decided on, or just that it worked once upon a
time. ;)
-Greg
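
(For comparison, the "automatically generate the data" route already
half-exists in ceph-dencoder's built-in test instances, i.e. the
generate_test_instances stuff mentioned above. A rough sketch of
scripting them into a corpus-shaped tree follows; the output layout is an
assumption, the subcommands are the documented list_types / count_tests /
select_test / encode / export ones, and the 1-based select_test indexing
mirrors the in-tree check scripts.)

  #!/usr/bin/env python3
  """Sketch only: emit corpus-style objects straight from ceph-dencoder's
  built-in generate_test_instances() data, no cluster involved.  The
  output layout (<outdir>/objects/<type>/<n>) mirrors ceph-object-corpus
  but is an assumption."""
  import subprocess
  from pathlib import Path

  DENCODER = "ceph-dencoder"                     # the new build under test
  OUTDIR = Path("corpus-from-dencoder/objects")  # hypothetical destination

  types = subprocess.run([DENCODER, "list_types"], capture_output=True,
                         text=True, check=True).stdout.split()

  for typename in types:
      count = subprocess.run([DENCODER, "type", typename, "count_tests"],
                             capture_output=True, text=True)
      if count.returncode != 0:
          continue                   # no built-in instances for this type
      dest = OUTDIR / typename
      dest.mkdir(parents=True, exist_ok=True)
      # select_test picks the i-th generate_test_instances() object;
      # encode + export writes its encoding to a file.
      for i in range(1, int(count.stdout.strip()) + 1):
          subprocess.run([DENCODER, "type", typename, "select_test",
                          str(i), "encode", "export", str(dest / str(i))],
                         check=False)

This only covers whatever generate_test_instances() produces, which is
exactly the default-initialized concern above; a cluster run with
ENCODE_DUMP_PATH defined would still be needed for richer samples.
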