Re: testing downgrades

On Thu, Sep 6, 2018 at 3:34 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 6 Sep 2018, Gregory Farnum wrote:
>> >> It may just be bias in my recent thoughts, but it seems like the most
>> >> valuable thing to do here is to generate the encoded structures from
>> >> each rc and make sure the previous version of the code can read them.
>> >> In other words, generate a ceph-object-corpus for each existing
>> >> release and proposed rc (which I've been planning to start pushing on,
>> >> but haven't yet, so I don't even remember how we generate them!), then
>> >> run the encode and decode with the *old* software.
>> >
>> > Yes!!  I like this because it also kills a couple birds with one stone: if
>> > we automate the process of generating corpus objects then we can also make
>> > sure we're covering all the usual upgrade cases too.
>> >
>> > The challenge is that generating corpus objects means a custom
>> > build and then running a sufficiently broad set of workloads to
>> > instantiate as many different and interesting object instances as
>> > possible.  The
>> >
>> > //#define ENCODE_DUMP_PATH /tmp/something
>> >
>> > needs to be defined, everything built, and then the system run with a
> bunch of workloads.  This fills /tmp/something with a bazillion object
>> > instances.  There are then some scripts that dedup and try to pick out
>> > a sample with varying sizes etc.
>> >
>> > I don't have any bright ideas on how to do that easily and in an automated
> way, though... we presumably want to do it right before release to make
> sure everything is kosher (to ensure compat with everything in the
>> > corpus), and also generate objects on actual releases (or builds of the
>> > same sha1 + the above #define) to populate the corpus with that release.
>>
>> So...why do we actually need to run a cluster to generate these? Is it
>> in fact infeasible to automatically generate the data? Was it just too
>> annoying at the time ceph-object-corpus was set up? I haven't examined
>> it in detail but we already include simple ones in the
>> generate_test_instances stuff for everything in
>> src/tools/ceph-dencoder/types.h, though I can certainly imagine that
>> these might not be good enough since a lot of them are just
>> default-initialized. (Do we catch examples from anything that *isn't*
>> specified in that file?)
>>
>> Something like running a cluster through the final set of teuthology
>> suites set up this way might be the best solution, but I wonder if
>> this was investigated and decided on or just that it worked once upon
>> a time. ;)
>
> The problem is that those generate_test_instances are (1) sparse and
> minimal, and (2) would require huge developer investment to fill in with
> "realistic" instances, and (3) even though wouldn't necessarily be
> representative of what happens in real life.  The ENCODE_DUMP_PATH thing
> collects actual objects from a real cluster with a real workload so that
> you can get a "real" sampling.
>
> The hard part is mostly generating a workload with good coverage (rados
> API tests, cls tests, etc. are a good start for RADOS; for cephfs we need
> to run multi-mds to cover all of the subtree migration related types; for
> rgw we'll want to get coverage for the multisite stuff, etc.).
>
> That's a bit of work, but it's still much less work than hand-crafting
> object instances that may or may not be "real".

Yeah, that makes sense. But turning that around, why not just grab and
sample from the OSD and monitor disk stores after our existing
teuthology runs happen? Does the ENCODE_DUMP_PATH stuff also include
wire protocol messages that don't get put on disk?
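
To make the sampling idea concrete, here's a rough sketch of the
dedup-and-pick-a-spread-of-sizes pass described above, which would look
much the same whether it walks an ENCODE_DUMP_PATH dump or blobs pulled
out of OSD/mon stores after a teuthology run. This is not the existing
corpus tooling; the one-directory-per-type, one-file-per-encode-call
layout and the per_type count are assumptions:

#!/usr/bin/env python3
# Sketch only -- not the actual ceph-object-corpus scripts.  Assumes a dump
# directory with one subdirectory per type and one file per encoded instance.
import hashlib
from collections import defaultdict
from pathlib import Path

def sample_dump(dump_dir, per_type=25):
    by_type = defaultdict(dict)              # type name -> {sha1: path}
    for f in Path(dump_dir).rglob("*"):
        if not f.is_file():
            continue
        digest = hashlib.sha1(f.read_bytes()).hexdigest()
        by_type[f.parent.name].setdefault(digest, f)   # dedup identical encodings
    for ctype, objs in sorted(by_type.items()):
        # keep a spread of sizes rather than the first N instances we saw
        keep = sorted(objs.values(), key=lambda p: p.stat().st_size)
        step = max(1, len(keep) // per_type)
        yield ctype, keep[::step][:per_type]

Something along those lines could run as a final task so that every suite
pass leaves a candidate corpus behind.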


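And for the "decode with the *old* software" half, the check itself can stay
dumb: point the previous release's ceph-dencoder at whatever the sampler
kept. A minimal sketch, assuming the usual
archive/<version>/objects/<type>/<object> corpus layout (real tooling would
also want to skip types the old binary doesn't know about):

#!/usr/bin/env python3
# Sketch only: run an *older* release's ceph-dencoder against corpus objects
# dumped from a newer rc.  The archive/<version>/objects/<type>/<file>
# layout is an assumption.
import subprocess
import sys
from pathlib import Path

def check_corpus(corpus_dir, dencoder="ceph-dencoder"):
    failures = 0
    for obj in sorted(Path(corpus_dir).glob("objects/*/*")):
        ctype = obj.parent.name                      # e.g. pg_info_t
        res = subprocess.run(
            [dencoder, "type", ctype, "import", str(obj), "decode", "dump_json"],
            capture_output=True, text=True)
        if res.returncode != 0:
            failures += 1
            print("FAIL %s %s: %s" % (ctype, obj.name, res.stderr.strip()),
                  file=sys.stderr)
    return failures

if __name__ == "__main__":
    sys.exit(1 if check_corpus(sys.argv[1]) else 0)

Run that with the downgrade target's binary and any failure is exactly the
kind of incompatibility we want to catch before the release goes out.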
