Re: testing downgrades

Sage Weil <sweil@xxxxxxxxxx> · Fri, 7 Sep 2018 02:44:35 +0000 (UTC)

On Thu, 6 Sep 2018, Gregory Farnum wrote:
> On Thu, Sep 6, 2018 at 3:34 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Thu, 6 Sep 2018, Gregory Farnum wrote:
> >> >> It may just be bias in my recent thoughts, but it seems like the most
> >> >> valuable thing to do here is to generate the encoded structures from
> >> >> each rc and make sure the previous version of the code can read them.
> >> >> In other words, generate a ceph-object-corpus for each existing
> >> >> release and proposed rc (which I've been planning to start pushing on,
> >> >> but haven't yet, so I don't even remember how we generate them!), then
> >> >> run the encode and decode with the *old* software.
> >> >
> >> > Yes!!  I like this because it also kills a couple birds with one stone: if
> >> > we automate the process of generating corpus objects then we can also make
> >> > sure we're covering all the usual upgrade cases too.
> >> >
> >> > The challenge is that generating corpus objects means a custom
> >> > build and then running a sufficiently broad set of workloads to
> >> > instantiate as many different and interesting object instances as
> >> > possible.  The
> >> >
> >> > //#define ENCODE_DUMP_PATH /tmp/something
> >> >
> >> > needs to be defined, everything built, and then the system run with a
> >> > bunch of workloads.  This fills /tmp/something with a bazillion object
> >> > instances.  There are then some scripts that dedup and try to pick out
> >> > a sample with varying sizes etc.
> >> >
> >> > I don't have any bright ideas on how to do that easily and in an automated
> >> > way, though.. we presumably want to do it right before release to make
> >> > sure everything is kosher (to ensure compat with everything in the
> >> > corpus), and also generate objects on actual releases (or builds of the
> >> > same sha1 + the above #define) to populate the corpus with that release.
> >>
> >> So...why do we actually need to run a cluster to generate these? Is it
> >> in fact infeasible to automatically generate the data? Was it just too
> >> annoying at the time ceph-object-corpus was set up? I haven't examined
> >> it in detail but we already include simple ones in the
> >> generate_test_instances stuff for everything in
> >> src/tools/ceph-dencoder/types.h, though I can certainly imagine that
> >> these might not be good enough since a lot of them are just
> >> default-initialized. (Do we catch examples from anything that *isn't*
> >> specified in that file?)
> >>
> >> Something like running a cluster through the final set of teuthology
> >> suites set up this way might be the best solution, but I wonder if
> >> this was investigated and decided on or just that it worked once upon
> >> a time. ;)
> >
> > The problem is that those generate_test_instances are (1) sparse and
> > minimal, and (2) would require huge developer investment to fill in with
> > "realistic" instances, and (3) even then wouldn't necessarily be
> > representative of what happens in real life.  The ENCODE_DUMP_PATH thing
> > collects actual objects from a real cluster with a real workload so that
> > you can get a "real" sampling.
> >
> > The hard part is mostly generating a workload with good coverage (rados
> > API tests, cls tests, etc are a good start for RADOS; for cephfs we need
> > to run multi-mds to cover all of the subtree migration related types; for rgw
> > we'll want to get coverage for the multisite stuff, etc etc.).
> >
> > That's a bit of work, but it's still much less work than hand-crafting
> > object instances that may or may not be "real".
> 
> Yeah, that makes sense. But turning that around, why not just grab and
> sample from the OSD and monitor disk stores after our existing
> teuthology runs happen? Does the ENCODE_DUMP_PATH stuff also include
> wire protocol messages that don't get put on disk?

It includes everything that passes through the ENCODE_{START,FINISH} 
macros, which is pretty much every object with an encode/decode method 
defined.
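
To make the mechanism concrete, here is a minimal sketch of the idea, not
the actual Ceph macro code.  It assumes the hook simply writes each encoded
blob into a per-type directory under the dump path.  The helper name
dump_encoded, its parameters, and the use of std::hash for the file name are
all hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not a Ceph API): write one encoded blob to
// <dump_dir>/<type_name>/<hash-of-bytes> and return the path written.
// Identical blobs of the same type map to the same file name, so repeats
// overwrite each other instead of piling up.
inline fs::path dump_encoded(const fs::path& dump_dir,
                             const std::string& type_name,
                             const std::string& bytes) {
  fs::path dir = dump_dir / type_name;
  fs::create_directories(dir);
  // std::hash is only a stand-in here; it is not stable across builds,
  // so a real tool would want a proper content digest.
  fs::path out = dir / std::to_string(std::hash<std::string>{}(bytes));
  std::ofstream f(out, std::ios::binary | std::ios::trunc);
  f.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
  return out;
}
```

Even in this toy form every encode costs a directory check and a file write,
which gives a feel for why doing this on every encode is expensive.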

One could write a tool to pull data structures out of, say, bluestore, but 
that would only cover the dozen or so bluestore-related types.  
ceph-dencoder currently recognizes ~425.  About 220 are captured by the 
most recent version in ceph-object-corpus (kraken :( ).

Unfortunately the encode dump stuff is super inefficient.. I don't think 
it's something we can easily build in to our real builds.  Well... maybe 
we could build it into the notcmalloc (debug) builds, actually...

sage