On Thu, Sep 6, 2018 at 3:15 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 6 Sep 2018, Gregory Farnum wrote:
>> On Tue, Sep 4, 2018 at 12:31 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > There is no plan or expectation (on my part at least) to support
>> > downgrades across major versions (mimic -> luminous). (IMO the
>> > complexity that would need to be introduced to allow this is not
>> > worth the investment.)
>> >
>> > However, there *is* an interest in testing (and being able to
>> > support) downgrades between minor versions. For example, if you're
>> > running 13.2.2, and start rolling out 13.2.3 but things misbehave,
>> > it would be nice to be able to roll back to 13.2.2.
>> >
>> > So.. what do we need to do to allow this?
>> >
>> > 1. Create a test suite that captures the downgrade cases. We could
>> > start with a teuthology facet for the initial version and have
>> > another facet for the target version. Teuthology can't enforce a
>> > strict ordering (i.e., always a downgrade) but it's probably just as
>> > valuable to also test the upgrade cases too. The main challenge I
>> > see here is that we are regularly fixing bugs in the stable
>> > releases; since the tests are against older releases, problems we
>> > uncover are often things that we can't "fix" since it's existing
>> > code.
>> >
>> > It will probably (?) be the case that in the end we have known
>> > issues with downgrades with specific versions.
>> >
>> > What kind of workloads should we run?
>> >
>> > Should we repurpose the p2p suite to do this? Right now it steps
>> > through every stable release in sequence. Instead, we could add N
>> > facets that upgrade (or downgrade) between versions, and make sure
>> > that N >= the total number of point releases. This would be more
>> > like an upgrade/downgrade "thrasher" in that case...
>>
>> It may just be bias in my recent thoughts, but it seems like the most
>> valuable thing to do here is to generate the encoded structures from
>> each rc and make sure the previous version of the code can read them.
>> In other words, generate a ceph-object-corpus for each existing
>> release and proposed rc (which I've been planning to start pushing on,
>> but haven't yet, so I don't even remember how we generate them!), then
>> run the encode and decode with the *old* software.
>
> Yes!! I like this because it also kills a couple birds with one stone:
> if we automate the process of generating corpus objects then we can
> also make sure we're covering all the usual upgrade cases too.
>
> The challenge is that generating corpus objects means a custom build
> and then running a sufficiently broad set of workloads to instantiate
> as many different and interesting object instances as possible. The
>
> //#define ENCODE_DUMP_PATH /tmp/something
>
> needs to be defined, everything built, and then the system run with a
> bunch of workloads. This fills /tmp/something with a bazillion object
> instances. There are then some scripts that dedup and try to pick out
> a sample with varying sizes etc.
>
> I don't have any bright ideas on how to do that easily and in an
> automated way, though.. we presumably want to do it right before
> release to make sure everything is kosher (to ensure compat with
> everything in the corpus), and also generate objects on actual
> releases (or builds of the same sha1 + the above #define) to populate
> the corpus with that release.
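
(To make that last step concrete, here is a minimal sketch of the kind of
check being described: decode every corpus object with the *older*
release's ceph-dencoder. The corpus layout and the old binary's path are
assumptions for illustration, not the actual test scripts.)

  #!/usr/bin/env python3
  """Sketch only: check that an *older* release's ceph-dencoder can still
  decode every object in a corpus produced by a newer build.  The corpus
  layout (archive/<version>/objects/<type>/<file>) is modelled on
  ceph-object-corpus; the old binary's path is a stand-in."""
  import subprocess
  import sys
  from pathlib import Path

  CORPUS = Path("ceph-object-corpus/archive")
  OLD_DENCODER = "/opt/ceph-13.2.2/bin/ceph-dencoder"  # hypothetical install of the previous point release

  # Types the old binary has never heard of can't be expected to decode.
  known_types = set(
      subprocess.run([OLD_DENCODER, "list_types"], capture_output=True,
                     text=True, check=True).stdout.split())

  failures = []
  for obj in sorted(CORPUS.glob("*/objects/*/*")):
      typename = obj.parent.name
      if typename not in known_types:
          continue
      # "type <T> import <file> decode dump_json" is the usual
      # ceph-dencoder way to round-trip a stored encoding.
      result = subprocess.run(
          [OLD_DENCODER, "type", typename, "import", str(obj),
           "decode", "dump_json"],
          capture_output=True, text=True)
      if result.returncode != 0:
          failures.append(f"{typename}/{obj.name}: {result.stderr.strip()}")

  print("\n".join(failures) if failures else "all corpus objects decoded")
  sys.exit(1 if failures else 0)

Run over each (corpus release, older dencoder) pair, this is roughly the
downgrade-compat matrix for the decode side, with no cluster involved.
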
So...why do we actually need to run a cluster to generate these? Is it
in fact infeasible to automatically generate the data? Was it just too
annoying at the time ceph-object-corpus was set up?

I haven't examined it in detail, but we already include simple ones in
the generate_test_instances stuff for everything in
src/tools/ceph-dencoder/types.h, though I can certainly imagine that
these might not be good enough since a lot of them are just
default-initialized. (Do we catch examples from anything that *isn't*
specified in that file?)

Something like running a cluster through the final set of teuthology
suites set up this way might be the best solution, but I wonder if this
was investigated and decided on, or just that it worked once upon a
time. ;)
-Greg
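
(For comparison, the "automatically generate the data" route already
half-exists in ceph-dencoder's built-in test instances, i.e. the
generate_test_instances stuff mentioned above. A rough sketch of
scripting them into a corpus-shaped tree follows; the output layout is an
assumption, the subcommands are the documented list_types / count_tests /
select_test / encode / export ones, and the 1-based select_test indexing
mirrors the in-tree check scripts.)

  #!/usr/bin/env python3
  """Sketch only: emit corpus-style objects straight from ceph-dencoder's
  built-in generate_test_instances() data, no cluster involved.  The
  output layout (<outdir>/objects/<type>/<n>) mirrors ceph-object-corpus
  but is an assumption."""
  import subprocess
  from pathlib import Path

  DENCODER = "ceph-dencoder"                     # the new build under test
  OUTDIR = Path("corpus-from-dencoder/objects")  # hypothetical destination

  types = subprocess.run([DENCODER, "list_types"], capture_output=True,
                         text=True, check=True).stdout.split()

  for typename in types:
      count = subprocess.run([DENCODER, "type", typename, "count_tests"],
                             capture_output=True, text=True)
      if count.returncode != 0:
          continue                   # no built-in instances for this type
      dest = OUTDIR / typename
      dest.mkdir(parents=True, exist_ok=True)
      # select_test picks the i-th generate_test_instances() object;
      # encode + export writes its encoding to a file.
      for i in range(1, int(count.stdout.strip()) + 1):
          subprocess.run([DENCODER, "type", typename, "select_test",
                          str(i), "encode", "export", str(dest / str(i))],
                         check=False)

This only covers whatever generate_test_instances() produces, which is
exactly the default-initialized concern above; a cluster run with
ENCODE_DUMP_PATH defined would still be needed for richer samples.
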