Re: CephFS purge queue test & backport failure

On Tue, Oct 9, 2018 at 4:47 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Oct 10, 2018 at 2:47 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > * Zheng, how did you discover/notice that the original PR was broken?
> > Did some tests start failing that previously passed?
> >
>
> I found it while upgrading my test cluster.

So you've just got one running that had a purge queue which failed to
decode? (i.e., if we'd run this through the lab or some other
hypothetical Long-Running Cluster, it would have been picked up.)

On Wed, Oct 10, 2018 at 5:27 AM John Spray <jspray@xxxxxxxxxx> wrote:
> On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > And some observations and suggestions about testing:
> > * Apparently none of our upgrade tests involve running new-code MDS on
> > a purge queue written by the old code.
> >   * We can and should fix that narrow issue, but it makes me think the
> > upgrade tests are in general not as robust as one wants. Can the FS
> > team do an audit?
>
> This is definitely an issue with the upgrade testing, not just with
> CephFS but also with ceph-mgr.  It would be nice to have some kind of
> system of hooks, where each component could have a list of actions
> that leave some state behind (e.g. create a user in the dashboard,
> scrape SMART data from disks), and a list of actions that will re-read
> that state (e.g. load the dashboard's list of users).

Hmm, how would these hooks differ from just "write a task that writes
data and then a task that reads it, which are invoked on opposite ends
of an upgrade"?
-Greg
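
A minimal sketch of what the hook idea John describes might look like, assuming a
hypothetical registry in the qa framework. UpgradeStateHook, register_hook,
run_pre_upgrade, and run_post_upgrade are made-up names for illustration only,
not an existing teuthology interface; the purge-queue hook bodies are placeholders.

    # Hypothetical sketch of "upgrade state hooks"; not a real teuthology API.
    from dataclasses import dataclass
    from typing import Callable, Dict, List


    @dataclass
    class UpgradeStateHook:
        """A pair of actions run on opposite sides of an upgrade."""
        name: str
        write_state: Callable[[], None]   # run while the old-version cluster is up
        verify_state: Callable[[], None]  # run after upgrading to the new version


    _HOOKS: Dict[str, List[UpgradeStateHook]] = {}


    def register_hook(component: str, hook: UpgradeStateHook) -> None:
        """Each component (cephfs, mgr/dashboard, ...) registers its own hooks."""
        _HOOKS.setdefault(component, []).append(hook)


    def run_pre_upgrade() -> None:
        """Invoked by the upgrade suite before the binaries are swapped."""
        for hooks in _HOOKS.values():
            for hook in hooks:
                hook.write_state()


    def run_post_upgrade() -> None:
        """Invoked after the upgrade; a failure here means new code
        could not read state left behind by the old code."""
        for hooks in _HOOKS.values():
            for hook in hooks:
                hook.verify_state()


    # Illustrative registration for the purge-queue case from this thread:
    # leave a non-empty purge queue behind on the old MDS, then check the
    # upgraded MDS can decode and drain it.
    register_hook("cephfs", UpgradeStateHook(
        name="purge_queue_decode",
        write_state=lambda: print("old MDS: delete files so the purge queue is non-empty"),
        verify_state=lambda: print("new MDS: assert the purge queue decodes and drains"),
    ))

Greg's counter-point maps onto the same shape: the write_state half is "a task that
writes data" run before the upgrade step, and the verify_state half is "a task that
reads it" run after, whether or not a dedicated hook registry exists.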


