On Thu, Oct 11, 2018 at 3:03 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Tue, Oct 9, 2018 at 4:47 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > On Wed, Oct 10, 2018 at 2:47 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > * Zheng, how did you discover/notice that the original PR was broken?
> > > Did some tests start failing that previously passed?
> > >
> >
> > I found it while upgrading my test cluster
>
> So you've just got one running that had a purge queue which failed to
> decode? (i.e., if we'd run this through the lab or some other
> hypothetical Long-Running Cluster it should have been picked up.)
>

yes

> On Wed, Oct 10, 2018 at 5:27 AM John Spray <jspray@xxxxxxxxxx> wrote:
> > On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > And some observations and suggestions about testing:
> > > * Apparently none of our upgrade tests involve running new-code MDS on
> > > a purge queue written by the old code.
> > > * We can and should fix that narrow issue, but it makes me think the
> > > upgrade tests are in general not as robust as one wants. Can the FS
> > > team do an audit?
> >
> > This is definitely an issue with the upgrade testing, not just with
> > CephFS but also with ceph-mgr. It would be nice to have some kind of
> > system of hooks, where each component could have a list of actions
> > that leave some state behind (e.g. create a user in the dashboard,
> > scrape SMART data from disks), and a list of actions that will re-read
> > that state (e.g. load the dashboard's list of users).
>
> Hmm, how would these hooks differ from just "write a task that writes
> data and then a task that reads it, which are invoked on opposite ends
> of an upgrade"?
> -Greg
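
For concreteness, here is a rough sketch of what such a write-then-read pair could look like as one standalone script run on both sides of an upgrade. The file path, marker contents, and health check below are hypothetical illustrations, not an existing teuthology task or suite; it assumes a CephFS mount at /mnt/cephfs and the ceph CLI available on the test node.

    # upgrade_canary.py -- hypothetical sketch, not actual teuthology code.
    # "pre" runs on the old release and leaves durable state behind;
    # "post" runs on the new binaries and re-reads that state, so a
    # decode/compatibility break surfaces as a test failure.
    import json
    import subprocess
    import sys

    STATE_FILE = "/mnt/cephfs/upgrade-canary.json"  # assumed CephFS mount

    def pre_upgrade():
        """Run against the old release: write state for the new code to read."""
        with open(STATE_FILE, "w") as f:
            json.dump({"marker": "upgrade-canary", "files": 100}, f)
        # A fuller task would also leave work behind in structures like the
        # purge queue (e.g. delete snapshotted or hard-linked files) so the
        # upgraded MDS actually has to decode old-format entries.

    def post_upgrade():
        """Run against the new release: verify the old state is still readable."""
        with open(STATE_FILE) as f:
            assert json.load(f)["marker"] == "upgrade-canary"
        # Loose health check: cluster status should still report an fsmap,
        # i.e. the upgraded MDS came up and read the old on-disk structures.
        status = json.loads(subprocess.check_output(
            ["ceph", "status", "--format=json"]))
        assert "fsmap" in status

    if __name__ == "__main__":
        {"pre": pre_upgrade, "post": post_upgrade}[sys.argv[1]]()

In this sketch "python upgrade_canary.py pre" would be invoked before the upgrade and "python upgrade_canary.py post" afterwards, which is essentially the "task that writes data and a task that reads it, invoked on opposite ends of an upgrade" described above; the proposed hook system would just let each component register its own pre/post pair.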