On Tue, Oct 9, 2018 at 4:47 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Oct 10, 2018 at 2:47 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > * Zheng, how did you discover/notice that the original PR was broken?
> > Did some tests start failing that previously passed?
> >
>
> I found it during upgrading my test cluster

So you've just got one running that had a purge queue which failed to
decode? (ie, if we'd run this through the lab or some other
hypothetical Long-Running Cluster it should have been picked up.)

On Wed, Oct 10, 2018 at 5:27 AM John Spray <jspray@xxxxxxxxxx> wrote:
> On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > And some observations and suggestions about testing:
> > * Apparently none of our upgrade tests involve running new-code MDS on
> > a purge queue written by the old code.
> > * We can and should fix that narrow issue, but it makes me think the
> > upgrade tests are in general not as robust as one wants. Can the FS
> > team do an audit?
>
> This is definitely an issue with the upgrade testing, not just with
> CephFS but also with ceph-mgr. It would be nice to have some kind of
> system of hooks, where each component could have a list of actions
> that leave some state behind (e.g. create a user in the dashboard,
> scrape SMART data from disks), and a list of actions that will re-read
> that state (e.g. load the dashboard's list of users).

Hmm, how would these hooks differ from just "write a task that writes
data and then a task that reads it, which are invoked on opposite ends
of an upgrade"?
-Greg
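
P.S. For concreteness, here's a rough Python sketch of what such a
"leave state / re-read state" hook pair could look like. This is purely
hypothetical -- none of these names (register_upgrade_hooks,
run_upgrade_test, the dashboard hooks) exist in teuthology today; it's
just meant to illustrate the shape of the idea:

    # Hypothetical sketch, not an existing teuthology interface: each
    # component registers a pair of hooks -- one run before the upgrade
    # that leaves state behind, one run after the upgrade that re-reads
    # and validates that state.

    from typing import Callable, Dict, Tuple

    # Registry of (write_state, verify_state) hook pairs, keyed by component.
    UPGRADE_HOOKS: Dict[str, Tuple[Callable[[], dict],
                                   Callable[[dict], None]]] = {}

    def register_upgrade_hooks(component: str,
                               write_state: Callable[[], dict],
                               verify_state: Callable[[dict], None]) -> None:
        """Register hooks for one component (e.g. cephfs, mgr/dashboard)."""
        UPGRADE_HOOKS[component] = (write_state, verify_state)

    # Example hooks for a hypothetical dashboard component.
    def dashboard_write_state() -> dict:
        # A real hook would create a user through the dashboard API here.
        return {"users": ["upgrade-test-user"]}

    def dashboard_verify_state(expected: dict) -> None:
        # A real hook would list users through the dashboard API here.
        actual = {"users": ["upgrade-test-user"]}
        assert actual == expected, \
            "dashboard state lost across upgrade: %r" % (actual,)

    register_upgrade_hooks("mgr/dashboard",
                           dashboard_write_state,
                           dashboard_verify_state)

    def run_upgrade_test(do_upgrade: Callable[[], None]) -> None:
        """Run every write hook, perform the upgrade, then every verify hook."""
        saved: Dict[str, dict] = {}
        for name, (write_state, _) in UPGRADE_HOOKS.items():
            saved[name] = write_state()
        do_upgrade()  # old-code daemons replaced by new-code daemons
        for name, (_, verify_state) in UPGRADE_HOOKS.items():
            verify_state(saved[name])

    if __name__ == "__main__":
        run_upgrade_test(do_upgrade=lambda: None)  # placeholder upgrade step

Which, as noted above, is really just the "task that writes data /
task that reads it on opposite ends of an upgrade" pattern with a
registry bolted on.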