On Thu, Oct 11, 2018 at 3:03 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Tue, Oct 9, 2018 at 4:47 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > On Wed, Oct 10, 2018 at 2:47 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > * Zheng, how did you discover/notice that the original PR was broken?
> > > Did some tests start failing that previously passed?
> > >
> >
> > I found it while upgrading my test cluster
>
> So you've just got one running that had a purge queue which failed to
> decode? (i.e., if we'd run this through the lab or some other
> hypothetical Long-Running Cluster it should have been picked up.)
>

yes

> On Wed, Oct 10, 2018 at 5:27 AM John Spray <jspray@xxxxxxxxxx> wrote:
> > On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > And some observations and suggestions about testing:
> > > * Apparently none of our upgrade tests involve running new-code MDS on
> > > a purge queue written by the old code.
> > > * We can and should fix that narrow issue, but it makes me think the
> > > upgrade tests are in general not as robust as one wants. Can the FS
> > > team do an audit?
> >
> > This is definitely an issue with the upgrade testing, not just with
> > CephFS but also with ceph-mgr. It would be nice to have some kind of
> > system of hooks, where each component could have a list of actions
> > that leave some state behind (e.g. create a user in the dashboard,
> > scrape SMART data from disks), and a list of actions that will re-read
> > that state (e.g. load the dashboard's list of users).
>
> Hmm, how would these hooks differ from just "write a task that writes
> data and then a task that reads it, which are invoked on opposite ends
> of an upgrade"?
> -Greg
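
For concreteness, here is a rough sketch of what such a write-then-read pair could look like as one standalone script run on both sides of an upgrade. The file path, marker contents, and health check below are hypothetical illustrations, not an existing teuthology task or suite; it assumes a CephFS mount at /mnt/cephfs and the ceph CLI available on the test node.

    # upgrade_canary.py -- hypothetical sketch, not actual teuthology code.
    # "pre" runs on the old release and leaves durable state behind;
    # "post" runs on the new binaries and re-reads that state, so a
    # decode/compatibility break surfaces as a test failure.
    import json
    import subprocess
    import sys

    STATE_FILE = "/mnt/cephfs/upgrade-canary.json"  # assumed CephFS mount

    def pre_upgrade():
        """Run against the old release: write state for the new code to read."""
        with open(STATE_FILE, "w") as f:
            json.dump({"marker": "upgrade-canary", "files": 100}, f)
        # A fuller task would also leave work behind in structures like the
        # purge queue (e.g. delete snapshotted or hard-linked files) so the
        # upgraded MDS actually has to decode old-format entries.

    def post_upgrade():
        """Run against the new release: verify the old state is still readable."""
        with open(STATE_FILE) as f:
            assert json.load(f)["marker"] == "upgrade-canary"
        # Loose health check: cluster status should still report an fsmap,
        # i.e. the upgraded MDS came up and read the old on-disk structures.
        status = json.loads(subprocess.check_output(
            ["ceph", "status", "--format=json"]))
        assert "fsmap" in status

    if __name__ == "__main__":
        {"pre": pre_upgrade, "post": post_upgrade}[sys.argv[1]]()

In this sketch "python upgrade_canary.py pre" would be invoked before the upgrade and "python upgrade_canary.py post" afterwards, which is essentially the "task that writes data and a task that reads it, invoked on opposite ends of an upgrade" described above; the proposed hook system would just let each component register its own pre/post pair.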