On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote: > > Hey guys, I'd like to understand how our tests failed in the 13.2.2 > CephFS purge queue issue. > > In terms of the actual code merge flow, it looks like: > > * The original patch passed review and test and merged to master > August 6 in https://github.com/ceph/ceph/pull/22850 > * but was wrong, and reverted a day later. > * Fixed PR merged August 16 in https://github.com/ceph/ceph/pull/23467 > * A backport ticket was created August 21: http://tracker.ceph.com/issues/26989 > * A backport PR was submitted August 28 based on the original broken > code at https://github.com/ceph/ceph/pull/23818 > * It passed review and test and merged on September 12. > > Obviously we can wish review had been more successful, but we can't > rely on that. So I've got some questions: > > * Zheng, how did you discover/notice that the original PR was broken? > Did some tests start failing that previously passed? > > * Nathan, are there backport tools that make doing the right thing or > wrong thing easier in situations like this, where multiple commits > will claim to Fix the same bug and some of them are wrong? > > And some observations and suggestions about testing: > * Apparently none of our upgrade tests involve running new-code MDS on > a purge queue written by the old code. > * We can and should fix that narrow issue, but it makes me think the > upgrade tests are in general not as robust as one wants. Can the FS > team do an audit? This is definitely an issue with the upgrade testing, not just with CephFS but also with ceph-mgr. It would be nice to have some kind of system of hooks, where each component could have a list of actions that leave some state behind (e.g. create a user in the dashboard, scrape SMART data from disks), and a list of actions that will re-read that state (e.g. load the dashboard's list of users). John > > * Looking at the issue (new data members were encoded at the front, > and without bumping the struct_v), I think this should have been > detected if we ran it through the ceph-object-corpus dencoder tests > and had the right bits and pieces present there. These tests are part > of "make check" but haven't been updated in quite a while — I think > this bumps the priority of doing that work up since we now have a > specific failure it would have detected early. > > Thanks! Any other thoughts? > -Greg