Re: CephFS purge queue test & backport failure

John Spray <jspray@xxxxxxxxxx> · Wed, 10 Oct 2018 13:27:24 +0100

On Tue, Oct 9, 2018 at 7:45 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> Hey guys, I'd like to understand how our tests failed in the 13.2.2
> CephFS purge queue issue.
>
> In terms of the actual code merge flow, it looks like:
>
> * The original patch passed review and test and merged to master
> August 6 in https://github.com/ceph/ceph/pull/22850
>   * but was wrong, and reverted a day later.
> * Fixed PR merged August 16 in https://github.com/ceph/ceph/pull/23467
> * A backport ticket was created August 21: http://tracker.ceph.com/issues/26989
> * A backport PR was submitted August 28 based on the original broken
> code at https://github.com/ceph/ceph/pull/23818
>   * It passed review and test and merged on September 12.
>
> Obviously we can wish review had been more successful, but we can't
> rely on that. So I've got some questions:
>
> * Zheng, how did you discover/notice that the original PR was broken?
> Did some tests start failing that previously passed?
>
> * Nathan, are there backport tools that make doing the right thing or
> wrong thing easier in situations like this, where multiple commits
> will claim to Fix the same bug and some of them are wrong?
>
> And some observations and suggestions about testing:
> * Apparently none of our upgrade tests involve running new-code MDS on
> a purge queue written by the old code.
>   * We can and should fix that narrow issue, but it makes me think the
> upgrade tests are in general not as robust as one wants. Can the FS
> team do an audit?

This is definitely an issue with the upgrade testing, not just with
CephFS but also with ceph-mgr.  It would be nice to have some kind of
system of hooks, where each component could have a list of actions
that leave some state behind (e.g. create a user in the dashboard,
scrape SMART data from disks), and a list of actions that will re-read
that state (e.g. load the dashboard's list of users).

John

>
> * Looking at the issue (new data members were encoded at the front,
> and without bumping the struct_v), I think this should have been
> detected if we ran it through the ceph-object-corpus dencoder tests
> and had the right bits and pieces present there. These tests are part
> of "make check" but haven't been updated in quite a while — I think
> this bumps the priority of doing that work up since we now have a
> specific failure it would have detected early.
>
> Thanks! Any other thoughts?
> -Greg