Re: CephFS purge queue test & backport failure

"Yan, Zheng" <ukernel@xxxxxxxxx> · Wed, 10 Oct 2018 07:47:35 +0800



On Wed, Oct 10, 2018 at 2:47 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> Hey guys, I'd like to understand how our tests failed in the 13.2.2
> CephFS purge queue issue.
>
> In terms of the actual code merge flow, it looks like:
>
> * The original patch passed review and test and merged to master
> August 6 in https://github.com/ceph/ceph/pull/22850
>   * but was wrong, and reverted a day later.
> * Fixed PR merged August 16 in https://github.com/ceph/ceph/pull/23467
> * A backport ticket was created August 21: http://tracker.ceph.com/issues/26989
> * A backport PR was submitted August 28 based on the original broken
> code at https://github.com/ceph/ceph/pull/23818
>   * It passed review and test and merged on September 12.
>
> Obviously we can wish review had been more successful, but we can't
> rely on that. So I've got some questions:
>
> * Zheng, how did you discover/notice that the original PR was broken?
> Did some tests start failing that previously passed?
>

I found it during upgrading my test cluster


> * Nathan, are there backport tools that make doing the right thing or
> wrong thing easier in situations like this, where multiple commits
> will claim to Fix the same bug and some of them are wrong?
>
> And some observations and suggestions about testing:
> * Apparently none of our upgrade tests involve running new-code MDS on
> a purge queue written by the old code.
>   * We can and should fix that narrow issue, but it makes me think the
> upgrade tests are in general not as robust as one wants. Can the FS
> team do an audit?
>
> * Looking at the issue (new data members were encoded at the front,
> and without bumping the struct_v), I think this should have been
> detected if we ran it through the ceph-object-corpus dencoder tests
> and had the right bits and pieces present there. These tests are part
> of "make check" but haven't been updated in quite a while — I think
> this bumps the priority of doing that work up since we now have a
> specific failure it would have detected early.
>
> Thanks! Any other thoughts?
> -Greg