Re: [PATCH v3] doc-diff: don't `cd_to_toplevel`

On Tue, Feb 05, 2019 at 11:34:54AM +0100, Johannes Schindelin wrote:

> Peff, you asked at the Contributors' Summit for a way to get notified when
> CI fails for your patch, and I was hesitant to add it (even if it would be
> straightforward, really) because of the false positives.
> 
> This is one such example, as the test fails:
> 
> 	https://dev.azure.com/gitgitgadget/git/_build/results?buildId=944
> 
> In particular, the tests t2024 and t5552 are broken for
> ma/doc-diff-usage-fix on Windows. The reason seems to be that those are
> already broken for the base commit that Junio picked:
> jk/diff-rendered-docs (actually, not the tip of it, but the commit fixed
> by Martin's patch).
> 
> Of course, I understand why Junio picks base commits that are far, far in
> the past (and have regressions that current `master` does not have); it
> makes sense from the point of view that fixes should be as close as
> possible to the commits they fix. The downside is that we cannot automate
> regression testing more than we do now, with e.g. myself acting as curator
> of test failures.

Thanks for this real-world example; it's definitely something I hadn't
thought too hard about.

I think this is a specific case of more general problems with testing.
We strive to have every commit pass all of the tests, so that people
building on top have a known-good base. But there are many reasons that
goal may fail in practice (not an exhaustive list, just things I've
personally seen):

  - some tests are flaky, and will fail intermittently

  - some tests _used_ to pass, but no longer do if the environment has
    changed (e.g., relying on behavior of system tools that change)

  - historical mistakes, where "all tests pass" was only true on one
    platform but not others (which I think is what's going on here)

And these are a problem not just for automated CI, but for running "make
test" locally. I don't think we'll ever get 100% accuracy, so at some
point we have to accept some annoying false positives. The question is
how often they come up, and whether they drown out real problems.
Testing under Linux, my experience with the first two is that they're
uncommon enough not to be a huge burden.

The third class seems likely to be a lot more common for Windows
builds, since we haven't historically run the tests there. But it
should also, in theory, get better going forward, as we fix those test
failures (and as new commits become less likely to be built on truly
ancient history).

So IMHO this isn't really a show-stopper problem, so much as a sign of
the maturing test/CI setup (I say "maturing", not "mature", as it seems
we've probably still got a ways to go). As far as notifications go, it
probably makes sense for them to be something users have to sign up for
anyway; at that point they're making their own choice about whether the
signal-to-noise ratio is acceptable.

I also think there are ways to automate away some of these problems
(e.g., flake detection by repeating test failures, re-running failures
on parent commits to check whether a patch actually introduced the
problem). But implementing those is non-trivial work, so I am certainly
not asking you to do it.
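
To make that concrete, here is a rough, untested sketch of the kind of
loop such automation might run (the test script name and the repeat
count are just placeholders for whatever actually failed in CI):

    # Hypothetical sketch; "t5552-*.sh" stands in for whichever test
    # script failed, and five repetitions is an arbitrary choice.
    failing=t5552-*.sh

    # Flake detection: re-run the failing script a few times and see
    # whether the failure is reproducible at all.
    for i in 1 2 3 4 5
    do
        (cd t && ./$failing) || echo "run $i failed"
    done

    # Parent check: if the same script also fails one commit back,
    # the patch under test probably did not introduce the problem.
    git checkout --detach HEAD^
    if ! (cd t && ./$failing)
    then
        echo "also fails on the parent commit"
    fi
    git checkout -   # go back to where we started

Nothing fancy, but a bot doing roughly that would already filter out a
lot of the noise discussed above.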

-Peff


