On Tue, Feb 05, 2019 at 11:34:54AM +0100, Johannes Schindelin wrote:

> Peff, you asked at the Contributors' Summit for a way to get notified
> when CI fails for your patch, and I was hesitant to add it (even if it
> would be straight-forward, really) because of the false positives.
>
> This is one such example, as the test fails:
>
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=944
>
> In particular, the tests t2024 and t5552 are broken for
> ma/doc-diff-usage-fix on Windows. The reason seems to be that those are
> already broken for the base commit that Junio picked:
> jk/diff-rendered-docs (actually, not the tip of it, but the commit
> fixed by Martin's patch).
>
> Of course, I understand why Junio picks base commits that are far, far
> in the past (and have regressions that current `master` does not
> have), it makes sense from the point of view where the fixes should be
> as close to the commits they fix. The downside is that we cannot
> automate regression testing more than we do now, with e.g. myself
> acting as curator of test failures.

Thanks for this real-world example; it's definitely something I hadn't
thought too hard about.

I think this is a specific case of more general problems with testing.
We strive to have every commit pass all of the tests, so that people
building on top have a known-good base. But in practice that can fail
for many reasons (not exhaustive, just things I've personally seen):

  - some tests are flaky, and will fail intermittently

  - some tests _used_ to pass, but no longer do because the environment
    has changed (e.g., relying on behavior of system tools that change)

  - historical mistakes, where "all tests pass" was only true on one
    platform but not others (which I think is what's going on here)

And these are a problem not just for automated CI, but for running
"make test" locally. I don't think we'll ever get 100% accuracy, so at
some point we have to accept some annoying false positives. The
question is how often they come up, and whether they drown out real
problems.

Testing under Linux, my experience with the first two is that they're
uncommon enough not to be a huge burden. The third class seems likely
to be a lot more common for Windows builds, since we haven't
historically run tests on them. But it should also, in theory, get
better going forward, as we fix those test failures (and new commits
are less likely to be built on truly ancient history).

So IMHO this isn't really a show-stopper problem, so much as a sign of
a maturing test/CI setup (I say "maturing", not "mature", as it seems
we've probably still got a ways to go).

As far as notifications go, it probably makes sense for them to be
something the user has to sign up for anyway, so at that point they're
making their own choice about whether the signal-to-noise ratio is
acceptable.

I also think there are ways to automate away some of these problems
(e.g., flake detection by repeating test failures, or re-running
failures on parent commits to check whether a patch actually introduced
the problem). But implementing those is non-trivial work, so I am
certainly not asking you to do it.

-Peff
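
P.S. To make that last idea a bit more concrete, here is a rough
sketch in Python of what such a classifier could look like: retry the
failing script to catch flakes, then re-run it at the parent commit to
see whether the patch under test is actually at fault. All of the
names and paths are invented for illustration; nothing like this
exists in our CI today.

  #!/usr/bin/env python3
  # Hypothetical sketch, not part of git's CI. Takes the name of a
  # failing test script (relative to t/) and classifies the failure.
  import subprocess
  import sys

  def run_test(script):
      # Test scripts are plain shell in t/; exit code 0 means pass.
      return subprocess.run(["sh", script], cwd="t").returncode == 0

  def rev_parse(rev):
      return subprocess.run(["git", "rev-parse", rev],
                            capture_output=True, text=True,
                            check=True).stdout.strip()

  def checkout(rev):
      subprocess.run(["git", "checkout", "--quiet", rev], check=True)

  def classify(script, repeats=3):
      # Flake detection: if the script passes on any retry, call it flaky.
      if any(run_test(script) for _ in range(repeats)):
          return "flaky"
      orig = rev_parse("HEAD")
      try:
          # Re-run at the parent commit; a real version would rebuild
          # git here before running the test again.
          checkout("HEAD^")
          preexisting = not run_test(script)
      finally:
          checkout(orig)
      return "pre-existing" if preexisting else "introduced-by-patch"

  if __name__ == "__main__":
      print(classify(sys.argv[1]))

A real tool would need to rebuild at each commit and probably cache
results, but the basic shape is just "retry, then compare against the
parent".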