Hey,

On Thu, 14 Sept 2023 at 10:54, Maxime Ripard <mripard@xxxxxxxxxx> wrote:
> On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:
> > Hopefully less mangled formatting this time: turns out Thunderbird +
> > plain text is utterly unreadable, so that's one less MUA that is
> > actually usable to send email to kernel lists without getting shouted
> > at.
>
> Sorry if it felt that way, it definitely wasn't my intention to shout at
> you. Email is indeed kind of a pain to deal with, and I wanted to keep
> the discussion going.

My bad - I didn't mean you _at all_. I was thinking of other, much less
pleasant, kernel maintainers, and the ongoing struggle of attempting to
actually send well-formatted email to kernel lists in 2023.

> > I don't quite see the same picture from your side though. For example,
> > my reading of what you've said is that flaky tests are utterly
> > unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> > With your concrete example (which is really helpful, so thanks), what
> > happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> > until it's perfect, or does MT8173 testing always fail because that
> > test does?
>
> It's not clear to me why that test is even running in the first place?
> There's been some confusion on my side here about what we're going to
> test with this. You've mentioned Mesa and GPUs before, but that's a KMS
> test so there must be more to it.
>
> Either way, it's a relevant test, so I guess why not. It turns out that
> the test is indeed flaky, so I guess we could add it to the flaky tests
> list.
>
> BUT
>
> I want to have every opportunity to fix whatever that failure is.

Agreed so far!

> So:
>
> - Is the test broken? If so, we should report it to IGT dev and remove
>   it from the test suite.
> - If not, has that test failure been reported to the driver author?
> - If no answer/fix, we can add it to the flaky tests list, but do we
>   have some way to reproduce the test failure?
> The last part is especially critical. Looking at the list itself, I have
> no idea what board, kernel version, or configuration was used, or what
> the failure rate was. Assuming I spend some time looking at the infra to
> find the board and configuration, how many times do I have to run the
> tests to expect to reproduce the failure (and thus consider it fixed if
> it doesn't occur anymore)?
>
> Like, with that board and test, if my first 100 runs of the test work
> fine, is it reasonable for me to consider it fixed, or is it only
> supposed to happen once every 1000 runs?
>
> So, ideally, we'd have some (mandatory) metadata in the test lists: a
> link to the bug report, the board (DT name?) it happened with, the
> version and configuration it was first seen with, and an approximation
> of the failure rate, for every flaky test list.
>
> I understand that it's probably difficult to get that after the fact for
> the tests that were already merged, but I'd really like to see it
> enforced for every new test going forward.
>
> That should hopefully put us in a much better position to fix some of
> those test issues. And failing that, I can't see how it's sustainable.

OK yeah, we're still agreed here. That is definitely the standard we
should be aiming for. It is already there for some - see
drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt - but it should be
there for the rest, it's true. (The specific board/DT each entry was
observed on can easily be retconned, because we only run on one specific
board type per driver, again to make things more predictable; we could
go back and retrospectively add those in a header comment.)

For flakes, it can be hard to pin them down, because, well, they're
flaky.
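To make that concrete, a header comment along these lines is roughly what
I have in mind - the field names and values below are just an
illustration of the shape, not an agreed format or real data:

    # Bug: <link to the report>
    # Board: <DT name the failure was observed on>
    # Kernel: <version + config it was first seen with>
    # Failure rate: <approximate, e.g. "2 out of 100 runs">
    kms_hdmi_inject@inject-audio

Since the lists are one-test-per-line files parsed by the CI scripts,
comment lines like these should be cheap to carry alongside each entry.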
Usually when we add things in Mesa (sorry to keep coming back to Mesa -
it's not to say that it's objectively the best thing that everything
should follow, only that it's the thing we have the most experience with
and that we know works well), we do a manual bisect and try to pin the
blame on a specific merge request which looks like the most likely
culprit. If nothing obvious jumps out, we just note when it was first
observed and provide some sample job logs.

But yeah, it should be more verbose. FWIW, the reason it wasn't done
here - not to say that it shouldn't have been done better, but here we
are - is that we just hammered out a load of test runs, vacuumed up the
results with a script, and that's what generated those files. Given the
number of tests and devices, it was hard to narrow each one down
individually, but yeah, it is something which really wants further
analysis and drilling into. It's a good to-do, and I agree it should be
the standard going forward.

> And Mesa does show what I'm talking about:
>
> $ find -name *-flakes.txt | xargs git diff --stat e58a10af640ba58b6001f5c5ad750b782547da76
> [...]
>
> In the history of Mesa, there's never been a single test removed from a
> flaky test list.

As Rob says, that's definitely wrong. But there is a good point in
there: how do you know a test isn't flaky anymore? 100 runs is a
reasonable benchmark, but 1000 is ideal. At a 1% failure rate, with 20
devices, that's just too many spurious false-fails to have a usable
workflow.

We do have some tools to make stress testing easier, but those need to
be better documented. We'll fix that. The tools we have which also pull
out the metadata etc. need documenting too - right now they aren't,
because they're under _extremely_ heavy development - but they can be
further enhanced to e.g. pull out the igt results automatically and
point very clearly to the cause. Also on the to-do list.

> > Only maintainers can actually fix the drivers (or the tests tbf).
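To put rough numbers on the 100-vs-1000-runs question: if we model a
flake as an independent per-run failure with probability p (a
back-of-the-envelope model, not something our tooling computes), the
chance of catching it at all in n runs falls out directly:

```python
def p_seen(p: float, n: int) -> float:
    """Probability of observing at least one failure in n runs of a
    test that flakes independently with per-run probability p."""
    return 1 - (1 - p) ** n

# A 1-in-1000 flake is far more likely missed than caught in 100 runs:
print(round(p_seen(0.001, 100), 3))   # ~0.095
print(round(p_seen(0.001, 1000), 3))  # ~0.632

# And a 1% flake, multiplied across a 20-device pipeline, means nearly
# one in five pipelines goes spuriously red:
print(round(p_seen(0.01, 20), 3))     # ~0.182
```

Which is exactly why 100 clean runs isn't proof a 1-in-1000 flake is
fixed, and why even "rare" flakes wreck the workflow once you multiply
them across a whole device farm.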
> > But doing the testing does let us be really clear to everyone what
> > the actual state is, and that way people can make informed decisions
> > too. And the only way we're going to drive the failure rate down is
> > by the subsystem maintainers enforcing it.
>
> Just FYI, I'm not on the other side of the fence there, I'd really like
> to have some kind of validation. I talked about it at XDC some years
> ago, and have discussed it with several people at length over the
> years. So I'm definitely not in the CI-is-bad camp.
>
> > Does that make sense on where I'm (and I think a lot of others are)
> > coming from?
>
> That makes sense from your perspective, but it's not clear to me how
> you can expect maintainers to own the tests if they were never involved
> in the process.
>
> They are not in Cc of the flaky tests patches, they are not notified
> that the test is failing; how can they own that process if we never
> reached out and involved them?
>
> We're all overworked, you can't expect them to just look at the flaky
> test list every now and then and figure it out.

Absolutely. We got acks (or at least not-nacks) from the driver
developers, but yeah, they should absolutely be part of the loop for
those updates. I don't think we can necessarily block on them though.

Say we add vc4 KMS tests, then after a backmerge we start to see a bunch
of flakes on them, but you're sitting on a beach for a couple of weeks.
If we wait for you to get back, see the report, and merge the update,
then that's two weeks of people submitting vc4 driver changes and
getting told that their changes failed CI. That's exactly what we want
to avoid, because it erodes the confidence in and usefulness of CI when
people expect failures and ignore them by default.
So I would say that it's reasonable for expectations to be updated
according to what actually happens in practice, but also that we should
make sure the maintainers are explicitly informed and kept in the loop,
and not just surprised when they look at the lists and see that a bunch
of stuff happened without their knowledge.

Again, there's much more to be done on the tooling here. Part of it is
CLI tools and automation, part of it is dashboards and easily-digestible
reporting, and then there's integration with things like KernelCI.
KCI(DB) is actually quite high up on the list, but we're mostly waiting
until a lot of the KCI rework happens so we can properly integrate with
the new system.

Right now a lot of the tooling we have is pretty involved - for example,
we do have ci-collate as a Python library which can inspect a number of
pipelines, pull out detailed status and logs, etc., but it mostly needs
to be used as a library with bespoke scripts rather than as a ready-made
tool. Work on that is ongoing though, to make it much clearer and more
accessible.

So I think we're on the same page and going in exactly the same
direction, just that this is a starting point rather than the desired
end state. And the main point is that having a set of xfails/flakes
parachuted in with little to no context is about getting an MVP
bootstrapped, rather than how we expect things to go in future.

Does that sound about right?

Cheers,
Daniel