On Mon, Sep 18, 2023 at 06:35:13PM -0300, Helen Koike wrote:
> > > > I don't quite see the same picture from your side though. For example, my reading of what you've said is that flaky tests are utterly unacceptable, as are partial runs, and we shouldn't pretend otherwise. With your concrete example (which is really helpful, so thanks), what happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing until it's perfect, or does MT8173 testing always fail because that test does?
> > >
> > > It's not clear to me why that test is even running in the first place? There's been some confusion on my side here about what we're going to test with this. You've mentioned Mesa and GPUs before, but that's a KMS test, so there must be more to it.
> > >
> > > Either way, it's a relevant test, so I guess why not. It turns out that the test is indeed flaky, so I guess we could add it to the flaky tests list.
> > >
> > > BUT
> > >
> > > I want to have every opportunity to fix whatever that failure is.
> >
> > Agreed so far!
> >
> > > So:
> > >
> > > - Is the test broken? If so, we should report it to IGT dev and remove it from the test suite.
> > > - If not, has that test failure been reported to the driver author?
> > > - If there's no answer/fix, we can add it to the flaky tests list, but do we have some way to reproduce the test failure?
> > >
> > > The last part is especially critical. Looking at the list itself, I have no idea what board, kernel version or configuration it was seen on, or what the failure rate was. Assuming I spend some time looking at the infra to find the board and configuration, how many times do I have to run the tests to expect to reproduce the failure (and thus consider it fixed if it doesn't occur anymore)?
> > >
> > > Like, with that board and test, if my first 100 runs of the test work fine, is it reasonable for me to consider it fixed, or is it only supposed to happen once every 1000 runs?
>
> I wonder if this should be an overall policy, or whether we should just let the maintainer decide.
>
> In any case, these stress tests must be run from time to time to verify whether flakes are still flakes. We could do it automatically, but we need to evaluate how to do it properly so it doesn't overload the infra.

That would be a great thing to do, but we can also reasonably expect that we will have other farms that may not run those tests on a regular basis, and we will have some manual testing too, so I think it's still valuable.

> > > So, ideally, we'd have some (mandatory) metadata in the test lists, with a link to the bug report, the board (DT name?) it happened with, the version and configuration it was first seen with, and an approximation of the failure rate, for every flaky test list.
> > >
> > > I understand that it's probably difficult to get that after the fact for the tests that were already merged, but I'd really like to get it enforced for every new test going forward.
> > >
> > > That should hopefully get us in a much better position to fix some of those test issues. And failing that, I can't see how that's sustainable.
> >
> > OK yeah, and we're still agreed here. That is definitely the standard we should be aiming for. It is there for some - see drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt - but it should be there for the rest, it's true.
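To make that concrete, the kind of entry I have in mind would look something like this (a made-up sketch: the test name, link and numbers are invented, and it assumes whatever consumes these lists tolerates '#' comment lines):

  # Bug: https://gitlab.freedesktop.org/some/project/-/issues/NNN (placeholder)
  # Board: mt8173-elm-hana (DT name)
  # First seen: v6.6-rc1, defconfig
  # Failure rate: roughly 1 in 100 runs
  kms_hdmi_inject@inject-4k

That way the bug link, the board and a rough reproduction rate travel with the entry, instead of having to be dug out of the infra after the fact.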
> > (The specific board/DT it was observed on can be easily retconned because we only run on one specific board type per driver, again to make things more predictable; we could go back and retrospectively add those in a header comment?)
> >
> > For flakes, it can be hard to pin them down, because, well, they're flaky. Usually when we add things in Mesa (sorry to keep coming back to Mesa - it's not to say that it's the objective best thing that everything should follow, only that it's the thing we have the most experience with that we know works well), we do a manual bisect and try to pin the blame on a specific merge request which looks like the most likely culprit. If nothing obvious jumps out, we just note when it was first observed and provide some sample job logs. But yeah, it should be more verbose.
> >
> > FWIW, the reason it wasn't done here - not to say that it shouldn't have been done better, but here we are - is that we just hammered a load of test runs, vacuumed up the results with a script, and that's what generated those files. Given the number of tests and devices, it was hard to narrow each down individually, but yeah, it is something which really wants further analysis and drilling into. It's a good to-do, and I agree it should be the standard going forward.
>
> Yes, during development I was just worried about getting a pipeline that would succeed most of the time (otherwise people would start ignoring it when it fails), so they just got run a couple of times and a script filled the flakes list.
> For me the idea was "let's get a starting point" first, but yeah, we need to improve how we deal with it from now on.

Yeah, like I said, there's not much we can do for those 250-ish flakes we currently have in tree at the moment. I'd rather stay at 250 tests with not enough context than keep expanding that list :)

> > > And Mesa does show what I'm talking about:
> > >
> > > $ find -name '*-flakes.txt' | xargs git diff --stat e58a10af640ba58b6001f5c5ad750b782547da76
> > > [...]
> > >
> > > In the history of Mesa, there's never been a single test removed from a flaky test list.
> >
> > As Rob says, that's definitely wrong. But there is a good point in there: how do you know a test isn't flaky anymore? 100 runs is a reasonable benchmark, but 1000 is ideal. At a 1% failure rate, with 20 devices, that's just too many spurious false-fails to have a usable workflow.
> >
> > We do have some tools to make stress testing easier, but those need to be better documented. We'll fix that. The tools we have which also pull out the metadata etc. also need documenting - right now they aren't, because they're under _extremely_ heavy development, but they can be further enhanced to e.g. pull out the igt results automatically and point very clearly to the cause. Also on the to-do.
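On the "how many runs is enough" question, even before that tooling is documented, a dumb loop is already a usable baseline. Something along these lines (the test binary, subtest and run count are only an example):

  fails=0; runs=1000
  for i in $(seq 1 "$runs"); do
      ./kms_hdmi_inject --run-subtest inject-4k >/dev/null 2>&1
      ret=$?
      # 77 is the conventional "skip" exit code, don't count skips as failures
      [ "$ret" -ne 0 ] && [ "$ret" -ne 77 ] && fails=$((fails + 1))
  done
  echo "failed $fails out of $runs runs"

And to put the 1% / 20 devices point in numbers: a single test flaking 1% of the time on each of 20 device types means roughly 1 - 0.99^20, so about 18% of pipelines failing spuriously, which is more than enough for people to start ignoring CI.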
> > > > Only maintainers can actually fix the drivers (or the tests tbf). But doing the testing does let us be really clear to everyone what the actual state is, and that way people can make informed decisions too. And the only way we're going to drive the test rate down is by the subsystem maintainers enforcing it.
> > >
> > > Just FYI, I'm not on the other side of the fence there, I'd really like to have some kind of validation. I talked about it at XDC some years ago, and have discussed it with several people at length over the years. So I'm definitely not in the CI-is-bad camp.
> > >
> > > > Does that make sense on where I'm (and I think a lot of others are) coming from?
> > >
> > > That makes sense from your perspective, but it's not clear to me how you can expect maintainers to own the tests if they were never involved in the process.
> > >
> > > They are not in Cc of the flaky tests patches, the failures are not reported to them; how can they own that process if we never reached out and involved them?
> > >
> > > We're all overworked; you can't expect them to just look at the flaky test list every now and then and figure it out.
> >
> > Absolutely. We got acks (or at least not-nacks) from the driver developers, but yeah, they should absolutely be part of the loop for those updates. I don't think we can necessarily block on them though. Say we add vc4 KMS tests, then after a backmerge we start to see a bunch of flakes on it, but you're sitting on a beach for a couple of weeks. If we wait for you to get back, see it, and merge it, then that's two weeks of people submitting vc4 driver changes and getting told that their changes failed CI. That's exactly what we want to avoid, because it erodes confidence in CI and its usefulness when people expect failures and ignore them by default.
> >
> > So I would say that it's reasonable for expectations to be updated according to what actually happens in practice, but also to make sure that the maintainers are explicitly informed and kept in the loop, and not just surprised when they look at the lists and see that a bunch of stuff happened without their knowledge.
>
> I was thinking of adding entries in the MAINTAINERS file pointing each flake/skip/fail list file to its maintainers, so that get_maintainer.pl can return the right thing.

Yeah, I think it's the best thing to do at the moment. It's cheap and will work ok.
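For the record, the kind of MAINTAINERS hunk I'd expect is just an extra file pattern in the existing driver entries, something like (the entry name and glob are purely illustrative, to be adjusted per driver):

  DRM DRIVERS FOR ROCKCHIP
  ...
  F:	drivers/gpu/drm/ci/xfails/rockchip-*

Running scripts/get_maintainer.pl -f drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt would then point at the Rockchip driver maintainers as well.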
Maxime