On Mon, Sep 18, 2023 at 06:35:13PM -0300, Helen Koike wrote:
> > > > I don't quite see the same picture from your side though. For example, my reading of what you've said is that flaky tests are utterly unacceptable, as are partial runs, and we shouldn't pretend otherwise. With your concrete example (which is really helpful, so thanks), what happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing until it's perfect, or does MT8173 testing always fail because that test does?
> > >
> > > It's not clear to me why that test is even running in the first place? There's been some confusion on my side here about what we're going to test with this. You've mentioned Mesa and GPUs before, but that's a KMS test, so there must be more to it.
> > >
> > > Either way, it's a relevant test, so I guess why not. It turns out that the test is indeed flaky, so I guess we could add it to the flaky tests list.
> > >
> > > BUT
> > >
> > > I want to have every opportunity to fix whatever that failure is.
> >
> > Agreed so far!
> >
> > > So:
> > >
> > > - Is the test broken? If so, we should report it to IGT dev and remove it from the test suite.
> > > - If not, has that test failure been reported to the driver author?
> > > - If there's no answer/fix, we can add it to the flaky tests list, but do we have some way to reproduce the test failure?
> > >
> > > The last part is especially critical. Looking at the list itself, I have no idea what board, kernel version or configuration it was seen on, or what the failure rate was. Assuming I spend some time looking at the infra to find the board and configuration, how many times do I have to run the tests to expect to reproduce the failure (and thus consider it fixed if it doesn't occur anymore)?
> > >
> > > Like, with that board and test, if my first 100 runs of the test work fine, is it reasonable for me to consider it fixed, or is it only supposed to happen once every 1000 runs?
>
> I wonder if this should be an overall policy, or whether we should just let the maintainer decide.
>
> In any case, these stress tests must be run from time to time to verify whether flakes are still flakes. We could do it automatically, but we need to evaluate how to do it properly so it doesn't overload the infra.

That would be a great thing to do, but we can also reasonably expect that we will have other farms that may not run those tests on a regular basis, and we will have some manual testing too, so I think it's still valuable.

> > > So, ideally, we'd have some (mandatory) metadata in the test lists, with a link to the bug report, the board (DT name?) it happened with, the version and configuration it was first seen with, and an approximation of the failure rate, for every flaky test list.
> > >
> > > I understand that it's probably difficult to get that after the fact for the tests that were already merged, but I'd really like to get it enforced for every new test going forward.
> > >
> > > That should hopefully get us in a much better position to fix some of those test issues. And failing that, I can't see how that's sustainable.
> >
> > OK yeah, and we're still agreed here. That is definitely the standard we should be aiming for. It is there for some - see drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt - but it should be there for the rest, it's true.
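To make that concrete, the kind of entry I have in mind would look something like this (a made-up sketch: the test name, link and numbers are invented, and it assumes whatever consumes these lists tolerates '#' comment lines):

  # Bug: https://gitlab.freedesktop.org/some/project/-/issues/NNN (placeholder)
  # Board: mt8173-elm-hana (DT name)
  # First seen: v6.6-rc1, defconfig
  # Failure rate: roughly 1 in 100 runs
  kms_hdmi_inject@inject-4k

That way the bug link, the board and a rough reproduction rate travel with the entry, instead of having to be dug out of the infra after the fact.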
> > (The specific board/DT it was observed on can be easily retconned because we only run on one specific board type per driver, again to make things more predictable; we could go back and retrospectively add those in a header comment?)
> >
> > For flakes, it can be hard to pin them down, because, well, they're flaky. Usually when we add things in Mesa (sorry to keep coming back to Mesa - it's not to say that it's the objective best thing that everything should follow, only that it's the thing we have the most experience with that we know works well), we do a manual bisect and try to pin the blame on a specific merge request which looks like the most likely culprit. If nothing obvious jumps out, we just note when it was first observed and provide some sample job logs. But yeah, it should be more verbose.
> >
> > FWIW, the reason it wasn't done here - not to say that it shouldn't have been done better, but here we are - is that we just hammered a load of test runs, vacuumed up the results with a script, and that's what generated those files. Given the number of tests and devices, it was hard to narrow each down individually, but yeah, it is something which really wants further analysis and drilling into. It's a good to-do, and I agree it should be the standard going forward.
>
> Yes, during development I was just worried about getting a pipeline that would succeed most of the time (otherwise people would start ignoring it when it fails), so they just got run a couple of times and a script filled the flakes list.
> For me the idea was "let's get a starting point" first, but yeah, we need to improve how we deal with it from now on.

Yeah, like I said, there's not much we can do for those 250-ish flakes we currently have in tree at the moment. I'd rather stay at 250 tests with not enough context than keep expanding that list :)

> > > And Mesa does show what I'm talking about:
> > >
> > > $ find -name '*-flakes.txt' | xargs git diff --stat e58a10af640ba58b6001f5c5ad750b782547da76
> > > [...]
> > >
> > > In the history of Mesa, there's never been a single test removed from a flaky test list.
> >
> > As Rob says, that's definitely wrong. But there is a good point in there: how do you know a test isn't flaky anymore? 100 runs is a reasonable benchmark, but 1000 is ideal. At a 1% failure rate, with 20 devices, that's just too many spurious false-fails to have a usable workflow.
> >
> > We do have some tools to make stress testing easier, but those need to be better documented. We'll fix that. The tools we have which also pull out the metadata etc. also need documenting - right now they aren't, because they're under _extremely_ heavy development, but they can be further enhanced to e.g. pull out the igt results automatically and point very clearly to the cause. Also on the to-do.
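On the "how many runs is enough" question, even before that tooling is documented, a dumb loop is already a usable baseline. Something along these lines (the test binary, subtest and run count are only an example):

  fails=0; runs=1000
  for i in $(seq 1 "$runs"); do
      ./kms_hdmi_inject --run-subtest inject-4k >/dev/null 2>&1
      ret=$?
      # 77 is the conventional "skip" exit code, don't count skips as failures
      [ "$ret" -ne 0 ] && [ "$ret" -ne 77 ] && fails=$((fails + 1))
  done
  echo "failed $fails out of $runs runs"

And to put the 1% / 20 devices point in numbers: a single test flaking 1% of the time on each of 20 device types means roughly 1 - 0.99^20, so about 18% of pipelines failing spuriously, which is more than enough for people to start ignoring CI.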
> > > > Only maintainers can actually fix the drivers (or the tests tbf). But doing the testing does let us be really clear to everyone what the actual state is, and that way people can make informed decisions too. And the only way we're going to drive the test rate down is by the subsystem maintainers enforcing it.
> > >
> > > Just FYI, I'm not on the other side of the fence there, I'd really like to have some kind of validation. I talked about it at XDC some years ago, and have discussed it with several people at length over the years. So I'm definitely not in the CI-is-bad camp.
> > >
> > > > Does that make sense on where I'm (and I think a lot of others are) coming from?
> > >
> > > That makes sense from your perspective, but it's not clear to me how you can expect maintainers to own the tests if they were never involved in the process.
> > >
> > > They are not in Cc of the flaky tests patches, the failures are not reported to them; how can they own that process if we never reached out and involved them?
> > >
> > > We're all overworked; you can't expect them to just look at the flaky test list every now and then and figure it out.
> >
> > Absolutely. We got acks (or at least not-nacks) from the driver developers, but yeah, they should absolutely be part of the loop for those updates. I don't think we can necessarily block on them though. Say we add vc4 KMS tests, then after a backmerge we start to see a bunch of flakes on it, but you're sitting on a beach for a couple of weeks. If we wait for you to get back, see it, and merge it, then that's two weeks of people submitting vc4 driver changes and getting told that their changes failed CI. That's exactly what we want to avoid, because it erodes confidence in CI and its usefulness when people expect failures and ignore them by default.
> >
> > So I would say that it's reasonable for expectations to be updated according to what actually happens in practice, but also to make sure that the maintainers are explicitly informed and kept in the loop, and not just surprised when they look at the lists and see that a bunch of stuff happened without their knowledge.
>
> I was thinking of adding entries in the MAINTAINERS file pointing each flake/skip/fail list file to its maintainers, so that get_maintainer.pl can return the right thing.

Yeah, I think it's the best thing to do at the moment. It's cheap and will work ok.
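For the record, the kind of MAINTAINERS hunk I'd expect is just an extra file pattern in the existing driver entries, something like (the entry name and glob are purely illustrative, to be adjusted per driver):

  DRM DRIVERS FOR ROCKCHIP
  ...
  F:	drivers/gpu/drm/ci/xfails/rockchip-*

Running scripts/get_maintainer.pl -f drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt would then point at the Rockchip driver maintainers as well.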
Maxime