On Mon, Sep 11, 2023 at 02:13:43PM +0200, Michel Dänzer wrote:
> On 9/11/23 11:34, Maxime Ripard wrote:
> > On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
> >> Yeah, this is what our experience with Mesa (in particular) has taught us.
> >>
> >> Having 100% of the tests pass 100% of the time on 100% of the platforms is a
> >> great goal that everyone should aim for. But it will also never happen.
> >>
> >> Firstly, we're just not there yet today. Every single GPU-side DRM driver
> >> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
> >> tests. Every single one. We deal with these in Mesa by retrying; if we
> >> didn't retry, across the breadth of hardware we test, I'd expect 99% of
> >> should-succeed merges to fail because of these intermittent bugs in the DRM
> >> drivers.
> >
> > So the plan is only to ever test rendering devices? It should have been
> > made clearer then.
> >
> >> We don't have the same figure for KMS - because we don't test it - but
> >> I'd be willing to bet no driver is 100% if you run tests often enough.
> >
> > And I would still consider that a bug that we ought to fix, and
> > certainly not something we should sweep under the rug. If half the tests
> > are not running on a driver, then fine, they aren't. I'm not really
> > against having failing tests, I'm against not flagging unreliable tests
> > on a given hardware as failing tests.
>
> A flaky test will by definition give a "pass" result at least some of
> the time, which would be considered a failure by the CI if the test is
> marked as failing.
>
> >> Secondly, we will never be there. If we could pause for five years and sit
> >> down making all the current usecases for all the current hardware on the
> >> current kernel run perfectly, we'd probably get there. But we can't: there's
> >> new hardware, new userspace, and hundreds of new kernel trees.
> >
> > Not with that attitude :)
>
> Attitude is not the issue, the complexity of the multiple systems
> involved is.

FTR, that was a meme/joke.

> > I'm not sure it's actually an argument, really. 10 years ago, we would
> > never have been at "every GPU on the market has an open-source driver"
> > here. 5 years ago, we would never have been at this-series-here. That
> > didn't stop anyone making progress, everyone involved in that thread
> > included.
>
> Even assuming perfection is achievable at all (which is very doubtful,
> given the experience from the last few years of CI in Mesa and other
> projects), if you demand perfection before even taking the first step,
> it will never get off the ground.

Expecting perfection and scale from the get-go isn't reasonable, yes.
Building a small, "perfect" (your words, not mine) system that you can
later expand is doable. And that's very much a design choice.

> > How are we even supposed to detect those failures in the first
> > place if tests are flagged as unreliable?
>
> Based on experience with Mesa, only a relatively small minority of
> tests should need to be marked as flaky / not run at all. The majority
> of tests are reliable and can catch regressions even while some tests
> are not yet.

I understand and acknowledge that it worked with Mesa. That's great for
Mesa. That still doesn't mean it's a panacea that fits every project.

> > No matter what we do here, what you describe will always happen. Like,
> > if we do flag those tests as unreliable, what exactly prevents another
> > issue from coming in on top, undetected, and what will happen when we
> > re-enable testing?
>
> Any issues affecting a test will need to be fixed before (re-)enabling
> the test for CI.

If that underlying issue is never fixed, at what point do we consider it
a failure that should never be re-enabled? And whose role is that?

> > On top of that, you kind of hinted at that yourself, but what set of
> > tests will pass is a property linked to a single commit. Having that
> > list within the kernel already alters that: you'll need to merge a new
> > branch, add a bunch of fixes and then change the test list state. You
> > won't have the same tree you originally tested (and defined the test
> > state list for).
>
> Ideally, the test state lists should be changed in the same commits
> which affect the test results. It'll probably take a while yet to get
> there for the kernel.
>
> > It might or might not be an issue for Linus' release, but I can
> > definitely see the trouble already for stable releases where fixes will
> > be backported, but the test state list certainly won't be updated.
>
> If the stable branch maintainers want to take advantage of CI for the
> stable branches, they may need to hunt for corresponding state list
> commits sometimes. They'll need to take that into account for their
> decision.

So we just expect the stable maintainers to track each and every patch
involved in a test run, make sure they are in a stable tree, and then
update the test list? Without having consulted them at all?

> >> By keeping those sets of expectations, we've been able to keep Mesa pretty
> >> clear of regressions, whilst having a very clear set of things that should
> >> be fixed to point to. It would be great if that set of things were zero,
> >> but it just isn't. Having that is far better than the two alternatives:
> >> either not testing at all (obviously bad), or having the test always be red
> >> so it's always ignored (might as well just not test).
> >
> > Isn't that what happens with flaky tests anyway?
>
> For a small minority of tests. Daniel was referring to whole test suites.
>
> > Even more so since we have 0 context when updating that list.
>
> The commit log can provide whatever context is needed.

Sure, but I've yet to see that happen. There are around 240 reported
flaky tests in 6.6-rc1, and none of them have any context. That new
series has a few dozen too, without any context either. And there's no
mention of that being the plan, or a patch adding such a policy for all
tests going forward. So I'm still fairly doubtful it will ever happen.

> > I've asked a couple of times, I'll ask again. In that other series, on
> > the MT8173, kms_hdmi_inject@inject-4k is set up as flaky (which is a KMS
> > test, btw).
> >
> > I'm a maintainer for that part of the kernel and I'd like to look into
> > it, because it's seriously something that shouldn't fail, ever: the
> > hardware isn't involved.
> >
> > How can I figure out now (or worse, let's say in a year) how to
> > reproduce it? What kernel version was affected? With what board? After
> > how many occurrences?
> >
> > Basically, how can I see that the bug is indeed there (or got fixed
> > since), and how to start fixing it?
>
> Many of those things should be documented in the commit log of the
> state list change.
>
> How the CI works in general should be documented in some appropriate
> place in tree.
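Just to make the ask concrete, here is a rough sketch (in Python, purely
for illustration) of what an annotated flake list and its CI-side
handling could look like. The file format, the field names and the
helpers below are hypothetical, not what drm-ci or Mesa actually ship;
the point is only the kind of context each entry would carry.

# Illustration only: a hypothetical annotated flake list plus the CI-side
# handling it would enable. The format and field names are made up for
# this sketch; they are not the actual drm-ci or Mesa expectation format.
from dataclasses import dataclass


@dataclass
class FlakeEntry:
    test: str          # e.g. "kms_hdmi_inject@inject-4k"
    board: str         # board the flake was observed on
    kernel: str        # kernel version it was last observed with
    failure_rate: str  # e.g. "3/500 runs"
    report: str        # URL of the bug report or failing CI job


def parse_flakes(text: str) -> dict[str, FlakeEntry]:
    """Parse a hypothetical '#'-annotated flake list, e.g.:

        # board: mt8173-some-board
        # kernel: v6.6-rc1
        # failure-rate: 3/500
        # report: https://ci.example.org/issues/1234
        kms_hdmi_inject@inject-4k
    """
    entries: dict[str, FlakeEntry] = {}
    meta: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition(":")
            meta[key.strip()] = value.strip()
        else:
            entries[line] = FlakeEntry(
                test=line,
                board=meta.get("board", "unknown"),
                kernel=meta.get("kernel", "unknown"),
                failure_rate=meta.get("failure-rate", "unknown"),
                report=meta.get("report", "none"),
            )
            meta = {}  # the metadata applies to the next test only
    return entries


def classify(test: str, result: str, flakes: dict[str, FlakeEntry],
             expected_failures: set[str]) -> str:
    """Decide what a single test result means for the pipeline.

    This is also why a flaky test can't simply be listed as "failing":
    it passes some of the time, so either outcome has to be accepted
    without blocking the merge, while still being reported so the
    underlying bug has a chance of getting fixed. (Mesa's CI goes one
    step further and retries unexpected failures, as described above.)
    """
    if test in flakes:
        return "report-only"            # never gates the merge
    if result == "fail":
        return "expected-fail" if test in expected_failures else "regression"
    # an unexpected pass means the expectation list has gone stale
    return "stale-expectation" if test in expected_failures else "pass"

With entries like that in the tree, a driver maintainer could at least
tell which board and kernel version an entry was recorded against, how
often it failed, and where to look to check whether it is still
reproducible a year later.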
I think I'll stop the discussion there. It was merged anyway, so I'm not
quite sure why I was asked to give my feedback on this.

Every concern I raised was met with a giant "it worked on Mesa" handwave
or "someone will probably work on it at some point". And fine, I guess
I'm wrong.

Thanks
Maxime