Hi,
On 04/09/2023 09:54, Daniel Vetter wrote:
> On Wed, 30 Aug 2023 at 17:14, Helen Koike <helen.koike@xxxxxxxxxxxxx> wrote:
>>
>> On 30/08/2023 11:57, Maxime Ripard wrote:
>>>
>>> I agree that we need a baseline, but that baseline should be
>>> defined by the tests own merits, not their outcome on a
>>> particular platform.
>>>
>>> In other words, I want all drivers to follow that baseline, and
>>> if they don't it's a bug we should fix, and we should be vocal
>>> about it. We shouldn't ignore the test because it's broken.
>>>
>>> Going back to the example I used previously,
>>> kms_hdmi_inject@inject-4k shouldn't fail on mt8173, ever. That's
>>> a bug. Ignoring it and reporting that "all tests are good" isn't
>>> ok. There's something wrong with that driver and we should fix
>>> it.
>>>
>>> Or at the very least, explain in much details what is the
>>> breakage, how we noticed it, why we can't fix it, and how to
>>> reproduce it.
>>>
>>> Because in its current state, there's no chance we'll ever go
>>> over that test list and remove some of them. Or even know if, if
>>> we ever fix a bug somewhere, we should remove a flaky or failing
>>> test.
>>>
>>> [...]
>>>
>>>> we need to have a clear view about which tests are not
>>>> corresponding to it, so we can start fixing. First we need to
>>>> be aware of the issues so we can start fixing them, otherwise
>>>> we will stay in the "no tests no failures" ground :)
>>>
>>> I think we have somewhat contradicting goals. You want to make
>>> regression testing, so whatever test used to work in the past
>>> should keep working. That's fine, but it's different from
>>> "expectations about what the DRM drivers are supposed to pass in
>>> the IGT test suite" which is about validation, ie "all KMS
>>> drivers must behave this way".
>>
>> [...]
>>
>> We could have some policy: if you want to enable a certain device
>> in the CI, you need to make sure it passes all tests first to force
>> people to go fix the issues, but maybe it would be a big barrier.
>>
>> I'm afraid that, if a test fail (and it is a clear bug), people
>> would just say "work for most of the cases, this is not a priority
>> to fix" and just start ignoring the CI, this is why I think
>> regression tests is a good way to start with.
>
> I think eventually we need to get to both goals, but currently
> driver and test quality just isn't remotely there.
>
> I think a good approach would be if CI work focuses on the pure sw
> tests first, so kunit and running igt against vgem/vkms. And then we
> could use that to polish a set of must-pass igt testcases, which
> also drivers in general are supposed to pass. Plus ideally weed out
> the bad igts that aren't reliable enough or have bad assumptions.
>
> For hardware I think it will take a very long time until we get to a
> point where CI can work without a test result list, we're nowhere
> close to that. But for virtual driver this really should be
> achievable, albeit with a huge amount of effort required to get
> there I think.
Yeah, this is what our experience with Mesa (in particular) has taught us.
Having 100% of the tests pass 100% of the time on 100% of the platforms
is a great goal that everyone should aim for. But it will also never happen.
Firstly, we're just not there yet today. Every single GPU-side DRM
driver has userspace-triggerable faults which cause occasional errors in
GL/Vulkan tests. Every single one. We deal with these in Mesa by
retrying; if we didn't retry, across the breadth of hardware we test,
I'd expect 99% of should-succeed merges to fail because of these
intermittent bugs in the DRM drivers. We don't have the same figure for
KMS - because we don't test it - but I'd be willing to bet no driver is
100% if you run tests often enough.
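The retry approach described above can be sketched roughly like this (a hypothetical illustration, not Mesa's actual CI code; the `run_test` callable and the retry count are assumptions):

```python
def run_with_retries(run_test, name, max_attempts=3):
    """Re-run a test a few times so intermittent driver faults don't
    fail the whole pipeline; only report failure if every attempt fails.

    Returns (status, attempts), where status is 'pass', 'flake' (passed
    only on a retry), or 'fail'.
    """
    for attempt in range(1, max_attempts + 1):
        if run_test(name):
            # A pass on a later attempt still counts as working, but
            # flagging it lets us track flakiness trends over time.
            return ("flake" if attempt > 1 else "pass", attempt)
    return ("fail", max_attempts)
```

The important property is that a retried pass is recorded differently from a first-try pass, so flakiness remains visible instead of being silently absorbed.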
Secondly, we will never be there. If we could pause for five years and
sit down and make all the current usecases for all the current hardware
on the current kernel run perfectly, we'd probably get there. But we can't:
there's new hardware, new userspace, and hundreds of new kernel trees.
Even without the first two, what happens when the Arm SMMU maintainers
(choosing a random target to pick on, sorry Robin) introduce subtle
breakage which makes a lot of tests fail some of the time? Do we refuse
to backmerge Linus into DRM until it's fixed, or do we disable all
testing on Arm until it's fixed? When we've done that, what happens when
we re-enable testing, and discover that a bunch of tests get broken
because we haven't been testing?
Thirdly, hardware is capricious. 'This board doesn't make it to u-boot'
is a clear infrastructure error, but if you test at sufficient scale,
cold solder joints or failing caps surface far more often than you might think.
And you can't really pick those out by any other means than running at
scale, dealing with non-binary results, and looking at the trends over
time. (Again this is something we do in Mesa - we graph test failures
per DUT, look for outliers, and pull DUTs out of the rotation when
they're clearly defective. But that only works if you actually run
enough tests on them in the first place to discover trends - if you stop
at the first failed test, it's impossible to tell the difference between
'infuriatingly infrequent kernel/test bug?' and 'cracked main board
maybe?'.)
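That kind of per-DUT trend analysis might look something like the following sketch (the names, the minimum-run cutoff, and the threshold heuristic are all made up for illustration; Mesa's actual tooling differs):

```python
from statistics import median

def find_suspect_duts(results, min_runs=100, factor=5.0, floor=0.01):
    """Flag DUTs whose failure rate is an outlier versus the fleet.

    `results` maps DUT name -> (failed_runs, total_runs). DUTs with too
    few runs are skipped: without enough data you can't tell a rare
    kernel/test bug from a cracked main board.
    """
    rates = {dut: f / t for dut, (f, t) in results.items() if t >= min_runs}
    if not rates:
        return []
    # Compare against the median so one badly broken board doesn't
    # drag the baseline up and hide itself.
    baseline = median(rates.values())
    threshold = max(baseline * factor, floor)
    return sorted(dut for dut, r in rates.items() if r > threshold)
```

A DUT well above the fleet's typical failure rate gets pulled out of rotation for inspection, while DUTs with too little history are left alone until there's enough data to judge them.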
What we do know is that we _can_ classify tests into four sets of
expectations. Always-passing tests should always pass. Always-failing
tests should always fail (and the expectations must be updated if you
make them pass). Flaking tests work often enough that they'll almost
certainly pass if you run them a couple of times, but fail often enough
that you can't rely on them. Lastly, you simply skip tests which exhibit
catastrophic failure, i.e. a local DoS which takes down the whole test suite.
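Concretely, checking one result against those four sets might look like this sketch (the set names and return strings are illustrative, not Mesa's real interface; Mesa keeps similar per-device lists as plain text files):

```python
def classify_result(name, status, expected_fails, known_flakes, skips):
    """Compare one test result against the expectation lists.

    Returns 'ok' when the result matches expectations, 'skip' when the
    test must never run, or a string describing a mismatch that a
    human needs to look at.
    """
    if name in skips:
        # Catastrophic tests (e.g. a local DoS) are never run at all.
        return "skip"
    if name in known_flakes:
        # Flakes can go either way; never gate a merge on them.
        return "ok"
    if name in expected_fails:
        # A pass here is good news, but the expectations file must be
        # updated so the fix stays locked in.
        return "ok" if status == "fail" else "unexpected-pass"
    return "ok" if status == "pass" else "unexpected-fail"
```

Only the two "unexpected" outcomes block a merge; everything else matches the recorded expectations.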
By keeping those sets of expectations, we've been able to keep Mesa
pretty clear of regressions, whilst having a very clear set of things
that should be fixed to point to. It would be great if that set were
empty, but it just isn't. Having it is far better than the two
alternatives: either not testing at all (obviously bad), or having
the test always be red so it's always ignored (might as well not test at all).
Cheers,
Daniel