Hi,
On 04/09/2023 09:54, Daniel Vetter wrote:
> On Wed, 30 Aug 2023 at 17:14, Helen Koike <helen.koike@xxxxxxxxxxxxx> wrote:
>>
>> On 30/08/2023 11:57, Maxime Ripard wrote:
>>>
>>> I agree that we need a baseline, but that baseline should be
>>> defined by the tests own merits, not their outcome on a
>>> particular platform.
>>>
>>> In other words, I want all drivers to follow that baseline, and
>>> if they don't it's a bug we should fix, and we should be vocal
>>> about it. We shouldn't ignore the test because it's broken.
>>>
>>> Going back to the example I used previously,
>>> kms_hdmi_inject@inject-4k shouldn't fail on mt8173, ever. That's
>>> a bug. Ignoring it and reporting that "all tests are good" isn't
>>> ok. There's something wrong with that driver and we should fix
>>> it.
>>>
>>> Or at the very least, explain in much details what is the
>>> breakage, how we noticed it, why we can't fix it, and how to
>>> reproduce it.
>>>
>>> Because in its current state, there's no chance we'll ever go
>>> over that test list and remove some of them. Or even know if, if
>>> we ever fix a bug somewhere, we should remove a flaky or failing
>>> test.
>>>
>>> [...]
>>>
>>>> we need to have a clear view about which tests are not
>>>> corresponding to it, so we can start fixing. First we need to
>>>> be aware of the issues so we can start fixing them, otherwise
>>>> we will stay in the "no tests no failures" ground :)
>>>
>>> I think we have somewhat contradicting goals. You want to make
>>> regression testing, so whatever test used to work in the past
>>> should keep working. That's fine, but it's different from
>>> "expectations about what the DRM drivers are supposed to pass in
>>> the IGT test suite" which is about validation, ie "all KMS
>>> drivers must behave this way".
>>
>> [...]
>>
>> We could have some policy: if you want to enable a certain device
>> in the CI, you need to make sure it passes all tests first to force
>> people to go fix the issues, but maybe it would be a big barrier.
>>
>> I'm afraid that, if a test fail (and it is a clear bug), people
>> would just say "work for most of the cases, this is not a priority
>> to fix" and just start ignoring the CI, this is why I think
>> regression tests is a good way to start with.
>
> I think eventually we need to get to both goals, but currently
> driver and test quality just isn't remotely there.
>
> I think a good approach would be if CI work focuses on the pure sw
> tests first, so kunit and running igt against vgem/vkms. And then we
> could use that to polish a set of must-pass igt testcases, which
> also drivers in general are supposed to pass. Plus ideally weed out
> the bad igts that aren't reliable enough or have bad assumptions.
>
> For hardware I think it will take a very long time until we get to a
> point where CI can work without a test result list, we're nowhere
> close to that. But for virtual driver this really should be
> achievable, albeit with a huge amount of effort required to get
> there I think.
Yeah, this is what our experience with Mesa (in particular) has taught us.
Having 100% of the tests pass 100% of the time on 100% of the platforms
is a great goal that everyone should aim for. But it will also never happen.
Firstly, we're just not there yet today. Every single GPU-side DRM
driver has userspace-triggerable faults which cause occasional errors in
GL/Vulkan tests. Every single one. We deal with these in Mesa by
retrying; if we didn't retry, across the breadth of hardware we test,
I'd expect 99% of should-succeed merges to fail because of these
intermittent bugs in the DRM drivers. We don't have the same figure for
KMS - because we don't test it - but I'd be willing to bet no driver is
100% if you run tests often enough.
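The retry approach described above can be sketched roughly like this (a hypothetical illustration, not Mesa's actual CI code; the `run_test` callable and the retry count are assumptions):

```python
def run_with_retries(run_test, name, max_attempts=3):
    """Re-run a test a few times so intermittent driver faults don't
    fail the whole pipeline; only report failure if every attempt fails.

    Returns (status, attempts), where status is 'pass', 'flake' (passed
    only on a retry), or 'fail'.
    """
    for attempt in range(1, max_attempts + 1):
        if run_test(name):
            # A pass on a later attempt still counts as working, but
            # flagging it lets us track flakiness trends over time.
            return ("flake" if attempt > 1 else "pass", attempt)
    return ("fail", max_attempts)
```

The important property is that a retried pass is recorded differently from a first-try pass, so flakiness remains visible instead of being silently absorbed.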
Secondly, we will never be there. If we could pause for five years and
sit down and make all the current usecases for all the current hardware
on the current kernel run perfectly, we'd probably get there. But we can't:
there's new hardware, new userspace, and hundreds of new kernel trees.
Even without the first two, what happens when the Arm SMMU maintainers
(choosing a random target to pick on, sorry Robin) introduce subtle
breakage which makes a lot of tests fail some of the time? Do we refuse
to backmerge Linus into DRM until it's fixed, or do we disable all
testing on Arm until it's fixed? When we've done that, what happens when
we re-enable testing, and discover that a bunch of tests get broken
because we haven't been testing?
Thirdly, hardware is capricious. 'This board doesn't make it to u-boot'
is a clear infrastructure error, but if you test at sufficient scale,
cold solder joints or failing caps surface far more often than you might think.
And you can't really pick those out by any other means than running at
scale, dealing with non-binary results, and looking at the trends over
time. (Again this is something we do in Mesa - we graph test failures
per DUT, look for outliers, and pull DUTs out of the rotation when
they're clearly defective. But that only works if you actually run
enough tests on them in the first place to discover trends - if you stop
at the first failed test, it's impossible to tell the difference between
'infuriatingly infrequent kernel/test bug?' and 'cracked main board
maybe?'.)
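That kind of per-DUT trend analysis might look something like the following sketch (the names, the minimum-run cutoff, and the threshold heuristic are all made up for illustration; Mesa's actual tooling differs):

```python
from statistics import median

def find_suspect_duts(results, min_runs=100, factor=5.0, floor=0.01):
    """Flag DUTs whose failure rate is an outlier versus the fleet.

    `results` maps DUT name -> (failed_runs, total_runs). DUTs with too
    few runs are skipped: without enough data you can't tell a rare
    kernel/test bug from a cracked main board.
    """
    rates = {dut: f / t for dut, (f, t) in results.items() if t >= min_runs}
    if not rates:
        return []
    # Compare against the median so one badly broken board doesn't
    # drag the baseline up and hide itself.
    baseline = median(rates.values())
    threshold = max(baseline * factor, floor)
    return sorted(dut for dut, r in rates.items() if r > threshold)
```

A DUT well above the fleet's typical failure rate gets pulled out of rotation for inspection, while DUTs with too little history are left alone until there's enough data to judge them.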
What we do know is that we _can_ classify tests into four sets of
expectations. Always-passing tests should always pass. Always-failing
tests should always fail (and the expectations must be updated if you
make them pass). Flaking tests work often enough that they'll almost
certainly pass if you run them a couple of times, but fail often enough
that you can't rely on them. Lastly, you simply skip tests which exhibit
catastrophic failure, i.e. a local DoS which takes down the whole test suite.
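Concretely, checking one result against those four sets might look like this sketch (the set names and return strings are illustrative, not Mesa's real interface; Mesa keeps similar per-device lists as plain text files):

```python
def classify_result(name, status, expected_fails, known_flakes, skips):
    """Compare one test result against the expectation lists.

    Returns 'ok' when the result matches expectations, 'skip' when the
    test must never run, or a string describing a mismatch that a
    human needs to look at.
    """
    if name in skips:
        # Catastrophic tests (e.g. a local DoS) are never run at all.
        return "skip"
    if name in known_flakes:
        # Flakes can go either way; never gate a merge on them.
        return "ok"
    if name in expected_fails:
        # A pass here is good news, but the expectations file must be
        # updated so the fix stays locked in.
        return "ok" if status == "fail" else "unexpected-pass"
    return "ok" if status == "pass" else "unexpected-fail"
```

Only the two "unexpected" outcomes block a merge; everything else matches the recorded expectations.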
By keeping those sets of expectations, we've been able to keep Mesa
pretty clear of regressions, whilst having a very clear set of things
that should be fixed to point to. It would be great if that set were
empty, but it just isn't. Having it is far better than the two
alternatives: either not testing at all (obviously bad), or having
the test always be red so it's always ignored (might as well not test at all).
Cheers,
Daniel