Hi

(Removing most of the context that got scrambled)

On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
> Yeah, this is what our experience with Mesa (in particular) has taught us.
>
> Having 100% of the tests pass 100% of the time on 100% of the platforms is a
> great goal that everyone should aim for. But it will also never happen.
>
> Firstly, we're just not there yet today. Every single GPU-side DRM driver
> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
> tests. Every single one. We deal with these in Mesa by retrying; if we
> didn't retry, across the breadth of hardware we test, I'd expect 99% of
> should-succeed merges to fail because of these intermittent bugs in the DRM
> drivers.

So the plan is only to ever test rendering devices? It should have been
made clearer then.

> We don't have the same figure for KMS - because we don't test it - but
> I'd be willing to bet no driver is 100% if you run tests often enough.

And I would still consider that a bug that we ought to fix, and
certainly not something we should sweep under the rug. If half the tests
are not running on a driver, then fine, they aren't. I'm not really
against having failing tests, I'm against not flagging unreliable tests
on a given hardware as failing tests.

> Secondly, we will never be there. If we could pause for five years and sit
> down making all the current usecases for all the current hardware on the
> current kernel run perfectly, we'd probably get there. But we can't: there's
> new hardware, new userspace, and hundreds of new kernel trees.

Not with that attitude :)

I'm not sure it's actually an argument, really. 10 years ago, we would
never have been at "every GPU on the market has an open-source driver"
here. 5 years ago, we would never have been at this-series-here. That
didn't stop anyone making progress, everyone involved in that thread
included.

> Even without the first two, what happens when the Arm SMMU maintainers
> (choosing a random target to pick on, sorry Robin) introduce subtle
> breakage which makes a lot of tests fail some of the time? Do we
> refuse to backmerge Linus into DRM until it's fixed, or do we disable
> all testing on Arm until it's fixed? When we've done that, what
> happens when we re-enable testing, and discover that a bunch of tests
> get broken because we haven't been testing?

I guess that's another thing that needs to be made clearer then. Do you
want to test Mesa, or the kernel?

For Mesa, I'd very much expect to rely on a stable kernel, and for the
kernel on a stable Mesa.

And if we're testing the kernel, then let's turn it the other way
around. How are we even supposed to detect those failures in the first
place if tests are flagged as unreliable?

No matter what we do here, what you describe will always happen. Like,
if we do flag those tests as unreliable, what exactly prevents another
issue from coming on top undetected, and what will happen when we
re-enable testing?

On top of that, you kind of hinted at that yourself, but what set of
tests will pass is a property linked to a single commit. Having that
list within the kernel already alters that: you'll need to merge a new
branch, add a bunch of fixes and then change the test list state. You
won't have the same tree you originally tested (and defined the test
state list for).

It might or might not be an issue for Linus' release, but I can
definitely see the trouble already for stable releases where fixes will
be backported, but the test state list certainly won't be updated.

> Thirdly, hardware is capricious. 'This board doesn't make it to u-boot' is a
> clear infrastructure error, but if you test at sufficient scale, cold solder
> or failing caps surface way more often than you might think.
> And you can't really pick those out by any other means than running at
> scale, dealing with non-binary results, and looking at the trends over
> time. (Again this is something we do in Mesa - we graph test failures per
> DUT, look for outliers, and pull DUTs out of the rotation when they're
> clearly defective. But that only works if you actually run enough tests on
> them in the first place to discover trends - if you stop at the first
> failed test, it's impossible to tell the difference between 'infuriatingly
> infrequent kernel/test bug?' and 'cracked main board maybe?'.)
>
> What we do know is that we _can_ classify tests four ways in expectations.
> Always-passing tests should always pass. Always-failing tests should always
> fail (and update the expectations if you make them pass). Flaking tests work
> often enough that they'll always pass if you run them a couple/few times,
> but fail often enough that you can't rely on them. Then you just skip tests
> which exhibit catastrophic failure i.e. local DoS which affects the whole
> test suite.
>
> By keeping those sets of expectations, we've been able to keep Mesa pretty
> clear of regressions, whilst having a very clear set of things that should
> be fixed to point to. It would be great if those set of things were zero,
> but it just isn't. Having that is far better than the two alternatives:
> either not testing at all (obviously bad), or having the test always be red
> so it's always ignored (might as well just not test).

Isn't that what happens with flaky tests anyway? Even more so since we
have 0 context when updating that list.

I've asked a couple of times, I'll ask again. In that other series, on
the MT8173, kms_hdmi_inject@inject-4k is set up as flaky (which is a KMS
test btw).

I'm a maintainer for that part of the kernel, I'd like to look into it,
because it's seriously something that shouldn't fail, ever, the hardware
isn't involved.

How can I figure out now (or worse, let's say in a year) how to
reproduce it? What kernel version was affected? With what board? After
how many occurrences?

Basically, how can I see that the bug is indeed there (or got fixed
since), and how to start fixing it? And then repeat for any other test
listed in there.

I got no other reply before because I very well know the answer: nobody
knows.

And that's a serious issue to me, because that effectively means that
the flaky test list will only ever increase (since we can't even check
that it's fixed, and the CI infrastructure won't check that it got fixed
either), and we won't be able to address any of the bugs listed there.

Maxime