Hi Maxime,

Hopefully less mangled formatting this time: it turns out Thunderbird + plain text is utterly unreadable, so that's one less MUA that's actually usable for sending mail to kernel lists without getting shouted at.

On Mon, 11 Sept 2023 at 15:46, Maxime Ripard <mripard@xxxxxxxxxx> wrote:
> On Mon, Sep 11, 2023 at 03:30:55PM +0200, Michel Dänzer wrote:
> > > In 6.6-rc1 there are around 240 reported flaky tests. None of them have
> > > any context. That new series has a few dozen too, without any context
> > > either. And there's no mention of that being a plan, or a patch
> > > adding a new policy for all tests going forward.
> >
> > That does sound bad, would need to be raised in review.
> >
> > > Any concern I raised was met with a giant "it worked on Mesa" handwave
> >
> > Lessons learned from years of experience with big real-world CI
> > systems like this are hardly "handwaving".
>
> Your (and others') experience certainly isn't. It is valuable, welcome,
> and very much appreciated.
>
> However, my questions and concerns being ignored time and time again
> about things like what the process is going to be, what is going
> to be tested, who is going to maintain that test list, how that
> interacts with stable, how we can possibly audit the flaky-tests list,
> etc. have felt like they were being handwaved away.

Sorry it ended up coming across like that. It wasn't the intent.

> I'm not saying that because I disagree; I still do on some points, but
> that's fine to some extent. However, most of these issues are not so
> much an infrastructure issue as a community issue. And I don't even
> expect a perfect solution right now, unlike what you seem to think. But
> I do expect some kind of plan instead of just ignoring the problem.
>
> Like, I had to ask the MT8173 question 3 times in order to get an
> answer, and I'm still not sure what is going to be done to address that
> particular issue.
>
> So, I'm sorry, but I certainly feel like it here.
I don't quite see the same picture from your side, though. For example, my reading of what you've said is that flaky tests are utterly unacceptable, as are partial runs, and we shouldn't pretend otherwise.

With your concrete example (which is really helpful, so thanks), what happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing until it's perfect, or does MT8173 testing always fail because that one test does? Both have their downsides. Not doing any testing has the obvious downside, and means the driver can keep getting worse in the meantime. Always marking the test as failed means the test results are useless: if failure is expected, then red is good.

I mean, say you're contributing a patch to fix some documentation, or to add a helper to common code which only v3d uses. The test results come back, and your branch is failing tests on MT8173, specifically the hdmi-inject@4k test. What then? Either as a senior contributor you 'know' that's always the case, or as a casual contributor you get told 'oh yeah, don't worry about the test results, they always fail'. Both lead to the same outcome, which is that no-one pays any attention to the results, and they get worse.

What we do agree on is that yes, those tests should absolutely be fixed, and not just swept under the rug. Part of this is having maintainers actually meaningfully own their test results. For example, I'm looking at the expectation lists for the Intel gen in my laptop, and I'm seeing a lot of breakage in blending tests, as well as dual-display failures which include the resolution of my external display. I'd expect the Intel driver maintainers to look at them, get them fixed, and gradually prune those xfails/flakes down towards zero. If the maintainers don't own it, though, then it's not going to get fixed. And we are exactly where we are today: broken plane blending and 1440p on KBL, broken EDID injection on MT8173, and broken atomic commits on stoney. Without stronger action from the maintainers (e.g.
throwing i915 out of the tree until it has a 100% pass rate 100% of the time), adding testing isn't making the situation better or worse in and of itself. What it _is_ doing, though, is giving really clear documentation of the status of each driver, and backing that up by verifying it.

Only maintainers can actually fix the drivers (or the tests, tbf). But doing the testing does let us be really clear to everyone what the actual state is, and that way people can make informed decisions too. And the only way we're going to drive the failure rate down is by the subsystem maintainers enforcing it.

Does that make sense as to where I (and I think a lot of others) are coming from?

To answer the other question, about 'where are the logs?': some of them have the failure data in them, others don't. They all should do going forward, at least.

Cheers,
Daniel