Re: Gating Fedora updates on Fedora CoreOS CI

Dusty Mabe <dusty@xxxxxxxxxxxxx> · Fri, 14 Feb 2025 16:31:13 -0500

On 2/14/25 5:07 AM, Pierre-Yves Chibon wrote:
> On Fri, Feb 14, 2025 at 10:34:51AM +0100, Clement Verna wrote:
>>    On Thu, 13 Feb 2025 at 17:44, Kevin Fenzi <kevin@xxxxxxxxx> wrote:
>>
>>      I agree with downthread folks that that seems like way too high a
>>      failure rate to enable gating on. However, a few questions if I can:
>>
>>    Yes the failure rate is quite high and most of these are real failures,
>>    that we deal with in Fedora CoreOS. So I am reading this like, because the
>>    tests are catching too many failures we should continue ignoring them 🫤
> 
> I think what is scaring people with the data you've provided is that we do not
> know which %/numbers of these failures are genuine failures that should gate the
> update because they are bugs vs infrastructure/pipeline issues.
> Would you have a way to distinguish between the two? Basically a failure vs
> error output.

I think what you bring up here is valid and I think in our next round of metrics
we will come up with a way to classify the failures so we can get a better idea.

However, I'd like to propose that we don't let this discourage us from moving forward.
You've raised concerns and we hear you. What I will say, though, is that we do monitor
these failures (hence the Matrix channel) and we do restart tests if we believe they
are failing due to flakes or issues on our side.

In other words, if the failure is believed to be on our side we try to resolve the issue
without package maintainers needing to do anything.

Now, will we always be looking at them in real time? No. However, I would propose that we
gate by default and try to give some time to determine the root cause before waiving.

> 
> The push-back I'm hearing is more toward: there are a lot of failures here and
> if they are all related to infrastructure issues then we're going to cause
> disruption without a clear benefit.

I'd like to push back slightly on the word "disruption" here. IMO disruption is more
applicable in the case where a test fails (keep in mind we are already running the tests
and reporting the results) and the update goes in anyway and causes issues in downstream
built artifacts. We (Fedora as a whole) were given bad results and the update went in anyway.

> Now if you're able to say: "95% of these errors are genuine bugs that today are
> impacting our users despite our pipeline having found them and 5% are
> infrastructure related", that's a different story :)

IMO the bar would only need to be that high if the maintainer had no way to ignore the test
results. All gating does here (IIUC) is require them to take an extra step before the update
flows into the next Rawhide compose.

Dusty
-- 
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx