On 2/17/25 1:32 PM, Adam Williamson wrote:
> On Mon, 2025-02-17 at 11:06 +0100, Clement Verna wrote:
>> On Sat, 15 Feb 2025 at 20:51, Adam Williamson <adamwill@xxxxxxxxxxxxxxxxx> wrote:
>>> On Sat, 2025-02-15 at 14:54 +0000, Zbigniew Jędrzejewski-Szmek wrote:
>>>> On Fri, Feb 14, 2025 at 02:40:29PM -0800, Adam Williamson wrote:
>>>>> On Fri, 2025-02-14 at 16:31 -0500, Dusty Mabe wrote:
>>>>>> IMO the bar would only need to be that high if the user had no way
>>>>>> to ignore the test results. All gating does here (IIUC) is require
>>>>>> them to do an extra step before it automatically flows into the
>>>>>> next rawhide compose.
>>>>>
>>>>> Again, technically, yes, but *please* let's not train people to have
>>>>> a Pavlovian reaction to waive failures; that is not the way.
>>>>
>>>> IMO, the bar for *gating* tests needs to be high. I think 95% true
>>>> positives would be a reasonable threshold.
>>>
>>> Do you mean 95% of failures must be 'real' (i.e. up to 5% can be
>>> 'false')? Is this after automatic retries and manual intervention by
>>> the test system maintainers, or before?
>>>
>>> Off the top of my head, 95% seems low. I'm pretty sure we do better
>>> than that with openQA, and people would complain if that was all we
>>> managed. We usually maintain a 0% false failure rate after auto-retries
>>> and <24h manual intervention -
>>> https://openqa.fedoraproject.org/group_overview/2?limit_builds=100&limit_builds=400
>>> has 0 false failures ATM.
>>
>> Thanks, that's interesting. What do you call <24h manual intervention?
>> One example that comes to mind would be to disable or snooze a test
>> that started to trigger false failures in under 24h. If that's the
>> case, I think that sounds achievable.
>
> We've done that very occasionally in dire emergencies (though what
> you'd actually want to do is disable *gating* on the test, not disable
> the test itself - this is an edit to the greenwave policy).
>
> Usually what it means is "rerun the test if it just flaked twice, or
> fix the problem if there's a specific problem causing the failure that
> is not a bug in the update itself". That could mean updating one of the
> openQA screenshots, for instance, or fixing a bug in the test logic, or
> working with releng/infra to fix a bug that's causing tests to fail,
> e.g. pagure not responding (and then rerunning all the tests that
> failed).
>
> If the failure is caused by a real bug in the update, we usually write
> up an explanation of the issue as a comment in Bodhi, or a bug report
> with a link from Bodhi.

What you are describing is exactly what we do. Either:

1. Fix the source of the false failure (i.e. the test needs updating).
2. Retry, because it was an infra or network flake (and possibly open a
   PR to make the test more robust).
3. Try to open an issue against the failed component and report negative
   karma in the bug, because we have high confidence it's an actual
   issue in the update.

It just happens that the messiness between 1. and 2. is something we
aren't currently accounting for in the metric Clement posted.
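To make that concrete, here's a toy sketch in Python (the test names
and helper are hypothetical, not our actual tooling) of how each
failure could be tagged with one of those three buckets, and how the
false failure rate looks before and after retries are taken into
account:

from dataclasses import dataclass
from enum import Enum

class Cause(Enum):
    TEST_BUG = 1   # bucket 1: the test itself needed updating
    FLAKE = 2      # bucket 2: infra or network flake
    REAL_BUG = 3   # bucket 3: actual issue in the update

@dataclass
class Failure:
    test_name: str
    cause: Cause
    passed_after_retry: bool = False

def false_failure_rate(failures, after_retries):
    """Fraction of failures that were not real bugs in the update.

    After retries, failures that passed on rerun drop out entirely,
    which is how a 0% rate is achievable once flakes are stamped out.
    """
    if after_retries:
        failures = [f for f in failures if not f.passed_after_retry]
    if not failures:
        return 0.0
    false = sum(1 for f in failures if f.cause is not Cause.REAL_BUG)
    return false / len(failures)

# Hypothetical sample data:
failures = [
    Failure("kola.basic", Cause.FLAKE, passed_after_retry=True),
    Failure("kola.ext.podman", Cause.TEST_BUG, passed_after_retry=True),
    Failure("kola.upgrade", Cause.REAL_BUG),
]
print(false_failure_rate(failures, after_retries=False))  # ~0.67 before
print(false_failure_rate(failures, after_retries=True))   # 0.0 after

The point being: the "before retries" number is the one that captures
the messiness between 1. and 2., and it's the number the current metric
doesn't show.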
Of course, once you stamp out the flakes and sources of false failures
and run the tests one final time and they pass, you can get your false
failure rate down to 0% :)

Dusty