Re: [libvirt PATCH 4/4] gitlab-ci: Introduce a new test 'integration' pipeline stage

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Wed, 2 Mar 2022 16:32:33 +0000

On Wed, Mar 02, 2022 at 05:18:26PM +0100, Erik Skultety wrote:
> On Wed, Mar 02, 2022 at 04:42:21PM +0100, Jiri Denemark wrote:
> > On Wed, Mar 02, 2022 at 09:43:24 +0000, Daniel P. Berrangé wrote:
> > > On Wed, Mar 02, 2022 at 08:42:30AM +0100, Erik Skultety wrote:
> > > > ...
> > > Ultimately when we switch to using merge requests, the integration tests
> > > should be run as a gating job, triggered from the merge train when the
> > > code gets applied to git, so that we prevent regressions actually making
> > > it into git master at all.
> > > 
> > > Post-merge integration testing always exhibits the problem that people will
> > > consider it somebody else's problem to fix the regression. So effectively
> > > whoever creates the integration testing system ends up burdened with the
> > > job of investigating failures and finding someone to poke to fix it. With
> > > it run pre-merge then whoever wants to get their code merged needs to
> > > investigate the problems. Now sometimes the problems with of course be
> > > with the integration test system itself, not the submitters code, but
> > > this is OK because it leads to situation where the job of maintaining
> > > the integration tests are more equitably spread across all involved and
> > > builds a mindset that functional / integration testing is a critical
> > > part of delivering code, which is something we've lacked for too long
> > > in libvirt.
> > 
> > This is all fine as long as there's a way to explicitly merge something
> > even if CI fails. I don't like waiting for a fix in something completely
> > unrelated. For example, a MR with translations or an important bug fix
> 
> I haven't studied GitLab enough to rule it out straight away, but I'd like to
> have some more intelligent hook (or whatnot) in place that could detect what
> type of change is proposed so that documentation changes and translations would
> not go through the whole pipeline and instead would be merged almost
> immediately (after passing e.g. the docs build).

Yep, for translations all we really need todo is trigger the msgmerge
commands to validate that printf format specifiers aren't messed up,
so that one is quite easy.

In theory we could do something similar for docs only changes, but
not sure how amenable our meson rules are to a
our meson rules dont

For docs our 'website' gitlab job only builds the HTML docs skipping
the code. So we could potentially filter that too

> An important bug fix however must go through the whole pipeline IMO whether we
> like it or not, that's the point, we need to ensure that we're not introducing
> a regression to the best of our capabilities. As for CI infrastructure issues,
> well, unfortunately that is an inevitable part of doing any automated tasks. By
> introducing it upstream it should therefore become a collective responsibility,
> so that if a CI issue occurs we need to work together to resolve it ASAP rather
> than going around it and pushing a fix just because it takes long to execute.

I think the idea of skipping an unreliable pipeline is the wrong
way to approach it. For something to be gating the expectation
has to be that it is reliable. If something is reliable and it
fails with a genuine problem, then it is collective responsibility
to drop everything and focus on fixing that problem, not skip it.

Sometimes this may cause a delay in merging stuff but that is
a good thing, as we've certainly had enough cases over the years
where obvious trivial fixes were pushed but were in fact broken.

If something breaks and the fix is going to be difficult or
out of our control, then we might have no choice but to disable
the test jobs completely. For this it should be a case of setting
an env var in the gitlab config to skip certain job names, which
we can unset once the infra is fixed. An example of this would
be where the Cirrus CI API changed recently and completely broke
cirrus-run and took a week to fix, and I unset the CIRRUS CI token
in the repo. 

In the case that some jobs are not consistently not reliable then
there are a few relevant actions to consider. For example we have
ultimately decided to make the all the bleeding edge distros be
non-gating because Rawhide/Sid/Tumbleweed regularly broke due to
their maintainers pushing broken packages.

If we have a test of our own that is periodically failing in a
non-determinstic way, then we should immediately disable that
test in the test suite, rather than skipping gating CI. The test
should only be re-enabled once its reliability is fixed. Tests
which are gating must be reliable without false positives.

In the absolute worst case, someone with admin permissions can
just toggle the 'pipelines must succeed' option in the gitlab
repo admin UI, but if it gets to that point we've really failed
badly.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|