On Wed, 2020-06-10 at 13:31 +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 10, 2020 at 01:14:51PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> > > Building artifacts in a separate pipeline stage also doesn't have any
> > > advantages, and only delays further stages by a couple of minutes.
> > > The only job that really makes sense in its own stage is the DCO
> > > check, because it's extremely fast (less than 1 minute) and, if that
> > > fails, we can avoid kicking off all other jobs.
> >
> > The advantage of using stages is that it makes it easy to see at a
> > glance where the pipeline was failing.

Ultimately you'll need to drill down to the actual failure anyway, so
the only situation in which stages would really provide value is if
for some reason *all* cross builds failed at once, which is not
something that happens frequently enough to optimize for.

> > > Reducing the number of stages results in significant speedups:
> > > specifically, going from three stages to two stages reduces the
> > > overall completion time for a full CI pipeline from ~45 minutes[1]
> > > to ~30 minutes[2].
> > >
> > > [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> > > [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173
> >
> > I don't think this time comparison is showing a genuine difference.
> >
> > If we look at the original staged pipeline, every single individual
> > job took much longer than every individual job in the simplified
> > pipeline. I think the difference in job times accounts for most
> > (possibly all) of the difference in overall pipeline time.
> >
> > If we look at the history of libvirt pipelines:
> >
> >   https://gitlab.com/libvirt/libvirt/pipelines
> >
> > the vast majority of the time we're completing in 30 minutes or
> > less already.

That was before introducing FreeBSD builds, which for whatever reason
take significantly longer: the last couple of jobs both took 50+
minutes. Installing packages is very inefficient, it would seem.

Either way, even looking at earlier pipelines, it seems clear that we
leave compute time on the table: for the last 10 pipelines before
adding FreeBSD, we have

  Longest job | Shortest job
  ------------+-------------
        21:20 |        12:12
        16:11 |        09:04
        21:31 |        13:40
        16:32 |        08:28
        14:53 |        08:16
        16:01 |        07:59
        16:17 |        08:40
        15:30 |        08:49
        15:12 |        09:11
        16:20 |        08:34

which means the pipeline is stalled for at least 5-8 minutes each
time. That's time that we could use to run builds, but instead we
just sit idle and wait. The difference becomes even bigger with
FreeBSD in the mix.

Even from a more semantic point of view, pipeline stages exist to
implement dependencies between jobs: a good example is our container
build jobs, which of course need to happen *before* the build job
that uses that container can start. There are no dependencies
whatsoever between native builds and cross builds.
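To illustrate what I have in mind, here's a rough sketch (job and
image names are made up for the example rather than taken from our
actual .gitlab-ci.yml) of how GitLab's needs: keyword can express
that relationship directly, letting a build job start as soon as the
one container job it depends on has finished instead of waiting for
a whole stage boundary:

  stages:
    - containers
    - builds

  x86_64-debian-10-container:
    stage: containers
    script:
      # build the CI container image and push it to the registry
      - docker build -t $CI_REGISTRY_IMAGE/ci-debian-10 ci/containers/debian-10
      - docker push $CI_REGISTRY_IMAGE/ci-debian-10

  x86_64-debian-10:
    stage: builds
    image: $CI_REGISTRY_IMAGE/ci-debian-10
    # start as soon as the matching container job has finished,
    # without waiting for the rest of the containers stage
    needs:
      - x86_64-debian-10-container
    script:
      - mkdir build && cd build
      - ../autogen.sh
      - make

A cross build job would list only its own container job under
needs:, so native and cross builds can run in parallel with no
artificial ordering between them.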
> > If you want to demonstrate a time improvement from these merged
> > stages, then run 20 pipelines over a couple of days and show
> > that they're consistently better than what we see already, and
> > not just a reflection of the CI infra load at a point in time.

I could do that, sure, it just seems like a waste of shared runner
CPU time...

> Also remember that we're using ccache, so slower builds may just be a
> reflection of the ccache having low hit rate - a sequence of repeated
> builds of the same branch should identify if that's the case.

I've been running builds pretty much non-stop over the past few days,
and since the cache is keyed off the job's name there should be no
significant skew caused by this.

-- 
Andrea Bolognani / Red Hat / Virtualization