Re: ELN build order (was: Re: OCaml 5.1 rebuild)

On Fri, Oct 6, 2023 at 11:16 AM Stephen Gallagher <sgallagh@xxxxxxxxxx> wrote:
...
> So, as we all know, build ordering is hard (and, despite intuitive
> belief, not actually deterministic).
>
> ELN actually "cheats" somewhat when we do our builds. When we process
> a batch of builds (triggered by a set of tag events that come in all
> at the same time, such as when a side-tag is merged), we create a new
> ELN side-tag, tag all of the new *Rawhide* builds into this side tag,
> then trigger a rebuild of all of those builds for ELN. The result here
> is that we use the Fedora build in the buildroot to avoid
> bootstrapping issues. Now, there are some special-case packages for
> which we do *NOT* automatically tag in the Fedora builds because they
> have known incompatibilities. All OCAML packages fall into this
> category, since we discovered about 9 months ago that we absolutely
> cannot mix Fedora's OCAML builds with ELN's (I don't recall the exact
> reason).
>
> Our other "cheat" when we rebuild a batch is that we automatically do
> rebuilds at least once for any failure, to account for things like
> build ordering and test flakes. In the case you're describing,
> unfortunately, we had a situation where 1) it was OCAML, and therefore
> the Fedora packages weren't in the buildroot and 2) ocaml built
> successfully against the older version of the macros. If the
> BuildRequires: had been in play there, it would have failed, the batch
> would have finished building whatever else was able to succeed (such
> as the macros) and then the second pass would have succeeded.
>
> I'm sorry you got hit by this, Richard. It's an unfortunate confluence
> of limitations in our rebuild approach.


I thought I'd responded here the other day with the link, but I forgot:
https://sgallagh.wordpress.com/2023/10/13/sausage-factory-fedora-eln-rebuild-strategy/

I've put together a blog post describing our ELN rebuild strategy,
which I'll copy below (but clarifications or additional content may be
added later, so consider that link the most up-to-date version).


---

# Fedora ELN Rebuild Strategy: The Rebuild Algorithm (2023 Edition)

## Slow and Steady Wins the Race

The Fedora ELN SIG maintains a tool called ELNBuildSync[1] (or EBS)
which is responsible for monitoring traffic on the Fedora Messaging
Bus and listening for Koji tagging events. When a package is tagged
into Rawhide (meaning it has passed Fedora QA Gating and is headed to
the official repositories), EBS checks whether it’s on the list of
packages targeted for Fedora ELN or ELN Extras and enqueues it for the
next batch of builds.

A batch begins when there are one or more enqueued builds and at least
five wallclock seconds have passed since a build has been enqueued.
This allows EBS to capture events such as a complete side-tag being
merged into Rawhide at once; it will always rebuild those together in
a batch. Once a batch begins, all other messages are enqueued for the
following batch. When the current batch is complete, a new batch will
begin.
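The batching rule above can be sketched roughly as follows. This is only an illustration, not code from ELNBuildSync: the `BatchQueue` class and its method names are hypothetical, it assumes the five seconds are measured from the most recently enqueued build, and it uses a monotonic clock as a stand-in for wallclock time.

```python
import time
from dataclasses import dataclass, field

# "at least five wallclock seconds" from the description above
QUIET_PERIOD_SECONDS = 5.0


@dataclass
class BatchQueue:
    """Hypothetical queue used for illustration; not the real EBS class."""

    pending: list = field(default_factory=list)
    last_enqueued: float = 0.0
    batch_running: bool = False

    def enqueue(self, build, now=None):
        # Messages arriving while a batch runs simply wait for the next one.
        self.pending.append(build)
        self.last_enqueued = time.monotonic() if now is None else now

    def try_start_batch(self, now=None):
        """Return the builds for a new batch, or None if it is too early."""
        now = time.monotonic() if now is None else now
        if self.batch_running or not self.pending:
            return None
        # Assumption: the quiet period restarts with every enqueued build,
        # so a whole side-tag merge lands in a single batch.
        if now - self.last_enqueued < QUIET_PERIOD_SECONDS:
            return None
        batch, self.pending = self.pending, []
        self.batch_running = True
        return batch

    def finish_batch(self):
        self.batch_running = False
```

The `now` parameter exists only so the timing can be exercised without sleeping; a real service would presumably call `try_start_batch()` from its event loop.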

The first thing that is done when processing a batch is to create a
new side-tag derived from the ELN buildroot. Into this new target, EBS
will tag most of the Rawhide builds. It will then wait until Koji has
regenerated the buildroot for the batch tag before triggering the
rebuild of the batched packages. This strategy avoids most of the
ordering issues (particularly bootstrap loops) inherent in rebuilding
a side-tag, because we can rely on the Rawhide builds having already
succeeded.

Once the rebuild is ready to begin, EBS interrogates Koji for the
original git commit used to build each Rawhide package (in case git
has seen subsequent, unbuilt changes). The builds are then triggered
in the side tag concurrently. EBS monitors these builds for
completion. If one or more builds in a batch fail, EBS will re-queue
them for another rebuild attempt. This repeats until the same set of
failures occurs twice in a row. Once all of the rebuild attempts have
concluded, EBS tags all successful builds back to ELN and removes the
side tag. Then it moves on to preparing another batch, if there are
packages waiting.
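The retry rule ("until the same set of failures occurs twice in a row") might look something like the sketch below. The function name and the `build_all` callable are hypothetical stand-ins for the Koji interaction, and the sketch assumes that builds which succeed in one attempt are visible in the side-tag buildroot for the next attempt.

```python
def rebuild_until_stable(packages, build_all):
    """Retry failed builds until the failure set stops changing.

    `build_all` is a hypothetical callable that builds the given packages
    concurrently and returns the subset that failed.
    Returns (succeeded, failed) as two sets.
    """
    to_build = set(packages)
    succeeded = set()
    previous_failures = None
    failures = set()

    while to_build:
        failures = set(build_all(to_build))
        succeeded |= to_build - failures
        # Same failures twice in a row: another pass will not help.
        if failures == previous_failures:
            break
        previous_failures = failures
        to_build = failures

    return succeeded, failures
```

A package that only fails because of build ordering succeeds on a later pass, while a genuinely broken package is attempted one extra time and then reported as failed.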

## History

In its first incarnation, ELNBuildSync (at the time known as
DistroBuildSync) was very simplistic. It listened for tag events on
Rawhide, checked them against its list and then triggered a build in
the ELN target. Very quickly, the ELN SIG realized that this had
significant limitations, particularly in the case of packages building
in side-tags (which was becoming more common as the era of on-demand
side-tags began). One of the main benefits of side-tags is the ability
to rebuild packages that depend on one another in the proper order;
this was lost in the BuildSync process and many times builds were
happening out of order, resulting in packages with the same NVR as
Rawhide but incorrectly built against older versions of their
dependencies.

Initially, the ELN SIG tried to design a way to exactly mirror the
build process in the side-tags, but that resulted in its own new set
of problems. First of all, it would be very slow; the only way to
guarantee that side-tags are built against the same version of their
dependencies as the Rawhide version would be to perform all of those
builds serially. Secondly, even determining the order of operations in
a side-tag after it already happened turned out to be prohibitively
difficult.

Instead, the ELN SIG recognized that the Fedora Rawhide packagers had
already done the hardest part. Rather than trying to replicate their
work in an overly complicated manner, the tool would just take
advantage of the existing builds. Now, prior to triggering a build for
ELN, the tool would first tag the current Rawhide builds into ELN and
wait for them to be added to the Koji buildroot. This solved about 90%
of the problems in a generic manner without engineering an excessively
complicated side-tag approach. Naturally, it wasn’t a perfect
solution, but it got a lot further. (See “Why are some packages not
tagged into the batch side-tag?” below for more details.)

The most recent modification to this strategy came about as CentOS
Stream 10 started to come into the picture. With the intent to
bootstrap CS 10 initially from ELN, tagging Rawhide packages to the
ELN tag suddenly became a problem, as CS 10 needs to use that tag
event as its trigger. The solution here was not to tag Rawhide builds
into Fedora ELN directly, but instead to create a new ELN side-tag
target where we could tag them, build the ELN packages there and then
tag the successful builds into ELN. As a result, CS 10 builds are only
triggered on ELN successes.

## Frequently Asked Questions

### Why does it sometimes take a long time for my package to be rebuilt?

Not all batches are created equal. Sometimes, there will be an ongoing
batch with one or more packages whose build takes a very long time to
complete (e.g. gcc, firefox, chromium). This can lead to up to a
day’s lag in even getting enqueued. Even if your package was part of
the same batch, it will still wait for all packages in the batch to
complete before the tag occurs.

### Why do batches not run in parallel?

Simply put, until the previous batch is complete, there’s no way to
know if a further batch relies on one or more changes from the
previous batch. This is a problem we’re hoping might have a solution
down the line, if it becomes possible to create “nested” side-tags
(side-tags derived from another side-tag instead of a base tag). Today
however, serialization is the only safe approach.

### Why are some packages not tagged into the batch side-tag?

Some packages have known incompatibilities, such as libllvm and OCAML.
The libraries produced in the ELN build and Rawhide build are API or
ABI incompatible and therefore cannot be tagged in safely. We have to
rely on the previous ELN version of the build in the buildroot.
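As a rough illustration of that filtering step, the sketch below splits a batch of Rawhide builds into those that would be tagged into the side-tag and those that are held back. The exclusion rule shown (the `ocaml` name and prefix plus a small set of exact names) is an assumption made for the example; the real list and matching logic live in the EBS configuration and may differ.

```python
# Hypothetical exclusion rules, for illustration only.
EXCLUDED_NAMES = {"ocaml", "llvm"}
EXCLUDED_PREFIXES = ("ocaml-",)


def package_name(nvr):
    """Extract the name from a name-version-release string."""
    return nvr.rsplit("-", 2)[0]


def split_for_side_tag(rawhide_nvrs):
    """Return (tag_in, held_back) lists for a batch of Rawhide NVRs."""
    tag_in, held_back = [], []
    for nvr in rawhide_nvrs:
        name = package_name(nvr)
        if name in EXCLUDED_NAMES or name.startswith(EXCLUDED_PREFIXES):
            # ELN rebuilds of these rely on the previous ELN build instead.
            held_back.append(nvr)
        else:
            tag_in.append(nvr)
    return tag_in, held_back
```

Held-back packages are still rebuilt for ELN; only their Rawhide builds stay out of the batch buildroot.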

### Why do you not tag successes back into ELN immediately?

Not all ELN packages are built by the auto-rebuilder. Several are
maintained individually for various reasons (the kernel, ceph,
crypto-policies, etc.). We don’t want to tag a partial batch in out of
concern that this could break these other builds.


[1] Technically, the repository is called DistroBuildSync because
originally it was meant to serve multiple purposes of rebuilding ELN
from Rawhide and also syncing builds for CentOS Stream and RHEL.
However, the latter two ended up forking off very significantly, so we
renamed ours to ELNBuildSync to reduce confusion between them. It
unfortunately retains the old name for the repo at the moment due to
deployment-related reasons.