On Fri, Oct 6, 2023 at 11:16 AM Stephen Gallagher <sgallagh@xxxxxxxxxx> wrote:
...
> So, as we all know, build ordering is hard (and, despite intuitive
> belief, not actually deterministic).
>
> ELN actually "cheats" somewhat when we do our builds. When we process
> a batch of builds (triggered by a set of tag events that come in all
> at the same time, such as when a side-tag is merged), we create a new
> ELN side-tag, tag all of the new *Rawhide* builds into this side tag,
> then trigger a rebuild of all of those builds for ELN. The result here
> is that we use the Fedora build in the buildroot to avoid
> bootstrapping issues. Now, there are some special-case packages for
> which we do *NOT* automatically tag in the Fedora builds because they
> have known incompatibilities. All OCAML packages fall into this
> category, since we discovered about 9 months ago that we absolutely
> cannot mix Fedora's OCAML builds with ELN's (I don't recall the exact
> reason).
>
> Our other "cheat" when we rebuild a batch is that we automatically do
> rebuilds at least once for any failure, to account for things like
> build ordering and test flakes. In the case you're describing,
> unfortunately, we had a situation where 1) it was OCAML, and therefore
> the Fedora packages weren't in the buildroot and 2) ocaml built
> successfully against the older version of the macros. If the
> BuildRequires: had been in play there, it would have failed, the batch
> would have finished building whatever else was able to succeed (such
> as the macros) and then the second pass would have succeeded.
>
> I'm sorry you got hit by this, Richard. It's an unfortunate confluence
> of limitations in our rebuild approach.

I thought I'd responded here the other day with the link, but I forgot:
https://sgallagh.wordpress.com/2023/10/13/sausage-factory-fedora-eln-rebuild-strategy/

I've put together a blog post describing our ELN rebuild strategy, which I'll copy below (but clarifications or additional content may be added later, so consider that link the most up-to-date version).

---

# Fedora ELN Rebuild Strategy: The Rebuild Algorithm (2023 Edition)

## Slow and Steady Wins the Race

The Fedora ELN SIG maintains a tool called ELNBuildSync[1] (or EBS) which is responsible for monitoring traffic on the Fedora Messaging Bus and listening for Koji tagging events. When a package is tagged into Rawhide (meaning it has passed Fedora QA Gating and is headed to the official repositories), EBS checks whether it’s on the list of packages targeted for Fedora ELN or ELN Extras and enqueues it for the next batch of builds.

A batch begins when there are one or more enqueued builds and at least five wallclock seconds have passed since a build has been enqueued. This allows EBS to capture events such as a complete side-tag being merged into Rawhide at once; it will always rebuild those together in a batch. Once a batch begins, all other messages are enqueued for the following batch. When the current batch is complete, a new batch will begin.

The first thing that is done when processing a batch is to create a new side-tag derived from the ELN buildroot. Into this new target, EBS will tag most of the Rawhide builds. It will then wait until Koji has regenerated the buildroot for the batch tag before triggering the rebuild of the batched packages. This strategy avoids most of the ordering issues (particularly bootstrap loops) inherent in rebuilding a side-tag, because we can rely on the Rawhide builds having already succeeded.

Once the rebuild is ready to begin, EBS interrogates Koji for the original git commit used to build each Rawhide package (in case git has seen subsequent, unbuilt changes).
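
As an aside, the batching rule described above (a batch starts once something is enqueued and five seconds have passed, and batches never overlap) could be sketched roughly as follows. This is only an illustration, not the ELNBuildSync code: the `BatchQueue` class, its method names, and the package strings are all hypothetical, timestamps are passed in by the caller so the logic is easy to follow, and I am assuming the five seconds are counted from the most recent enqueue.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# From the post: "at least five wallclock seconds have passed".
QUIET_PERIOD_SECONDS = 5.0


@dataclass
class BatchQueue:
    """Hypothetical stand-in for the EBS queue; not the real implementation."""

    pending: List[str] = field(default_factory=list)
    last_enqueue: Optional[float] = None
    batch_running: bool = False

    def enqueue(self, build: str, now: float) -> None:
        # Assumption: every tag event resets the quiet-period timer, which is
        # what lets a whole merged side-tag land in a single batch.
        self.pending.append(build)
        self.last_enqueue = now

    def try_start_batch(self, now: float) -> Optional[List[str]]:
        # Batches are serialized: nothing starts while one is in flight.
        if self.batch_running or not self.pending:
            return None
        # Wait until the queue has been quiet for the full period.
        if now - self.last_enqueue < QUIET_PERIOD_SECONDS:
            return None
        # Hand over everything enqueued so far; later arrivals go to the
        # following batch.
        batch, self.pending = self.pending, []
        self.batch_running = True
        return batch

    def finish_batch(self) -> None:
        # Called once the current batch is complete, so the next may begin.
        self.batch_running = False
```

In a real service the `now` argument would presumably come from a clock such as `time.monotonic()`, with `try_start_batch` polled periodically and `finish_batch` called after the batch has been tagged back.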
The builds are then triggered in the side tag concurrently. EBS monitors these builds for completion. If one or more builds in a batch fail, EBS will re-queue them for another rebuild attempt. This repeats until the same set of failures occurs twice in a row. Once all of the rebuild attempts have concluded, EBS tags all successful builds back to ELN and removes the side tag. Then it moves on to preparing another batch, if there are packages waiting.

## History

In its first incarnation, ELNBuildSync (at the time known as DistroBuildSync) was very simplistic. It listened for tag events on Rawhide, checked them against its list and then triggered a build in the ELN target.

Very quickly, the ELN SIG realized that this had significant limitations, particularly in the case of packages building in side-tags (which was becoming more common as the era of on-demand side-tags began). One of the main benefits of side-tags is the ability to rebuild packages that depend on one another in the proper order; this was lost in the BuildSync process and many times builds were happening out of order, resulting in packages with the same NVR as Rawhide but incorrectly built against older versions of their dependencies.

Initially, the ELN SIG tried to design a way to exactly mirror the build process in the side-tags, but that resulted in its own new set of problems. First of all, it would be very slow; the only way to guarantee that side-tags are built against the same version of their dependencies as the Rawhide version would be to perform all of those builds serially. Secondly, even determining the order of operations in a side-tag after it already happened turned out to be prohibitively difficult.

Instead, the ELN SIG recognized that the Fedora Rawhide packagers had already done the hardest part. Rather than trying to replicate their work in an overly complicated manner, the tool would just take advantage of the existing builds.
Now, prior to triggering a build for ELN, the tool would first tag the current Rawhide builds into ELN and wait for them to be added to the Koji buildroot. This solved about 90% of the problems in a generic manner without engineering an excessively complicated side-tag approach. Naturally, it wasn’t a perfect solution, but it got a lot further. (See “Why are some packages not tagged into the batch side-tag?” below for more details.)

The most recent modification to this strategy came about as CentOS Stream 10 started to come into the picture. With the intent to bootstrap CS 10 initially from ELN, tagging Rawhide packages to the ELN tag suddenly became a problem, as CS 10 needs to use that tag event as its trigger. The solution here was not to tag Rawhide builds into Fedora ELN directly, but instead to create a new ELN side-tag target where we could tag them, build the ELN packages there and then tag the successful builds into ELN. As a result, CS 10 builds are only triggered on ELN successes.

## Frequently Asked Questions

### Why does it sometimes take a long time for my package to be rebuilt?

Not all batches are created equal. Sometimes, there will be an ongoing batch with one or more packages whose build takes a very long time to complete (e.g. gcc, firefox, chromium). This can lead to up to a day’s lag in even getting enqueued. Even if your package was part of the same batch, it will still wait for all packages in the batch to complete before the tag occurs.

### Why do batches not run in parallel?

Simply put, until the previous batch is complete, there’s no way to know if a further batch relies on one or more changes from the previous batch. This is a problem we’re hoping might have a solution down the line, if it becomes possible to create “nested” side-tags (side-tags derived from another side-tag instead of a base tag). Today, however, serialization is the only safe approach.

### Why are some packages not tagged into the batch side-tag?

Some packages have known incompatibilities, such as libllvm and OCAML. The libraries produced in the ELN build and the Rawhide build are API- or ABI-incompatible and therefore cannot be tagged in safely. We have to rely on the previous ELN version of the build in the buildroot.

### Why do you not tag successes back into ELN immediately?

Not all ELN packages are built by the auto-rebuilder. Several are maintained individually for various reasons (the kernel, ceph, crypto-policies, etc.). We don’t want to tag a partial batch in out of concern that this could break these other builds.

[1] Technically, the repository is called DistroBuildSync because originally it was meant to serve multiple purposes of rebuilding ELN from Rawhide and also syncing builds for CentOS Stream and RHEL. However, the latter two ended up forking off very significantly, so we renamed ours to ELNBuildSync to reduce confusion between them. It unfortunately retains the old name for the repo at the moment due to deployment-related reasons.