Re: s390x KOJI builders issue

In many cases, the build is killed during compilation itself.
I would understand it if the build consistently failed somewhere in the
testsuite with OOM errors, but it's weirder than that.

Until now, I didn't have this issue. Why now?

The tests are still important.
Over the years I have taken several steps to reduce the resource usage
of the testsuite.
The most significant one is that I run the full testsuite only once or
a few times in scratch builds, and when I don't find any issues worth
investigating, I switch the testsuite to a minimal mode for every
other build of the same minor version.
So e.g. mass rebuilds, which only bump patch numbers in the NVR, run
only the 'main' suite, as do other small patches during the life
of that particular upstream release.

The issue in general is:
The majority of packages are small and quick to build.
Then there is a minority of insanely huge projects whose thirst for
resources can never be quenched. :)

Could we somehow identify the huge packages, mark them in a special
way, and have Koji give such marked packages many more resources when
it picks them up?
At the same time, the average amount of resources given could be
lowered to only what most packages need.
I believe everyone could benefit from this.
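
As a rough sketch of what I mean (this is not an existing Koji feature or
API; the class names, the resource numbers and the package list are all
made-up assumptions for illustration), the routing could look like this
in Python:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuilderClass:
    """Hypothetical description of a group of builders."""
    name: str
    memory_gb: int
    cpus: int


# Illustrative numbers only, not real builder configuration.
DEFAULT = BuilderClass("default", memory_gb=4, cpus=2)
HEAVY = BuilderClass("heavy", memory_gb=32, cpus=8)

# Hypothetical, hand-maintained set of packages marked as huge.
HEAVY_PACKAGES = frozenset({"mariadb", "chromium"})


def pick_builder_class(package: str) -> BuilderClass:
    """Route marked packages to the heavy builders, everything else to default."""
    return HEAVY if package in HEAVY_PACKAGES else DEFAULT
```

With this, pick_builder_class("mariadb") would select the heavy class, while
an unmarked package such as "sed" would stay on the default one.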

Michal
--

Michal Schorm
Software Engineer
Core Services - Databases Team
Red Hat

--

On Thu, Mar 3, 2022 at 1:05 AM Kevin Fenzi <kevin@xxxxxxxxx> wrote:
>
> On Wed, Mar 02, 2022 at 03:54:32PM +0100, Florian Weimer wrote:
> > * Michael Catanzaro:
> >
> > > On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák <dan@xxxxxxxx>
> > > wrote:
> > >> those are weird, the build tasks have been restarted many times by the
> > >> builder daemon, after something crashed there (OOM?) ...
> > >
> > > This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> > > investigated and discovered that the builds kept hitting an OOM
> > > condition and then restarting, which triggered an infinite loop. Each
> > > build would work for 3-5 hours before failing, then it would start
> > > over, then again, then again....
> > >
> > > I think some configuration changed recently on the builders, because I
> > > had never seen this happen before last month. If a build hits OOM, it
> > > really needs to fail immediately. It should not restart, because it's
> > > likely to fail again the same way. My builds had restarted four or
> > > five times before Kevin manually handled them.
> >
> > Maybe Koji restarts the build because the builder has rebooted?
>
> Nope.
>
> What happens is:
>
> * 10: Build is taken by builder and starts building.
> * Build takes up more than 90% of memory+swap
> * OOM killer looks and says... oh hey, I need to kill something. This
> kojid process/slice is taking up all the memory.
> * kojid is killed.
> * kojid is restarted (we have it set to restart in unit)
> * builder checks into hub
> * hub says, hey you are doing task XXXXX right?
> * builder says... oh, yes, let me start that.
> * goto 10
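
To illustrate why this never terminates on its own, here is a minimal
Python sketch of the loop above. It is not Koji code: simulate(),
max_attempts and observe_attempts are names made up for this example, and
the memory check is simplified to a plain comparison instead of the 90%
memory+swap threshold.

```python
def simulate(peak_gb, builder_gb, max_attempts=None, observe_attempts=5):
    """Simulate a task being handed back to a restarted builder.

    Returns (outcome, attempts). Without a cap, a build that never fits
    would loop forever, so we stop watching after `observe_attempts`.
    """
    attempts = 0
    while attempts < observe_attempts:
        attempts += 1  # builder takes the task and starts building
        if peak_gb <= builder_gb:
            return ("completed", attempts)
        # Build exceeds memory: kojid is killed and restarted, and the
        # hub hands the same task back to the builder.
        if max_attempts is not None and attempts >= max_attempts:
            # Hypothetical cap: give up instead of restarting again.
            return ("failed", attempts)
    return ("still looping", attempts)
```

Without a cap, simulate(12, 10) keeps going for as long as we watch it,
while simulate(12, 10, max_attempts=1) fails after the first attempt, which
is the "fail immediately" behaviour asked for earlier in the thread.
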
>
> So in this case it seems like it's the tests that are causing this.
> The s390x kvm builders have 2 CPUs and 10 GB of memory.
>
> So, is there any way to decrease memory usage there?
> I see the tests have -parallel=auto; perhaps that could be set to 1 or 2?
>
> Perhaps there's some way to adjust the oom killer to kill the build
> instead of kojid? I would prefer that because then the build would
> quickly fail and you could see it was killed and need to reduce memory
> consumption somehow.
>
> I suppose we could look at reducing the number of builders and
> increasing memory on fewer of them, but it's hard to know what the right
> value is there. It's definitely better for mass rebuilds to have more
> smaller builders.
>
> kevin
> _______________________________________________
> devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure