Re: drop bfq scheduler, instead use mq-deadline across the board

Ankur Sinha <sanjay.ankur@xxxxxxxxx> · Tue, 30 Jun 2020 19:28:53 +0100

On Tue, Jun 30, 2020 17:23:16 +0000, Zbigniew Jędrzejewski-Szmek wrote:
> On Tue, Jun 30, 2020 at 04:25:23PM +0100, Ankur Sinha wrote:
> > On Mon, Jun 29, 2020 15:01:24 -0600, Chris Murphy wrote:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1851783
> > > 
> > > The main argument is that for typical and varied workloads in Fedora,
> > > mostly on consumer hardware, we should use mq-deadline scheduler
> > > rather than either none or bfq.
> > > 
> > > It may be true most folks with NVMe won't see anything bad with none,
> > > but those who have heavier IO workloads are likely to be better off
> > > with mq-deadline.
> > > 
> > > Further details are in the bug, but let's discuss it on list. Thanks!
> > 
> > There was this thread about our systems hanging, and the workaround was
> > to revert to mq-deadline from bfq:
> > 
> > https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/thread/MJJFT5AOYUFZ3SO2EDVLJSDAZMZI4HAP/#DA7RCQFIAD4Z3Q7HQBW2ELPTLPYDKJMT
> 
> To clarify: you could reliably reproduce the issue when building steps in mock.
> Did you verify that it is reliably fixed simply by switching bfq→mq-deadline?

Yes, that was the first change I had made and it had stopped the
hanging. As a permanent fix, though, I switched to using isolation =
simple in mock, and since that works, I've not changed it since.
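
For anyone who wants to try the same switch, here is a minimal sketch of reading and changing the active scheduler through sysfs. It assumes the usual /sys/block/<device>/queue/scheduler file, where the active scheduler is shown in square brackets. The device name "sda" and the helper names are only illustrative. Writing the file needs root and does not persist across reboots.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse the contents of a queue/scheduler file.

    The kernel lists the available schedulers separated by spaces and
    marks the active one with square brackets, e.g.
    "mq-deadline kyber [bfq] none".
    Returns (active, available).
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def scheduler_path(device):
    # Hypothetical helper: builds the sysfs path for a block device name.
    return Path("/sys/block") / device / "queue" / "scheduler"


def read_scheduler(device):
    """Return (active, available) for a block device such as "sda"."""
    return parse_schedulers(scheduler_path(device).read_text())


def set_scheduler(device, name):
    """Switch the scheduler for one device (needs root, not persistent)."""
    _, available = read_scheduler(device)
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    scheduler_path(device).write_text(name)
```

Calling read_scheduler("sda") before and after set_scheduler("sda", "mq-deadline") shows whether the change took effect. A udev rule or kernel command-line option would be needed to make it permanent.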

(I make it a point to provide the needed information for bugs, but this
release my quota is currently being used up on getting Docker + minikube
to work on F32 for $dayjob)

> > There are a few threads on AskFedora about systems hanging. They're not
> > the easiest to debug but we did suggest people try switching to
> > mq-deadline to see if it helps:
> > 
> > https://ask.fedoraproject.org/t/whole-os-freezes-watching-a-video-with-mpv/6770/10
> > 
> > I don't know enough about this to say if it's a bug and if it has been
> > fixed.
> 
> There's a lot of noise in those bug reports. For heisenbugs, the fact
> that something was an issue and after a flurry of half-random changes
> to the system isn't, does not allow us to conclude _anything_. We need
> somebody who understands what they are doing to isolate the issue. In
> particular, if this is a kernel hang, then we need a proper traceback
> from the kernel, and not just assume it's the scheduler.

There is a kernel trace in the related bug that was cited there:
https://bugzilla.redhat.com/show_bug.cgi?id=1767097#c7

which links to another bfq bug here that's currently needinfo:
https://bugzilla.redhat.com/show_bug.cgi?id=1767539

> (In particular, if this is a race condition, changing the scheduler
> could be just making the condition less likely because the system is
> slower or faster or just schedules processes in a different order,
> without the scheduler being relevant to the bug).

Like I said, I don't know. I'm a fairly advanced Linux user but you can
hardly expect me to also be a kernel hacker.  :)

For kernel bugs, I'd strongly suggest giving reporters step-by-step
instructions for, or links on, using a "serial console" or a "netconsole".
These are not part of my working vocabulary (I cannot speak for others).
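
As a starting point for such instructions, the sketch below builds a netconsole kernel parameter. It assumes the documented format netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper name and the addresses in the usage note are made up for illustration, and the defaults of 6665 and 6666 are the ports I believe the kernel uses when none are given.

```python
def netconsole_param(target_ip, dev="", target_mac="", src_ip="",
                     src_port=6665, target_port=6666):
    """Build a netconsole= kernel parameter string.

    Only target_ip is required. Empty optional fields are left blank,
    which the documented format allows. This is a hypothetical helper,
    not part of any existing tool.
    """
    if not target_ip:
        raise ValueError("target_ip is required")
    return (
        f"netconsole={src_port}@{src_ip}/{dev},"
        f"{target_port}@{target_ip}/{target_mac}"
    )
```

For example, netconsole_param("10.0.0.2", dev="eth1", src_ip="10.0.0.1") gives a string that could be appended to the kernel command line on the hanging machine, while a second machine at 10.0.0.2 listens for UDP packets on port 6666 to capture the trace.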

-- 
Thanks,
Regards,
Ankur Sinha "FranciscoD" (He / Him / His) | https://fedoraproject.org/wiki/User:Ankursinha
Time zone: Europe/London
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx