Re: user space unresponsive, followup: lsf/mm congestion

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Wed, 8 Jan 2020 14:14:22 -0700

On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> > On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> [...]
> > > Btw. from a quick look at the sysrq output there seems to be quite a lot
> > > of tasks (more than 1k) running on the system. Only handful of them
> > > belong to the compilation. kswapd is busy and 13 processes in direct
> > > reclaim all swapping out to the disk.
> >
> > There might be many dozens of tabs in Firefox with nothing loaded in
> > them, trying to keep the testing more real world (a compile while
> > browsing) rather than being too deferential to the compile. That does
> > clutter the sysrq+t but it doesn't change the outcome of the central
> > culprit which is the ninja compile, which by default does n+2 jobs
> > where n is the number of virtual CPUs.
>
> How much memory does the compile process eat?

By default it sets jobs to numcpus+2, which is 10 here. But each job
may have two processes, and each process's memory requirement varies
a lot, from a few MB to over 1 GB. In the first 20 minutes, about
13000 processes started and exited.

I've updated the bug, attaching kernel messages and /proc/vmstat
captured at 1s intervals, although quite often during the build
several seconds of samples were simply skipped because the system was
under too much pressure.
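
For anyone wanting to reproduce that kind of capture, here is a rough
sketch of a 1s sampler. It is not the script I used; the counter names
chosen (pswpin, pswpout, pgmajfault) and the "late sample" rule are
assumptions made for illustration. It prints per-interval deltas and
notes when a sample arrived late, which is how skipped seconds show up.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat content ("name value" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(prev, curr, names=("pswpin", "pswpout", "pgmajfault")):
    """Return how much each selected counter grew between two samples."""
    return {n: curr.get(n, 0) - prev.get(n, 0) for n in names}


def sample_loop(path="/proc/vmstat", interval=1.0, count=5):
    """Print deltas once per interval and flag samples that arrived late."""
    with open(path) as f:
        prev = parse_vmstat(f.read())
    last = time.monotonic()
    for _ in range(count):
        time.sleep(interval)
        now = time.monotonic()
        with open(path) as f:
            curr = parse_vmstat(f.read())
        # Assumption: "late" means at least one whole extra interval elapsed.
        late = (now - last) - interval
        note = f"(sample late by {late:.1f}s)" if late > interval else ""
        print(counter_deltas(prev, curr), note)
        prev, last = curr, now


if __name__ == "__main__":
    sample_loop()
```

Under heavy reclaim the sampler itself gets stalled, so the gaps are
part of the data rather than a bug in the capture.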

> If you know that the compilation process is too disruptive wrt.
> memory/cpu consumption then you can use cgroups (memory and cpu
> controllers) to throttle that consumption and protect the rest of the
> system. The compilation process will take much more time of course and
> the explicit configuration is obviously less comfortable than out of the
> box auto configuration but the kernel simply doesn't have information to
> prioritize resources.

Yes, but this doesn't scale to regular users who just follow an
upstream project's build instructions.
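
To make concrete what that explicit configuration asks of a user, here
is a minimal sketch that wraps a build command in a transient systemd
scope with memory and CPU limits. The helper name and the 8G / 400%
values are made up for illustration, and whether the memory and cpu
controllers are delegated to the user session depends on the
distribution and on cgroup v2 being in use.

```python
import subprocess


def throttled_build_argv(build_cmd, memory_max="8G", cpu_quota="400%"):
    """Build an argv that runs build_cmd in a resource-limited scope.

    memory_max and cpu_quota are illustrative defaults, not recommendations.
    """
    return [
        "systemd-run", "--user", "--scope",
        "-p", f"MemoryMax={memory_max}",
        "-p", f"CPUQuota={cpu_quota}",
        *build_cmd,
    ]


if __name__ == "__main__":
    # Example: run ninja with the limits above.
    subprocess.run(throttled_build_argv(["ninja"]), check=False)
```

Someone still has to know which numbers to pick, which is exactly what
upstream build instructions never tell you.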

> I do agree that the oom detection could be improved to detect a heavy
> threshing - be it on page cache or swapin/out - and kill something
> rather than leave the system struggling in a highly unproductive state.
> This is far from trivial because what is productive is not something
> kernel can tell easily as it depends on the workload. As mentioned
> elsewhere userspace is likely much better suited to define that policy
> and PSI seems to be a good indicator.

And even user space doesn't know in advance what resources are
required. All the user can do is guess that the job count was
estimated incorrectly, force a power off, and start over with a lower
number of jobs or whatever.
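
The manual workaround amounts to something like the sketch below: cap
the job count by available memory instead of by CPU count alone. The
1 GiB per-process figure and the two-processes-per-job factor are
worst-case guesses taken from the numbers above, not measurements, and
the function names are hypothetical.

```python
import os


def read_mem_available_kib(meminfo_text):
    """Extract MemAvailable (in KiB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def pick_jobs(ncpus, available_kib, per_proc_kib=1024 * 1024, procs_per_job=2):
    """Return a job count limited by both CPUs and available memory.

    per_proc_kib and procs_per_job are assumed worst-case values.
    """
    cpu_limit = ncpus + 2  # the default described above
    mem_limit = available_kib // (per_proc_kib * procs_per_job)
    return max(1, min(cpu_limit, mem_limit))


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        avail = read_mem_available_kib(f.read())
    print("ninja -j", pick_jobs(os.cpu_count() or 1, avail))
```

With 8 CPUs and 8 GiB available this gives 4 jobs rather than 10, but
it is still a guess, since the real per-process footprint isn't known
until the build is already running.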

As for PSI, from what the oomd folks say, it sounds like swap is a
requirement. And yet, because of the poor performance of swapping,
quite a lot of users don't have any swap. Swap is also hit and miss in
server environments, and rare in cloud environments. So if there's a
hard requirement on swap existing, PSI isn't a universal solution.
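
For reference, reading the PSI signal itself is simple; a sketch of a
user space check follows. It assumes a kernel with PSI enabled so that
/proc/pressure/memory exists, and the 10% threshold on the "full"
avg10 value is an arbitrary number picked for illustration, not
anything oomd uses as far as I know.

```python
def parse_pressure(text):
    """Parse /proc/pressure/memory content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        result[fields[0]] = {
            key: float(value)
            for key, value in (f.split("=", 1) for f in fields[1:])
        }
    return result


def is_thrashing(pressure, full_avg10_threshold=10.0):
    """Hypothetical policy: flag when 'full' stall time over 10s is too high."""
    return pressure.get("full", {}).get("avg10", 0.0) >= full_avg10_threshold


if __name__ == "__main__":
    with open("/proc/pressure/memory") as f:
        print(is_thrashing(parse_pressure(f.read())))
```

The open question is what policy to hang off that number, and whether
it stays meaningful on a system with no swap at all.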

Thanks,

-- 
Chris Murphy