Re: user space unresponsive, followup: lsf/mm congestion

On Wed 08-01-20 14:14:22, Chris Murphy wrote:
> On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> > > On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > [...]
> > > > Btw. from a quick look at the sysrq output there seem to be quite a lot
> > > > of tasks (more than 1k) running on the system. Only a handful of them
> > > > belong to the compilation. kswapd is busy, and 13 processes are in
> > > > direct reclaim, all swapping out to disk.
> > >
> > > There might be many dozens of tabs in Firefox with nothing loaded in
> > > them, to keep the testing more real world (a compile while browsing)
> > > rather than being too deferential to the compile. That does clutter
> > > the sysrq+t output, but it doesn't change the outcome: the central
> > > culprit is the ninja compile, which by default runs n+2 jobs, where
> > > n is the number of virtual CPUs.
> >
> > How much memory does the compile process eat?
> 
> By default it sets jobs to numcpus+2, which is 10. But each job
> variably has two processes, and each process's memory requirement
> varies a ton, from a few MiB to over 1 GiB. In the first 20 minutes,
> about 13000 processes have started and stopped.

Well, the question is whether the memory demand comes from the
parallelism (aka something controlled by the build process) or from the
compilation itself, which might be really memory hungry and hard to
tame even with parallelism disabled. If the latter is the case
then there is not really much the kernel can do, I am afraid. If the OOM
killer kills your compilation or another important part of your
environment then you just lose work without any gain, right? You
simply need more memory to handle that workload, or you throttle the
compilation and give it much more time to finish, because it will be
swapping in and out as the working set will not fit into the restricted
amount of memory.

> I've updated the bug, attaching kernel messages and /proc/vmstat in
> 1s increments, although quite often during the build multiple seconds
> of sampling were just skipped because the system was under too much
> pressure.

I have a tool to reduce that problem (see attached).

> > If you know that the compilation process is too disruptive wrt.
> > memory/cpu consumption then you can use cgroups (memory and cpu
> > controllers) to throttle that consumption and protect the rest of the
> > system. The compilation process will take much more time of course,
> > and the explicit configuration is obviously less comfortable than
> > out-of-the-box auto-configuration, but the kernel simply doesn't have
> > the information to prioritize resources.
> 
> Yes, but this isn't scalable for regular users who just follow an
> upstream's build instructions.

It certainly requires additional steps, no question about that. But I
fail to see how to do that automagically without knowing what the user
expects to happen when resources run short.
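
For concreteness, the additional steps could look something like the
sketch below; a minimal sketch only, assuming cgroup v2 is mounted at
/sys/fs/cgroup, with a made-up group name ("build") and made-up limit
values, not recommendations:

/*
 * Minimal sketch: create a cgroup v2 group for the build, cap its
 * memory and cpu, and move the current process into it before
 * exec'ing the build. All paths and values below are illustrative.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char pid[32];

	/* dedicated group for the build (needs root or delegation) */
	if (mkdir("/sys/fs/cgroup/build", 0755) && errno != EEXIST)
		perror("mkdir");

	/* hard cap, plus an earlier throttling point below it */
	write_str("/sys/fs/cgroup/build/memory.max", "4G");
	write_str("/sys/fs/cgroup/build/memory.high", "3G");
	/* at most 4 CPUs worth of runtime: 400ms quota per 100ms period */
	write_str("/sys/fs/cgroup/build/cpu.max", "400000 100000");

	/* move ourselves in; children (the compile jobs) inherit the group */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	write_str("/sys/fs/cgroup/build/cgroup.procs", pid);

	/* exec the build from here, e.g. execlp("ninja", "ninja", NULL); */
	return 0;
}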

> > I do agree that the oom detection could be improved to detect heavy
> > thrashing - be it on the page cache or swap in/out - and kill something
> > rather than leave the system struggling in a highly unproductive state.
> > This is far from trivial because what is productive is not something
> > the kernel can tell easily, as it depends on the workload. As mentioned
> > elsewhere, userspace is likely much better suited to define that policy,
> > and PSI seems to be a good indicator.
> 
> And even user space doesn't know what resources are required in
> advance. The user can guess that the estimate was wrong, force a
> power off, and start over with a lower number of jobs, or whatever.
> 
> As for PSI, from the oomd folks it sounds like swap is a requirement.
> And yet, because of the poor performance of swapping, quite a lot of
> users don't have any swap. Swap is also a mixed picture in server
> environments, and rare in cloud environments. So if there's a hard
> requirement on swap existing, PSI isn't a universal solution.

I cannot really comment on the swap requirement but I would recommend
having swap space, especially for workloads whose peak memory
consumption doesn't fit into RAM. Additional configuration steps might
be needed so that the whole system doesn't thrash on swap (e.g. use
memcgs with proper main memory partitioning), but once you are
overcommitting memory you have to be careful, I believe.
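
For reference, the PSI numbers are easy to consume from userspace. A
minimal sketch, with a made-up 20% threshold (a real policy daemon
such as oomd watches these averages continuously and reacts):

/*
 * Minimal sketch: read /proc/pressure/memory once and report when the
 * "full" 10s average exceeds a made-up threshold.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char line[256];
	float avg10;

	if (!f) {
		perror("/proc/pressure/memory");
		return 1;
	}
	/*
	 * Format, one line each:
	 *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	 *   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
	 */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "full avg10=%f", &avg10) == 1 &&
		    avg10 > 20.0f)
			printf("heavy memory stalls: full avg10=%.2f%%\n",
			       avg10);
	}
	fclose(f);
	return 0;
}
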
-- 
Michal Hocko
SUSE Labs



