Re: user space unresponsive, followup: lsf/mm congestion

Michal Hocko <mhocko@xxxxxxxxxx> · Tue, 14 Jan 2020 10:46:17 +0100

On Fri 10-01-20 15:27:10, Chris Murphy wrote:
> On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > So you have redirected the output (stdout) to a file. This is less
> > effective than using a file directly because the program makes sure to
> > preallocate and mlock the output file data as well. Anyway, let's have a
> > look at what you managed to gather
> 
> I just read the source :P and see the usage. I'll do that properly if
> there's a next time. Should it be saved in /tmp to avoid disk writes
> or does it not matter?

The usage is described at the top of the C file. As this is an internal
tool of mine, I didn't bother to make it super easy to use ;)

> > It would be interesting to see whether tuning vm_swappiness to 100 helps
> > but considering how large the anonymous active list is I would be very
> > skeptical.
> 
> I can try it. Is it better to capture the same amount of time as
> before? Or the entire thing until it fails or is stuck for at least 30
> minutes?

The last data set provided good insight, so following the same
methodology should be fine.
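
For the vm_swappiness run discussed above, a minimal sketch of reading and
setting the knob through procfs could look like the following. The helper
names are hypothetical and not part of any tool mentioned in this thread;
/proc/sys/vm/swappiness is the standard procfs location, and writing to it
requires root.

```python
from pathlib import Path

# Standard procfs location of the vm.swappiness sysctl.
SWAPPINESS = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def write_swappiness(value, path=SWAPPINESS):
    """Set swappiness (hypothetical helper).

    Needs root when 'path' is the real procfs file. The kernel enforces
    the upper bound itself, so only negative values are rejected here.
    """
    if value < 0:
        raise ValueError("swappiness must not be negative")
    Path(path).write_text(f"{value}\n")
```

Calling write_swappiness(100) as root before starting the capture, and
read_swappiness() afterwards to confirm, would cover the experiment above.
The setting does not survive a reboot.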

> > So in the end it is really hard to see what the kernel should have done
> > better in this overcommitted case. Killing memory hogs would likely kill
> > an active workload which would lead to a better desktop experience but I
> > can imagine setups which simply want to have work done albeit sloooowly.
> 
> Right, so the kernel can't know and doesn't really want to know, user
> intention. It's really a policy question.
> 
> But if the distribution wanted to have a policy of, the mouse pointer
> always works - i.e. the user should be able to kill this process, if
> they want, from within the GUI - that implies possibly a lot of work
> to carve out the necessary resources for that entire stack. I have no
> idea if that's possible with the current state of things.

Well, you always have the option of invoking the OOM killer with sysrq+f
and killing the memory hog that way. The more memory-demanding userspace
is, the more users have to think about how to partition memory as a
resource. We have tooling for that; it just has to be used.
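
Whether the sysrq+f key combination works depends on the kernel.sysrq
setting, which many distributions restrict by default. The sketch below
checks that setting. It assumes the bitmask layout from the kernel's sysrq
documentation: 0 disables everything, 1 enables everything, and otherwise
bit 0x40 covers process signalling, which includes the OOM kill. The
function names are hypothetical.

```python
from pathlib import Path

# Standard procfs location of the kernel.sysrq sysctl.
SYSRQ = Path("/proc/sys/kernel/sysrq")

# Bit that enables signalling of processes (term, kill, oom-kill),
# per the kernel sysrq documentation.
SIGNAL_BIT = 0x40


def oom_kill_key_enabled(mask):
    """Return True if sysrq+f from the keyboard is allowed for this mask."""
    if mask == 1:  # the value 1 means "all functions enabled"
        return True
    return bool(mask & SIGNAL_BIT)


def read_sysrq_mask(path=SYSRQ):
    """Return the current kernel.sysrq value as an integer."""
    return int(Path(path).read_text().strip())
```

On a live system, oom_kill_key_enabled(read_sysrq_mask()) reports whether
the key combination is available. If it is not, root can still trigger the
same action by writing "f" to /proc/sysrq-trigger.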

> Anyway, I see it's a difficult problem, and I appreciate the
> explanations. I don't care about this particular example, my interest
> is making it better for everyone - I personally run into this only
> when I'm testing for it, but those who experience it, experience it
> often. And they're often developers. They have no idea in advance what
> the build resource requirements are, and those requirements change a
> lot as the compile happens. Difficult problem.

I can only encourage people to report those problems and we can see
where we get from there. The underlying problem might be different even
though the symptoms seem similar.
-- 
Michal Hocko
SUSE Labs