Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps

Marcin Jabrzyk <m.jabrzyk@xxxxxxxxxxx> · Wed, 24 Aug 2016 12:14:06 +0200

On 23/08/16 00:44, Sonny Rao wrote:
On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Fri 19-08-16 10:57:48, Sonny Rao wrote:
On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Thu 18-08-16 23:43:39, Sonny Rao wrote:
On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Thu 18-08-16 10:47:57, Sonny Rao wrote:
On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Wed 17-08-16 11:57:56, Sonny Rao wrote:
[...]
2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate. To do that we need to gather
this information, and we'd like to be able to get it much faster than
400ms so that we can make the decision in time.

Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how pss is useful for
oom decisions.

I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS.  If RSS were good enough, simply
put, this patch wouldn't exist.

But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?

We use PSS to calculate the memory used by a process among all the
processes in the system, in the case of Chrome this tells us how much
each renderer process (which is roughly tied to a particular "tab" in
Chrome) is using and how much it has swapped out, so we know what the
worst offenders are -- I'm not sure what's unclear about that?

So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ the shared resource is? In other
words if you select a task to terminate based on the pss then you have to
kill others who share the same resource otherwise you do not release
that shared resource. Not to mention that such a shared resource might
be on tmpfs/shmem and it won't get released even after all processes
which map it are gone.

Ok I see why you're confused now, sorry.

In our case we do know what is being shared in general because
the sharing is mostly between those processes that we're looking at
and not other random processes or tmpfs, so PSS gives us useful data
in the context of these processes which are sharing the data
especially for monitoring between the set of these renderer processes.

OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how useful will it be in
general? Is it possible that presenting such a number could even be
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.

I see the specific point about PSS -- because you need to know what
is being shared or otherwise use it in a whole system context, but I
still think the whole system context is a valid and generally useful
thing.  But what about the private_clean and private_dirty?  Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg.  Swap seems to be available in /proc/<pid>/status.

I looked at the "shared" value in /proc/<pid>/statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?
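
For readers following along, the smaps route mentioned above amounts to
summing the per-mapping counters in user space. The following is a minimal
sketch under stated assumptions, not the kernel code from this patch set:
the names FIELDS, sum_smaps and read_smaps_totals are made up for
illustration, and the field list is simply the set of counters shown in
the totmaps output in this thread.

```python
import re

# Counters to total; this list mirrors the totmaps output quoted in the
# thread and is an assumption of this sketch, not a kernel definition.
FIELDS = ("Rss", "Pss", "Shared_Clean", "Shared_Dirty", "Private_Clean",
          "Private_Dirty", "Referenced", "Anonymous", "AnonHugePages", "Swap")

# Matches smaps counter lines such as "Pss:                  20 kB".
_COUNTER = re.compile(r"^(\w+):\s+(\d+) kB$")

def sum_smaps(text):
    """Sum the selected 'Name: N kB' counters over all mappings in smaps text."""
    totals = {name: 0 for name in FIELDS}
    for line in text.splitlines():
        match = _COUNTER.match(line.strip())
        if match and match.group(1) in totals:
            totals[match.group(1)] += int(match.group(2))
    return totals

def read_smaps_totals(pid):
    """Read /proc/<pid>/smaps (Linux only) and return summed counters in kB."""
    with open(f"/proc/{pid}/smaps") as smaps:
        return sum_smaps(smaps.read())
```

Every call re-reads and re-parses one text record per mapping, which is
the user-space cost the thread is discussing; the page table walk itself
still happens in the kernel either way.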

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes.  On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss:               98972 kB
Pss:               54717 kB
Shared_Clean:      19020 kB
Shared_Dirty:      26352 kB
Private_Clean:         0 kB
Private_Dirty:     53600 kB
Referenced:        92184 kB
Anonymous:         46524 kB
AnonHugePages:     24576 kB
Swap:              13148 kB

RSS is 80% higher than PSS and 84% higher than private data

localhost ~ # cat /proc/21470/totmaps
Rss:              118420 kB
Pss:               70938 kB
Shared_Clean:      22212 kB
Shared_Dirty:      26520 kB
Private_Clean:         0 kB
Private_Dirty:     69688 kB
Referenced:       111500 kB
Anonymous:         79928 kB
AnonHugePages:     24576 kB
Swap:              12964 kB

RSS is 66% higher than PSS and 69% higher than private data

localhost ~ # cat /proc/21435/totmaps
Rss:               97156 kB
Pss:               50044 kB
Shared_Clean:      21920 kB
Shared_Dirty:      26400 kB
Private_Clean:         0 kB
Private_Dirty:     48836 kB
Referenced:        90012 kB
Anonymous:         75228 kB
AnonHugePages:     24576 kB
Swap:              13064 kB

RSS is 94% higher than PSS and 98% higher than private data.

It looks like there's a set of about 40MB of shared pages which cause
the difference in this case.
Swap was roughly even on these but I don't think it's always going to be true.
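
The percentages quoted for the three processes can be re-derived from the
totmaps numbers. This is a small sketch of that arithmetic only; the
helper name percent_higher is hypothetical, and "private data" is taken
here to mean Private_Clean + Private_Dirty.

```python
def percent_higher(a_kb, b_kb):
    """Return how much larger a_kb is than b_kb, as a percentage."""
    return (a_kb / b_kb - 1.0) * 100.0

# pid -> (Rss, Pss, Private_Clean + Private_Dirty), all in kB, copied from
# the totmaps output quoted in this thread.
samples = {
    21550: (98972, 54717, 0 + 53600),
    21470: (118420, 70938, 0 + 69688),
    21435: (97156, 50044, 0 + 48836),
}

for pid, (rss, pss, private) in samples.items():
    print(f"{pid}: RSS is {percent_higher(rss, pss):.1f}% higher than PSS and "
          f"{percent_higher(rss, private):.1f}% higher than private data")
```

The figures quoted in the mail correspond to these values with the
fractional part dropped.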

Sorry to hijack the thread, but I found it only recently
and I guess it's the best place to present our point.
We are working on our custom OS based on Linux and we have also suffered
a lot from the /proc/<pid>/smaps file. As with Chrome, we tried to improve
our internal application memory management policies (Low Memory Killer)
using data provided by smaps, but we failed because reading and properly
parsing the file takes a very long time.

We've also observed that the RSS measurement is often much higher than
PSS, which seems to be closer to the real memory usage of a process. Using
smaps we would be able to calculate USS and know the exact minimum amount
of memory that would be freed after terminating a process. These are very
important sources of information, as they give us the possibility to
provide the best possible app life-cycle.
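
As an illustration of the USS idea, here is a minimal sketch that assumes
USS is Private_Clean + Private_Dirty and that the per-process counters
are already available as dictionaries in kB. The names uss_kb and
rank_by_uss are hypothetical, and the sample values are the ones quoted
earlier in this thread, not measurements from our system.

```python
def uss_kb(counters):
    """Unique set size in kB: memory mapped only by this process."""
    return counters["Private_Clean"] + counters["Private_Dirty"]

def rank_by_uss(procs):
    """Return (pid, uss_kb) pairs sorted with the largest USS first."""
    pairs = ((pid, uss_kb(counters)) for pid, counters in procs.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Private counters taken from the totmaps output quoted in this thread.
procs = {
    21550: {"Private_Clean": 0, "Private_Dirty": 53600},
    21470: {"Private_Clean": 0, "Private_Dirty": 69688},
    21435: {"Private_Clean": 0, "Private_Dirty": 48836},
}

for pid, uss in rank_by_uss(procs):
    print(f"pid {pid}: at least about {uss} kB freed on exit")
```

The ranking treats USS as a lower bound only; shared pages are freed
only once every process mapping them is gone.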

We have also tried to use smaps in an application for OS developers
as a source of detailed information about the memory usage of the system.
To check possible ways of improvement we tried totmaps from an earlier
version. In a sample case for our app, the CPU usage as presented by 'top'
decreased from ~60% to ~4.5% just by changing the source from smaps to
totmaps.

So we are also very interested in an interface such as totmaps, as it
gives detailed and complete memory usage information to user space, and
in our case much of the information provided by smaps is not useful to us
at all.

We are also using, or have tried using, other interfaces like status,
statm, cgroups.memory etc., but totmaps/smaps is still the best interface
for getting all of the per-process information in a single place.

We also use the private clean and private dirty and swap fields to
make a few metrics for the processes and charge each process for its
private, shared, and swap data. Private clean and dirty are used for
estimating a lower bound on how much memory would be freed.

I can imagine that this kind of information might be useful and
presented in /proc/<pid>/statm. The question is whether some of the
existing consumers would see the performance impact due to the page table
walk. Anyway even these counters might get quite tricky because even
shareable resources are considered private if the process is the only
one to map them (so again this might be a file on tmpfs...).

Swap and
PSS also give us some indication of additional memory which might get
freed up.
--
Michal Hocko
SUSE Labs

--
Marcin Jabrzyk
Samsung R&D Institute Poland
Samsung Electronics