On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> >> > [...]
>> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> >> information and we'd like to be able to get this information to make
>> >> >> >> the decision much faster than 400ms
>> >> >> >
>> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> >> > understand you want something better than SIGKILL and in fact this is
>> >> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> >> > well). Anyway if you are getting close to the OOM your system will most
>> >> >> > probably be really busy and chances are that also reading your new file
>> >> >> > will take much more time. I am also not quite sure how is pss useful for
>> >> >> > oom decisions.
>> >> >>
>> >> >> I mentioned it before, but based on experience RSS just isn't good
>> >> >> enough -- there's too much sharing going on in our use case to make
>> >> >> the correct decision based on RSS. If RSS were good enough, simply
>> >> >> put, this patch wouldn't exist.
>> >> >
>> >> > But that doesn't answer my question, I am afraid. So how exactly do you
>> >> > use pss for oom decisions?
>> >>
>> >> We use PSS to calculate the memory used by a process among all the
>> >> processes in the system, in the case of Chrome this tells us how much
>> >> each renderer process (which is roughly tied to a particular "tab" in
>> >> Chrome) is using and how much it has swapped out, so we know what the
>> >> worst offenders are -- I'm not sure what's unclear about that?
>> >
>> > So let me ask more specifically. How can you make any decision based on
>> > the pss when you do not know _what_ is the shared resource. In other
>> > words if you select a task to terminate based on the pss then you have to
>> > kill others who share the same resource otherwise you do not release
>> > that shared resource. Not to mention that such a shared resource might
>> > be on tmpfs/shmem and it won't get released even after all processes
>> > which map it are gone.
>>
>> Ok I see why you're confused now, sorry.
>>
>> In our case that we do know what is being shared in general because
>> the sharing is mostly between those processes that we're looking at
>> and not other random processes or tmpfs, so PSS gives us useful data
>> in the context of these processes which are sharing the data
>> especially for monitoring between the set of these renderer processes.
>
> OK, I see and agree that pss might be useful when you _know_ what is
> shared. But this sounds quite specific to a particular workload. How
> many users are in a similar situation? In other words, if we present
> a single number without the context, how much useful it will be in
> general? Is it possible that presenting such a number could be even
> misleading for somebody who doesn't have an idea which resources are
> shared? These are all questions which should be answered before we
> actually add this number (be it a new/existing proc file or a syscall).
> I still believe that the number without wider context is just not all
> that useful.

I see the specific point about PSS: you need to know what is being
shared, or otherwise use it in a whole-system context. But I still think
the whole-system context is a valid and generally useful thing. And what
about private_clean and private_dirty? Surely those are more generally
useful for calculating a lower bound on process memory usage without
additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade-offs we are willing to make. RSS is the cheapest but the most
coarse. PSS (with the correct context) and private data plus swap are
much better, but also more expensive due to the page table walk.

As far as I know, to get anything but RSS we have to go through smaps or
use memcg. Swap seems to be available in /proc/<pid>/status.

I looked at the "shared" value in /proc/<pid>/statm but it doesn't seem
to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of the difference between using
RSS and using PSS/Private in the case of the Chrome renderer processes.
On the system I was looking at there were about 40 of these processes,
but I picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss:               98972 kB
Pss:               54717 kB
Shared_Clean:      19020 kB
Shared_Dirty:      26352 kB
Private_Clean:         0 kB
Private_Dirty:     53600 kB
Referenced:        92184 kB
Anonymous:         46524 kB
AnonHugePages:     24576 kB
Swap:              13148 kB

RSS is 80% higher than PSS and 84% higher than private data

localhost ~ # cat /proc/21470/totmaps
Rss:              118420 kB
Pss:               70938 kB
Shared_Clean:      22212 kB
Shared_Dirty:      26520 kB
Private_Clean:         0 kB
Private_Dirty:     69688 kB
Referenced:       111500 kB
Anonymous:         79928 kB
AnonHugePages:     24576 kB
Swap:              12964 kB

RSS is 66% higher than PSS and 69% higher than private data

localhost ~ # cat /proc/21435/totmaps
Rss:               97156 kB
Pss:               50044 kB
Shared_Clean:      21920 kB
Shared_Dirty:      26400 kB
Private_Clean:         0 kB
Private_Dirty:     48836 kB
Referenced:        90012 kB
Anonymous:         75228 kB
AnonHugePages:     24576 kB
Swap:              13064 kB

RSS is 94% higher than PSS and 98% higher than private data.

It looks like there's a set of about 40MB of shared pages which cause
the difference in this case. Swap was roughly even on these but I don't
think it's always going to be true.

>
>> We also use the private clean and private dirty and swap fields to
>> make a few metrics for the processes and charge each process for its
>> private, shared, and swap data. Private clean and dirty are used for
>> estimating a lower bound on how much memory would be freed.
>
> I can imagine that this kind of information might be useful and
> presented in /proc/<pid>/statm. The question is whether some of the
> existing consumers would see the performance impact due to the page table
> walk. Anyway even these counters might get quite tricky because even
> shareable resources are considered private if the process is the only
> one to map them (so again this might be a file on tmpfs...).
>
>> Swap and
>> PSS also give us some indication of additional memory which might get
>> freed up.
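
For anyone who wants to check the percentages quoted for the three
renderer processes, here is a rough Python sketch, not part of the patch.
The function names are made up for illustration. It assumes the input is
text in the "Name: N kB" layout shown above (one totmaps dump, or the
per-mapping entries of /proc/<pid>/smaps, which it sums by field), and it
assumes the percentages are rounded down, which is what matches the
figures in this mail.

```python
import re

# Matches lines such as "Rss:   98972 kB"; mapping headers and
# non-kB lines in smaps do not match and are skipped.
_FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB$")


def sum_fields(text):
    """Sum every 'Name: N kB' line by field name (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        m = _FIELD.match(line.strip())
        if m:
            name, kb = m.group(1), int(m.group(2))
            totals[name] = totals.get(name, 0) + kb
    return totals


def percent_higher(a, b):
    """How many percent a is above b, rounded down (assumed rounding)."""
    return (a - b) * 100 // b


def rss_overhead(totals):
    """Return (RSS vs PSS, RSS vs private clean+dirty) in percent."""
    private = totals["Private_Clean"] + totals["Private_Dirty"]
    return (percent_higher(totals["Rss"], totals["Pss"]),
            percent_higher(totals["Rss"], private))


if __name__ == "__main__":
    sample = """\
Rss:               98972 kB
Pss:               54717 kB
Private_Clean:         0 kB
Private_Dirty:     53600 kB
"""
    print(rss_overhead(sum_fields(sample)))
```

Summing smaps this way in user space is the expensive path discussed
earlier in the thread; the sketch only illustrates the arithmetic, not a
faster way to get the numbers.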
> --
> Michal Hocko
> SUSE Labs