Re: [RFC] [PATCH 4/4] memcg: Document kernel memory accounting.

Glauber Costa <glommer@xxxxxxxxxxxxx> · Mon, 17 Oct 2011 12:56:09 +0400

On 10/15/2011 04:38 AM, Suleiman Souhlal wrote:
Signed-off-by: Suleiman Souhlal<suleiman@xxxxxxxxxx>
---
  Documentation/cgroups/memory.txt |   33 ++++++++++++++++++++++++++++++++-
  1 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 06eb6d9..277cf25 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -220,7 +220,37 @@ caches are dropped. But as mentioned above, global LRU can do swapout memory
  from it for sanity of the system's memory management state. You can't forbid
  it by cgroup.

-2.5 Reclaim
+2.5 Kernel Memory
+
+A cgroup's kernel memory is accounted into its memory.usage_in_bytes and
+is also shown in memory.stat as kernel_memory. Kernel memory does not get
+counted towards the root cgroup's memory.usage_in_bytes, but still
+appears in its kernel_memory.
+
+Upon cgroup deletion, all the remaining kernel memory gets moved to the
+root cgroup.
+
+An accounted kernel memory allocation may trigger reclaim in that cgroup,
+and may also OOM.
+
+Currently only slab memory allocated without __GFP_NOACCOUNT and
+__GFP_NOFAIL gets accounted to the current process' cgroup.
+
+2.5.1 Slab
+
+Slab gets accounted on a per-page basis, which is done by using per-cgroup
+kmem_caches. These per-cgroup kmem_caches get created on-demand, the first
+time a specific kmem_cache gets used by a cgroup.

Well, let me first start with some general comments:

I think the approach I've taken, which is to let cache creators
register themselves for cgroup usage, is better than scanning the
list of existing caches. A couple of key reasons:

1) We then don't need another flag. The equivalent of __GFP_NOACCOUNT
is simply doing nothing (not registering the cache).
2) Less pollution in the slab structure itself, which gives it a
better chance of inclusion, and less duplicated work in slub.
3) Easier to do per-cache tuning if we ever want to.

As for on-demand creation, I think it is a nice idea. But it may hurt
allocation latency on caches that we are sure will be used, like the
dentry cache. So that gives us:

4) If the cache creator is registering itself, it can specify which
behavior it wants: on-demand creation vs. straight creation.
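
To make points 1) and 4) concrete, here is a minimal userspace sketch of
what registration with a per-cache creation policy could look like. Every
name in it (memcg_register_cache, the MEMCG_CACHE_* policies, the fixed-size
table) is hypothetical and only illustrates the idea; none of it is an
existing kernel interface or the code from either patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names below are hypothetical; this is not an existing kernel API. */
enum memcg_cache_policy {
	MEMCG_CACHE_ON_DEMAND,	/* per-cgroup copy created on first use */
	MEMCG_CACHE_STRAIGHT,	/* per-cgroup copy created with the cgroup */
};

struct memcg_cache_reg {
	const char *name;
	enum memcg_cache_policy policy;
};

#define MAX_REGISTERED_CACHES 16

static struct memcg_cache_reg registered[MAX_REGISTERED_CACHES];
static size_t nr_registered;

/* Called by the cache creator. Caches that never register are not accounted. */
static int memcg_register_cache(const char *name, enum memcg_cache_policy policy)
{
	assert(name != NULL);
	if (nr_registered == MAX_REGISTERED_CACHES)
		return -1;
	registered[nr_registered].name = name;
	registered[nr_registered].policy = policy;
	nr_registered++;
	return 0;
}

/* Number of per-cgroup caches to instantiate when a new cgroup is created. */
static size_t memcg_caches_to_create_now(void)
{
	size_t i, n = 0;

	for (i = 0; i < nr_registered; i++)
		if (registered[i].policy == MEMCG_CACHE_STRAIGHT)
			n++;
	return n;
}

/* A cache is accounted only if its creator registered it. */
static int memcg_cache_is_accounted(const char *name)
{
	size_t i;

	for (i = 0; i < nr_registered; i++)
		if (strcmp(registered[i].name, name) == 0)
			return 1;
	return 0;
}
```

With this shape, the dentry cache would register with the straight policy
and pay the creation cost up front, while rarely used caches would register
as on-demand. A cache that never registers needs no flag at all.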

+Slab memory that cannot be attributed to a cgroup gets charged to the root
+cgroup.
+
+A per-cgroup kmem_cache is named like the original, with the cgroup's name
+in parethesis.

I used the address for simplicity, but I like names better. Agree here.
Extending it: if a task resides in the cgroup itself, I think it should
see only its own caches in /proc/slabinfo (selectable; take a look at
https://lkml.org/lkml/2011/10/6/132 for more details).
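
For reference, the naming scheme from the quoted hunk (the original cache
name followed by the cgroup's name in parentheses) could be built roughly
as below. The helper name and signature are made up for illustration, and
this is plain userspace C rather than the code from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes "<cache>(<cgroup>)" into buf.
 * Returns 0 on success, -1 if buf is too small to hold the full name.
 */
static int memcg_cache_name(char *buf, size_t len,
			    const char *cache, const char *cgroup)
{
	int n;

	assert(buf != NULL && cache != NULL && cgroup != NULL);
	n = snprintf(buf, len, "%s(%s)", cache, cgroup);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Unlike an address, a name built this way stays meaningful to whoever reads
/proc/slabinfo.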

+When a kmem_cache gets migrated to the root cgroup, "dead" is appended to
+its name, to indicated that it is not going to be used for new allocations.

Why not just remove it?

+2.6 Reclaim

  Each cgroup maintains a per cgroup LRU which has the same structure as
  global VM. When a cgroup goes over its limit, we first try
@@ -396,6 +426,7 @@ active_anon	- # of bytes of anonymous and swap cache memory on active
  inactive_file	- # of bytes of file-backed memory on inactive LRU list.
  active_file	- # of bytes of file-backed memory on active LRU list.
  unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).
+kernel_memory   - # of bytes of kernel memory.

  # status considering hierarchy (see memory.use_hierarchy settings)


A few other points:

* I think using res_counters is better than relying on slab fields to
impose limits.
* We still need the ability to restrict kernel memory usage separately
from user memory, dependent on a selectable, as we already discussed here.
* I think we should do everything in our power to reduce overhead for
the special case in which only the root cgroup exists. Take a look at
what happened in the following thread:
https://lkml.org/lkml/2011/10/13/201. To be honest, I think it is an
idea we should at least consider: not accounting *anything* to the root
cgroup (make it selectable if we want to preserve the current
behaviour), neither user memory nor kernel memory. Then we can keep
native performance for non-cgroup users. (But that's another discussion
anyway.)
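
The three points above can be sketched together in a few lines of userspace
C. This is a deliberately simplified stand-in: struct sketch_counter is not
the kernel's struct res_counter, and the kmem_independent and is_root fields
are hypothetical names for the selectable and for the root-cgroup shortcut.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a resource counter; not struct res_counter. */
struct sketch_counter {
	uint64_t usage;
	uint64_t limit;
};

/* Charge val bytes; fail with no side effects if the limit would be exceeded. */
static int sketch_charge(struct sketch_counter *c, uint64_t val)
{
	assert(c->usage <= c->limit);
	if (val > c->limit - c->usage)
		return -1;
	c->usage += val;
	return 0;
}

static void sketch_uncharge(struct sketch_counter *c, uint64_t val)
{
	assert(val <= c->usage);
	c->usage -= val;
}

/* Hypothetical per-cgroup state. */
struct sketch_memcg {
	struct sketch_counter mem;	/* user memory, plus kmem if not independent */
	struct sketch_counter kmem;	/* kernel memory, used only if independent */
	int kmem_independent;		/* the "selectable" */
	int is_root;			/* root cgroup: skip accounting entirely */
};

static int sketch_charge_kmem(struct sketch_memcg *cg, uint64_t val)
{
	if (cg->is_root)
		return 0;	/* nothing is accounted to the root in this sketch */
	return sketch_charge(cg->kmem_independent ? &cg->kmem : &cg->mem, val);
}
```

The limit check lives entirely in the counter, so the slab structures would
not need any limit fields of their own, and the root cgroup returns before
touching any counter.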

All in all, this is a good start. Both our approaches have a lot in
common (which is not strange, given that we discussed them a lot
over the past month =p, and I did like some concepts).
