On Thu, Aug 16, 2012 at 10:15 PM, Glauber Costa <glommer@xxxxxxxxxxxxx> wrote:
> On 08/17/2012 03:41 AM, Dave Chinner wrote:
>> On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
>>> On 08/16/2012 04:53 PM, Ying Han wrote:
>>>> The patchset adds the functionality of isolating the vfs slab objects
>>>> per-memcg under reclaim. This feature is a *must-have* after the kernel
>>>> slab memory accounting which starts charging the slab objects into
>>>> individual memcgs. The existing per-superblock shrinker doesn't work
>>>> since it will end up reclaiming slabs being charged to other memcgs.
>>
>> What list was this posted to?
>
> This what? The per-memcg slab accounting? linux-mm and cgroups, and at
> least once to lkml.
>
> You can also find the up-to-date version in my git tree:
>
> git://github.com/glommer/linux.git memcg-3.5/kmemcg-slab
>
> But then you mainly lose the discussion. You can find the thread at
> http://lwn.net/Articles/508087/, and if you scan recent messages to
> linux-mm, there is a lot there too.
>
>> The per-sb shrinkers are not intended for memcg granularity - they
>> are for scalability in that they allow the removal of the global
>> inode and dcache LRU locks and allow significant flexibility in
>> cache reclaim strategies for filesystems. Hint: reclaiming
>> the VFS inode cache doesn't free any memory on an XFS filesystem -
>> it's the XFS inode cache shrinker that is integrated into the per-sb
>> shrinker infrastructure that frees all the memory. It doesn't work
>> without the per-sb shrinker functionality and it's an extremely
>> performance-critical balancing act. Hence any changes to this
>> shrinker infrastructure need a lot of consideration and testing,
>> most especially to ensure that the balance of the system has not
>> been disturbed.
>>
>
> I was actually wondering where the balance would stand between hooking
> this into the current shrinking mechanism, and having something totally
> separate for memcg. It is tempting to believe that we could get away
> with something that works well for memcg-only, but this already proved
> to be untrue for the user pages LRU list...
>
>
>> Also how do you propose to solve the problem of inodes and dentries
>> shared across multiple memcgs? They can only be tracked in one LRU,
>> but the caches are global and are globally accessed.
>
> I think the proposal is to not solve this problem. Because at first it
> sounds a bit weird, let me explain myself:
>
> 1) Not all processes in the system will sit on a memcg.
> Technically they will, but the root cgroup is never accounted, so a big
> part of the workload can be considered "global" and will have no
> attached memcg information whatsoever.
>
> 2) Not all child memcgs will have associated vfs objects, or kernel
> objects at all, for that matter. This happens only when specifically
> requested by the user.
>
> Due to that, I believe that although sharing is obviously a reality
> within the VFS, the workloads associated with it will tend to be fairly
> local. When sharing does happen, we currently account to the first
> process to ever touch the object. This is also how memcg treats shared
> memory users for userspace pages, and it is working well so far. It
> doesn't *always* give you good behavior, but I guess those cases fall
> in the list of "workloads memcg is not good for".
>
> Do we want to extend this list of use cases? Sure. There is also
> discussion going on about how to improve this in the future.
> That would allow a policy to specify which memcg is to be "responsible"
> for the shared objects, be they kernel memory or shared memory regions.
> Even then, we'll always have one of two scenarios:
>
> 1) There is a memcg that is responsible for accounting that object, and
> then it is clear we should reclaim from that memcg.
>
> 2) There is no memcg associated with the object, and then we should not
> bother with that object at all.

In the patch I have, all objects are associated with *a* memcg. Objects
that are charged to root or reparented to root do get associated with
root, and further memory pressure on root (global reclaim) will be
applied to those objects.

>
> I fully understand your concern, specifically because we talked about
> that in detail in the past. But I believe most of the cases that would
> justify it would fall in 2).
>
> Another thing to keep in mind is that we don't actually track objects.
> We track pages, and try to make sure that objects in the same page
> belong to the same memcg. (That could be important for your analysis or
> not...)
>
>> Having mem pressure in a single memcg that causes globally accessed
>> dentries and inodes to be tossed from memory will simply cause cache
>> thrashing, and performance across the system will tank.
>>

Not sure that is the case after this patch. The global LRU is split
per-memcg, and each dentry is linked to its per-memcg list. So targeted
reclaim of memcg A will only reclaim from the hashtable bucket indexed
by A, not from the others.

> As said above, I don't consider globally accessed dentries to be
> representative of the current use cases for memcg.
>
>>>> The patch now is only handling the dentry cache, given that dentries
>>>> by nature pin inodes. Based on the data we've collected, that
>>>> contributes the largest share of the reclaimable slab objects. We
>>>> could also make a generic infrastructure for all the shrinkers (if
>>>> needed).
>>>
>>> Dave Chinner has some prototype code for that.
>>
>> The patchset I have makes the dcache lru locks per-sb as the first
>> step to introducing generic per-sb LRU lists, and then builds on
>> that to provide generic kernel-wide LRU lists with integrated
>> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
>> scalability) into the LRU list so everyone gets scalable shrinkers.
>>
>
> If you are building a generic infrastructure for shrinkers, what is the
> big point about per-sb? I'll give you that most of the memory will come
> from the VFS, but there are other shrinkable objects too that bear no
> relationship with the vfs.

The patchset is trying to solve a very simple problem: allow
shrink_slab() to locate the *right* dentry objects to reclaim given the
memcg context. I haven't thought about NUMA and node awareness for the
shrinkers, and that sounds like something beyond the problem I am trying
to solve here. I need to think a bit more about how that fits into the
problem you described.

>
>> I've looked at memcg awareness in the past, but the problem is the
>> overhead - the explosion of LRUs because of the per-sb X per-node X
>> per-memcg object tracking matrix. It's a huge amount of overhead
>> and complexity, and unless there's a way of efficiently tracking
>> objects both per-node and per-memcg simultaneously then I'm of the
>> opinion that memcg awareness is simply too much trouble, complexity
>> and overhead to bother with.
>>
>> So, convince me you can solve the various problems. ;)
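
To make the dentry side a bit more concrete, below is a stripped-down
userspace model of what I am describing. All the names are made up and it
is nothing like the real kernel code, but the shape of it is: every object
is tagged with the memcg it was charged to (root if nothing else applies),
objects sit on a per-memcg LRU instead of one global list, targeted
reclaim only walks the list of the memcg under pressure, and global
reclaim walks all of them including root. The per-node dimension you
mention is exactly the part this toy leaves out.

#include <stdio.h>
#include <stdlib.h>

#define NR_MEMCG 4                       /* memcg 0 plays the role of root */

struct object {
	int memcg;                       /* memcg this object was charged to */
	struct object *next;             /* link in that memcg's LRU */
};

static struct object *lru[NR_MEMCG];     /* one LRU per memcg, not one global LRU */

/* charge an object to a memcg (or to root) and add it to that memcg's LRU */
static void charge(struct object *obj, int memcg)
{
	obj->memcg = memcg;
	obj->next = lru[memcg];
	lru[memcg] = obj;
}

/* targeted reclaim: scan only the LRU of the memcg under pressure */
static int shrink_memcg(int memcg, int nr_to_scan)
{
	int freed = 0;

	while (lru[memcg] && nr_to_scan--) {
		struct object *victim = lru[memcg];

		lru[memcg] = victim->next;
		free(victim);
		freed++;
	}
	return freed;
}

/* global reclaim: walk every per-memcg LRU, root included */
static int shrink_all(int nr_to_scan)
{
	int freed = 0, i;

	for (i = 0; i < NR_MEMCG; i++)
		freed += shrink_memcg(i, nr_to_scan);
	return freed;
}

int main(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		struct object *obj = malloc(sizeof(*obj));

		charge(obj, i % NR_MEMCG);   /* spread objects across memcgs */
	}

	printf("freed from memcg 1: %d\n", shrink_memcg(1, 10));
	printf("freed globally:     %d\n", shrink_all(10));
	return 0;
}

In the real patch the objects are dentries and the per-memcg lists are what
targeted reclaim walks; charge() roughly corresponds to the point where the
slab accounting already decides which memcg a new dentry belongs to.
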
> I believe we are open minded regarding a solution for that, and your
> input is obviously top. So let me take a step back and restate the
> problem:
>
> 1) Some memcgs, not all, will have memory pressure regardless of the
> memory pressure in the rest of the system.
> 2) That memory pressure may or may not involve kernel objects.
> 3) If kernel objects are involved, we can assume the level of sharing
> is low.
> 4) We then need to shrink memory from that memcg, affecting the others
> as little as we can.
>
> Do you have any proposals for that, in any shape?
>
> One thing that crossed my mind was, instead of having per-sb x per-node
> objects, we could have per-"group" x per-node objects. The group would
> then be either a memcg or a sb. Objects that don't belong to a memcg -
> where we expect most of the globally accessed objects to fall - would be
> tied to the sb. Global shrinkers, when called, would of course scan all
> groups. Shrinking could also be triggered for a single group. An object
> would of course only live in one of them at a time.

Not sure I understand this yet. Will think a bit more tomorrow morning
when my brain works better :) I put a very rough sketch of how I am
currently reading the "group" idea in the P.S. below.

--Ying
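
P.S. Glauber, just to check whether I am reading the per-"group" x
per-node idea correctly, is it roughly the shape below? This is only a
sketch of my understanding, with made-up names: a group is either a memcg
or a sb, each group carries one LRU per node, an object lives on exactly
one group's LRU, targeted reclaim scans one group, and a global shrinker
loops over all groups.

#define MAX_NODES 2

struct object;

/* one LRU per NUMA node inside each group, to keep Dave's node awareness */
struct lru_node {
	struct object *head;
	long nr_items;
};

/* a "group" is either a memcg or a superblock */
struct lru_group {
	enum { GROUP_SB, GROUP_MEMCG } type;
	struct lru_node node[MAX_NODES];
	struct lru_group *next;          /* list of all groups, for global reclaim */
};

struct object {
	struct lru_group *group;         /* decided at charge time; the sb if no memcg */
	struct object *lru_next;
};

/* targeted reclaim scans one group on one node; a global shrinker would
 * just iterate the group list and call this for each group and node. */
long shrink_group(struct lru_group *group, int nid, long nr_to_scan)
{
	long freed = 0;
	struct lru_node *lru = &group->node[nid];

	while (lru->head && nr_to_scan--) {
		struct object *victim = lru->head;

		lru->head = victim->lru_next;
		lru->nr_items--;
		/* ... actually free the object here ... */
		freed++;
	}
	return freed;
}

If that is more or less what you mean, then the memcg side of my patch
could probably be recast as the GROUP_MEMCG case of something like this,
with the globally accessed objects staying on the sb group. But as I
said, I will think about it more tomorrow.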