Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup

On Thu, Aug 16, 2012 at 10:15 PM, Glauber Costa <glommer@xxxxxxxxxxxxx> wrote:
> On 08/17/2012 03:41 AM, Dave Chinner wrote:
>> On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
>>> On 08/16/2012 04:53 PM, Ying Han wrote:
>>>> The patchset adds the functionality of isolating the vfs slab objects per-memcg
>>>> under reclaim. This feature is a *must-have* after the kernel slab memory
>>>> accounting which starts charging the slab objects into individual memcgs. The
>>>> existing per-superblock shrinker doesn't work since it will end up reclaiming
>>>> slabs being charged to other memcgs.
>>
>> What list was this posted to?
>
> This what? The per-memcg slab accounting? linux-mm and cgroups, and at
> least once to lkml.
>
> You can also find the up-to-date version in my git tree:
>
>   git://github.com/glommer/linux.git memcg-3.5/kmemcg-slab
>
> But then you mainly lose the discussion. You can find the thread at
> http://lwn.net/Articles/508087/, and if you scan recent messages to
> linux-mm, there is a lot there too.
>
>> The per-sb shrinkers are not intended for memcg granularity - they
>> are for scalability in that they allow the removal of the global
>> inode and dcache LRU locks and allow significant flexibility in
>> cache reclaim strategies for filesystems. Hint: reclaiming
>> the VFS inode cache doesn't free any memory on an XFS filesystem -
>> it's the XFS inode cache shrinker that is integrated into the per-sb
>> shrinker infrastructure that frees all the memory. It doesn't work
>> without the per-sb shrinker functionality and it's an extremely
>> performance critical balancing act. Hence any changes to this
>> shrinker infrastructure need a lot of consideration and testing,
>> most especially to ensure that the balance of the system has not
>> been disturbed.
>>
>
> I was actually wondering where the balance would stand between hooking
> this into the current shrinking mechanism, and having something totally
> separate for memcg. It is tempting to believe that we could get away
> with something that works well for memcg-only, but this already proved
> to be not true for the user pages lru list...
>
>
>> Also how do you propose to solve the problem of inodes and dentries
>> shared across multiple memcgs?  They can only be tracked in one LRU,
>> but the caches are global and are globally accessed.
>
> I think the proposal is to not solve this problem. Because at first it
> sounds a bit weird, let me explain myself:
>
> 1) Not all processes in the system will sit on a memcg.
> Technically they will, but the root cgroup is never accounted, so a big
> part of the workload can be considered "global" and will have no
> attached memcg information whatsoever.
>
> 2) Not all child memcgs will have associated vfs objects, or kernel
> objects at all, for that matter. This happens only when specifically
> requested by the user.
>
> Due to that, I believe that although sharing is obviously a reality
> within the VFS, the workloads associated with it will tend to be
> fairly local. When sharing does happen, we currently account to the
> first process to ever touch the object. This is also how memcg treats
> shared memory users for userspace pages and it is working well so far.
> It doesn't *always* give you good behavior, but I guess those fall in
> the list of "workloads memcg is not good for".
>
> Do we want to extend this list of use cases? Sure. There is also
> discussion going on about how to improve this in the future. That would
> allow a policy to specify which memcg is to be "responsible" for the
> shared objects, be they kernel memory or shared memory regions. Even
> then, we'll always have one of the two scenarios:
>
> 1) There is a memcg that is responsible for accounting that object, and
> then it is clear we should reclaim from that memcg.
>
> 2) There is no memcg associated with the object, and then we should not
> bother with that object at all.

In the patch I have, all objects are associated with *a* memcg. Objects
that are charged to root or reparented to root get associated with root,
and further memory pressure on root (global reclaim) will be applied to
those objects.
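
To make "associated with *a* memcg" more concrete, here is a minimal
sketch of the idea (memcg_from_slab_obj() and memcg_dentry_lru() are
made-up helper names standing in for "find the memcg the backing slab
page is charged to" and "find that memcg's dentry list" -- they are not
the patch's actual interface):

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/memcontrol.h>

/* Sketch only: every dentry resolves to *a* memcg. */
static struct mem_cgroup *dentry_memcg(struct dentry *dentry)
{
        struct mem_cgroup *memcg = memcg_from_slab_obj(dentry);

        /* Charged to root, or reparented to root: fall back to the
         * root memcg, so global reclaim still covers the object. */
        return memcg ? memcg : root_mem_cgroup;
}

/* On LRU insertion the dentry goes on that memcg's list. */
static void dentry_lru_add_memcg(struct dentry *dentry)
{
        struct mem_cgroup *memcg = dentry_memcg(dentry);

        list_add(&dentry->d_lru, memcg_dentry_lru(memcg, dentry->d_sb));
}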

>
> I fully understand your concern, specifically because we talked about
> that in detail in the past. But I believe most of the cases that would
> justify it would fall in 2).
>
> Another thing to keep in mind is that we don't actually track objects.
> We track pages, and try to make sure that objects in the same page
> belong to the same memcg. (That could be important for your analysis or
> not...)
>
>> Having mem
>> pressure in a single memcg that causes globally accessed dentries
>> and inodes to be tossed from memory will simply cause cache
>> thrashing and performance across the system will tank.
>>
Not sure that is the case after this patch. The global LRU is split
per-memcg, and each dentry is linked to its per-memcg list. So target
reclaim of memcg A will only reclaim the hashtable bucket indexed by A,
not the others.
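
Roughly, the split looks like this: one dentry list per (sb, memcg)
pair, found through a hash keyed on the memcg, so target reclaim only
ever walks the list of the memcg under pressure. All structure and
helper names below are illustrative, not the actual patch code:

/* Sketch only: one dentry LRU per (sb, memcg) pair. */
struct memcg_dentry_lru {
        struct list_head        list;     /* dentries charged to one memcg */
        struct mem_cgroup       *memcg;   /* owning memcg (root for global) */
        unsigned long           nr_items;
};

static long prune_dcache_for_memcg(struct super_block *sb,
                                   struct mem_cgroup *memcg, long nr_to_scan)
{
        /* Hash lookup keyed on the memcg: reclaim for memcg A never
         * visits dentries charged to other memcgs or to root. */
        struct memcg_dentry_lru *lru = memcg_dentry_lru_find(sb, memcg);

        if (!lru)
                return 0;       /* this memcg owns no dentries on this sb */

        return shrink_dentry_list_from(lru, nr_to_scan);
}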

> As said above, I don't consider globally accessed dentries to be
> representative of the current use cases for memcg.

>
>>>> The patch now only handles the dentry cache, given that dentries pin
>>>> inodes. Based on the data we've collected, that is the main contributor to
>>>> the reclaimable slab objects. We also could make a generic infrastructure for
>>>> all the shrinkers (if needed).
>>>
>>> Dave Chinner has some prototype code for that.
>>
>> The patchset I have makes the dcache lru locks per-sb as the first
>> step to introducing generic per-sb LRU lists, and then builds on
>> that to provide generic kernel-wide LRU lists with integrated
>> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
>> scalability) into the LRU list so everyone gets scalable shrinkers.
>>
>
> If you are building a generic infrastructure for shrinkers, what is the
> big point about per-sb? I'll give you that most of the memory will come
> from the VFS, but other objects that bear no relationship with the VFS
> are shrinkable too.

The patchset is trying to solve a very simple problem: it allows
shrink_slab() to locate the *right* dentry objects to reclaim given the
memcg context.
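
In other words, the plumbing amounts to passing the memcg under pressure
down through shrink_slab() so the per-sb shrinker can pick the matching
dentry list. A sketch with an assumed extra memcg field on the shrink
control (the current struct shrink_control only carries gfp_mask and
nr_to_scan, so that field is an assumption here, and
prune_dcache_for_memcg() is the hypothetical helper from the sketch
above):

/* Sketch only: a memcg-aware variant of the per-sb shrinker entry. */
struct shrink_control_memcg {
        gfp_t                   gfp_mask;
        unsigned long           nr_to_scan;
        struct mem_cgroup       *memcg;   /* NULL means global reclaim */
};

static long prune_super_memcg(struct super_block *sb,
                              struct shrink_control_memcg *sc)
{
        if (!sc->memcg) {
                /* Global reclaim: behave like the existing per-sb path. */
                prune_dcache_sb(sb, sc->nr_to_scan);
                return sc->nr_to_scan;
        }

        /* Memcg-targeted reclaim: only scan that memcg's dentry list. */
        return prune_dcache_for_memcg(sb, sc->memcg, sc->nr_to_scan);
}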

I haven't thought about NUMA and node awareness for the shrinkers, and
that sounds like something beyond the problem I am trying to solve here.
I might need to think a bit more about how that fits into the problem
you described.

>
>> I've looked at memcg awareness in the past, but the problem is the
>> overhead - the explosion of LRUs because of the per-sb X per-node X
>> per-memcg object tracking matrix.  It's a huge amount of overhead
>> and complexity, and unless there's a way of efficiently tracking
>> objects both per-node and per-memcg simultaneously, then I'm of the
>> opinion that memcg awareness is simply too much trouble, complexity
>> and overhead to bother with.
>>
>> So, convince me you can solve the various problems. ;)
>>
>
> I believe we are open-minded regarding a solution for that, and your
> input is obviously valuable. So let me take a step back and restate the problem:
>
> 1) Some memcgs, not all, will have memory pressure regardless of the
> memory pressure in the rest of the system.
> 2) That memory pressure may or may not involve kernel objects.
> 3) If kernel objects are involved, we can assume the level of sharing is
> low.
> 4) We then need to shrink memory from that memcg, affecting the others
> as little as we can.
>
> Do you have any proposals for that, in any shape?
>
> One thing that crossed my mind was that instead of having per-sb x per-node
> objects, we could have per-"group" x per-node objects. The group would
> then be either a memcg or an sb. Objects that don't belong to a memcg -
> where we expect most of the globally accessed ones to fall - would be tied
> to the sb. Global shrinkers, when called, would of course scan all groups.
> Shrinking could also be triggered per group. An object would of
> course only live in one of them at a time.

Not sure I understand this. Will think a bit more tomorrow morning
when my brain works better :)

--Ying


