Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)

On 05/06/2011 05:05 AM, Dave Chinner wrote:
> On Tue, May 03, 2011 at 04:14:37PM +0400, Pavel Emelyanov wrote:
>> Hi.
>>
>> According to the "release early, release often" strategy :) I'm
>> glad to propose this scratch implementation of what I was talking
>> about at the LSF - the way to limit the dcache grow for both
>> containerized and not systems (the set applies to 2.6.38).
> 
> dcache growth is rarely the memory consumption problem in systems -
> it's inode cache growth that is the issue. Each inodes consumes 4-5x
> as much memory as a dentry, and the dentry lifecycle is a subset of
> the inode lifecycle.  Limiting the number of dentries will do very
> little to relieve memory problems because of this.

No, you don't take into account that once the dentry cache is shrunk,
the inode cache can be shrunk as well (since there are no objects other
than dentries that hold inodes in the cache), but not vice versa. That
said, if we keep the dentry cache from growing, it becomes possible to
keep the inode cache from growing too.

> Indeed, I actually get a request from embedded folks every so often
> to limit the size of the inode cache - they never have troubles with
> the size of the dentry cache (and I do ask) - so perhaps you need to
> consider this aspect of the problem a bit more.
> 
> FWIW, I often see machines during tests where the dentry cache is
> empty, yet there are millions of inodes cached on the inode LRU
> consuming gigabytes of memory. e.g a snapshot from my 4GB RAM test
> VM right now:
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 2180754 2107387  96%    0.21K 121153       18    484612K xfs_ili
> 2132964 2107389  98%    1.00K 533241        4   2132964K xfs_inode
> 1625922 944034   58%    0.06K  27558       59    110232K size-64
> 415320 415301    99%    0.19K  20766       20     83064K dentry
> 
> You see 400k active dentries consume 83MB of ram, yet 2.1M active
> inodes consuming ~2.6GB of RAM. We've already reclaimed the dentry
> cache down quite small, while the inode cache remains the dominant
> memory consumer.....

Same here - this 2.6GB of RAM is shrinkable memory (unless xfs inode
references are leaked).

> I'm also concerned about the scalability issues - moving back to
> global lists and locks for LRU, shrinker and mob management is the
> opposite direction we are taking - we want to make the LRUs more
> fine-grained and more closely related to the MM structures,
> shrinkers confined to per-sb context (no more lifecycle issues,
> ever) and operate per-node/-zone rather than globally, etc.  It
> seems to me that this containerisation will make much of that work
> difficult to acheive effectively because it doesn't take any of this
> ongoing scalability work into account.

Two things from my side on this:

1. Can you be more specific here - which parts of the VFS suffer from the
LRU being global? The only thing I found was the problem with shrinking
the dcache for some sb on umount, but in patch #4 I made both routines
that do this walk the dentry tree rather than the LRU list, so the
global LRU is no longer an issue at that point.

2. If for any reason you do need to keep the LRU per super block (please
share that reason if you do), we can create mobs per super block :) In
other words, with mobs we are much more flexible in how we manage dentry
LRUs than with per-sb LRUs.

>> The first 5 patches are preparations for this, descriptive (I hope)
>> comments are inside them.
>>
>> The general idea of this set is -- make the dentries subtrees be
>> limited in size and shrink them as they hit the configured limit.
> 
> And if the inode cache that does not shrink with it?

Yet again, that's not a big deal. Once we have killed the dentries, the
inodes are no longer pinned in memory and the very first
try_to_free_pages() call can free them.

>> Why subtrees? Because this lets having the [dentry -> group] reference
>> without the reference count, letting the [dentry -> parent] one handle
>> this.
>>
>> Why limited? For containers the answer is simple -- a container
>> should not be allowed to consume too much of the host memory. For
>> non-containerized systems the answer is -- to protect the kernel
>> from the non-privileged attacks on the dcache memory like the 
>> "while :; do mkdir x; cd x; done" one and similar.
> 
> Which will stop as soon as the path gets too long. 

No, it will *not*! Bash will start complaining that it won't be able to
set the CWD environment variable, but once you turn this into a C
program...
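 
For illustration, a minimal sketch of what such a C program could look
like (my own example, not part of the patch set). Each iteration only
ever passes the relative name "x" to the kernel, so the growing depth of
the path never matters to mkdir()/chdir(), while every new dentry keeps
a reference on its parent, so the whole chain stays in the dcache:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		/*
		 * Create and enter one more "x" directory.  The relative
		 * name stays one character long however deep we get, so
		 * there is no PATH_MAX limit to run into.
		 */
		if (mkdir("x", 0700) && errno != EEXIST) {
			perror("mkdir");
			return 1;
		}
		if (chdir("x")) {
			perror("chdir");
			return 1;
		}
	}
}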

> And if this is really a problem on your systems, quotas can prevent this from 
> ever being an issue....

Disagree.

Let's take a minimal CentOS 5.5 container. It contains ~30K files, and
that is before any payload data such as web server static pages/scripts,
databases, devel tools, etc. Thus we cannot configure the quota for this
container with a lower limit. I'd assume 50K inodes is the minimum we
should set (for the record, the default inode quota for this in OpenVZ
is 200000, but people most often increase it).

Given that on x86_64 one dentry takes ~200 bytes and one ext4 inode
takes ~1K, we give this container the ability to lock
50K * (200 + 1K) ~ 60M of RAM.

As our experience shows, on a node with e.g. 2G of RAM you can easily
host up to 20 containers with a LAMP stack (you can host more, but they
will be noticeably slower).

Thus, by trying to handle the issue with disk quotas you are giving your
containers the ability to lock up to 1.2GB of RAM with dcache + icache.
That is way too much.
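 
For reference, the arithmetic behind those numbers, using the rough
per-object sizes above:

  per container:  50,000 objects * (200 + 1,024) bytes ~= 61 MB
  per node:       20 containers * ~61 MB               ~= 1.2 GB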

>> What isn't in this patch yet, but should be done after the discussion
>>
>> * API. I haven't managed to invent any perfect solution, and would
>> really like to have it discussed. In order to be able to play with it 
>> the ioctls + proc for listing are proposed.
>>
>> * New mounts management. Right now if you mount some new FS to a
>> dentry which belongs to some managed set (I named it "mob" in this
>> patchset), the new mount is managed with the system settings. This is
>> not OK, the new mount should be managed with the settings of the
>> mountpoint's mob.
>>
>> * Elegant shrink_dcache_memory on global memory shortage. By now the
>> code walks the mobs and shinks some equal amount of dentries from them.
>> Better shrinking policy can and probably should be implemented.
> 
> See above.
> 
> Cheers,
> 
> Dave.

Thanks,
Pavel