Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)

On Fri, May 06, 2011 at 04:15:50PM +0400, Pavel Emelyanov wrote:
> On 05/06/2011 05:05 AM, Dave Chinner wrote:
> > On Tue, May 03, 2011 at 04:14:37PM +0400, Pavel Emelyanov wrote:
> >> Hi.
> >>
> >> According to the "release early, release often" strategy :) I'm
> >> glad to propose this scratch implementation of what I was talking
> >> about at the LSF - the way to limit the dcache grow for both
> >> containerized and not systems (the set applies to 2.6.38).
> > 
> > dcache growth is rarely the memory consumption problem in systems -
> > it's inode cache growth that is the issue. Each inode consumes 4-5x
> > as much memory as a dentry, and the dentry lifecycle is a subset of
> > the inode lifecycle.  Limiting the number of dentries will do very
> > little to relieve memory problems because of this.
> 
> No, you don't take into account that once we have the dentry cache shrunk
> the inode cache can also be shrunk (since there are no objects other than
> dentries that hold inodes in cache), but not vice versa. That said --
> if we keep the dentry cache from growing it becomes possible to keep the 
> inode cache from growing.

That's a fairly naive view of the way the caches interact. Unlike
dentries, inodes can be dirtied and can't be reclaimed until they
are clean. That requires IO. Hence the inode cache can't be
reclaimed as easily as the dentry cache, nor can controlling the
size of the dentry cache control the size of the inode
cache. At best, it's a second order effect.

Effectively, what you have is:

	L1 cache = dentry cache
	L2 cache = VFS inode cache,
			pinned by L1,
			pinned by dirty state
	L3 cache = 1st level FS inode cache,
			pinned by L2
			pinned by dirty state

None of the cache sizes are fixed, and overall size is limited only
by RAM, so you will always tend to have the L3 cache dominate memory
usage because:

	a) they are the largest objects in the hierarchy; and
	b) they are pinned by the L1 and L2 caches and need to be
	freed from those caches first.

If you limit the size of the L2/L3 inode cache, you immediately
limit the size of the dentry cache for everything but heavy users of
hard links. If you can't allocate more inodes, you can't allocate a
new dentry.
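
To make that pinning concrete, here's a minimal sketch (structure layouts
heavily simplified, not the real 2.6.38 definitions; ext4 used as the
example filesystem) of why a cached dentry keeps an FS inode's slab
object alive:

	/*
	 * Illustration only:
	 *   L1 dentry --(holds ref)--> L2 VFS inode --(embedded in)--> L3 FS inode
	 */
	struct inode;				/* L2: generic VFS inode */

	struct dentry {				/* L1: ~200 bytes */
		struct inode *d_inode;		/* counted reference, only
						 * dropped when the dentry
						 * itself is freed */
		/* ... */
	};

	struct ext4_inode_info {		/* L3: ~1K */
		/* ext4 in-core state ... */
		struct inode vfs_inode;		/* the L2 object is embedded
						 * here, so this slab object
						 * can't be freed until the
						 * VFS inode is released */
	};

So reclaim has to walk the chain top-down: the dentry must go before the
VFS inode can, and the VFS inode must go before the FS inode's memory can
be returned.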

> > Indeed, I actually get a request from embedded folks every so often
> > to limit the size of the inode cache - they never have troubles with
> > the size of the dentry cache (and I do ask) - so perhaps you need to
> > consider this aspect of the problem a bit more.
> > 
> > FWIW, I often see machines during tests where the dentry cache is
> > empty, yet there are millions of inodes cached on the inode LRU
> > consuming gigabytes of memory. e.g. a snapshot from my 4GB RAM test
> > VM right now:
> > 
> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > 2180754 2107387  96%    0.21K 121153       18    484612K xfs_ili
> > 2132964 2107389  98%    1.00K 533241        4   2132964K xfs_inode
> > 1625922 944034   58%    0.06K  27558       59    110232K size-64
> > 415320 415301    99%    0.19K  20766       20     83064K dentry
> > 
> > You see 400k active dentries consume 83MB of RAM, yet 2.1M active
> > inodes consume ~2.6GB of RAM. We've already reclaimed the dentry
> > cache down quite small, while the inode cache remains the dominant
> > memory consumer.....
> 
> Same here - this 2.6GB of RAM is shrinkable memory (unless xfs inode
> references are leaked).

They aren't immediately reclaimable - they are all still pinned by
the VFS inode (L2) cache, and will be dirtied by having to truncate
away speculative allocation beyond EOF when the VFS inode cache
frees them. So there is IO required on all of those inodes before
they can be reclaimed. That's why the caches have ended up with this
size ratio, and that's from a long-running, steady-state workload.
Controlling the dentry cache size won't help reduce that inode cache
size one bit on such workloads....
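
As a rough sketch of that sequence (the helper names below are
hypothetical, standing in for the real VFS/XFS paths; this is not actual
kernel code), freeing one of those cached inodes looks roughly like:

	/* Hypothetical sketch of the reclaim chain - illustrative only */
	static void reclaim_cached_inode(struct dentry *dentry)
	{
		struct inode *inode = dentry->d_inode;

		free_dentry(dentry);	/* 1. L1: the dentry goes, dropping
					 *    its reference on the VFS inode */
		evict_vfs_inode(inode);	/* 2. L2: eviction calls into the FS,
					 *    which truncates the speculative
					 *    preallocation beyond EOF and so
					 *    dirties the inode */
		write_inode_back(inode);/* 3. IO: the dirty inode must be
					 *    written/logged first */
		reclaim_fs_inode(inode);/* 4. L3: only now can the FS inode
					 *    slab object be freed */
	}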

> > global lists and locks for LRU, shrinker and mob management are the
> > opposite of the direction we are taking - we want to make the LRUs more
> > fine-grained and more closely related to the MM structures,
> > shrinkers confined to per-sb context (no more lifecycle issues,
> > ever) and operate per-node/-zone rather than globally, etc.  It
> > seems to me that this containerisation will make much of that work
> > difficult to achieve effectively because it doesn't take any of this
> > ongoing scalability work into account.
> 
> Two things from my side on this:
> 
> 1. Can you be more specific on this - which parts of VFS suffer from the
> LRU being global?

Performance. It doesn't scale beyond a few CPUs before lock
contention becomes the limiting factor.

> The only thing I found was the problem with shrinking
> the dcache for some sb on umount, but in my patch #4 I made both routines
> that do it work on the dentry tree, not the LRU list, and thus the global LRU is
> no longer an issue at this point.

Actually, it is, because you've still got to remove the dentry from
the LRU to free it, which means taking the global LRU lock.

> 2. If for any reason you do need to keep the LRU per super block (please share
> such a reason if you do) we can create mobs per super block :) In other words - with
> mobs we're much more flexible in how to manage dentry LRUs than
> with per-sb LRUs.

Because of the hierarchical nature of the caches, and the fact that
we've got to jump through hoops to make sure the superblock doesn't
go away while we are doing a shrinker walk (the s_umount lock
problem). Moving to a per-sb shrinker means the shrinker callback has
the same life cycle as the superblock, and we no longer have a big
mess of locking and lifecycle concerns in memory reclaim.

On top of that, a single shrinker callout that shrinks the dentry,
VFS inode and FS inode caches in a single call means we do larger
chunks of work on each superblock at a time instead of a small
handful of dentries or inodes per shrinker call, as the current
"proportion across all sbs" code does. That will give
reclaim a smaller CPU cache footprint with higher hit rates, so
should significantly reduce the CPU usage of shrinking the caches as
well.

Not to mention having a per-sb shrinker means that you can call the
shrinker from inode allocation when you run out of inodes, and it
will shrink the dentry cache, the VFS inode cache and the FS inode
cache in the correct order to free up inodes as quickly as possible
to allow the new inode allocation to occur....
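
For illustration, such a per-sb callout might look roughly like the
sketch below (hypothetical helper names, not the real shrinker API of
the time):

	/* Hedged sketch: per-sb reclaim in the order described above.
	 * The prune_sb_* helpers are hypothetical. */
	static long shrink_sb_caches(struct super_block *sb, long nr_to_scan)
	{
		long freed = 0;

		/* 1. Unused dentries first - this unpins VFS inodes. */
		freed += prune_sb_dentries(sb, nr_to_scan);

		/* 2. Now-unreferenced VFS inodes - this unpins FS inodes. */
		freed += prune_sb_vfs_inodes(sb, nr_to_scan);

		/* 3. Finally let the filesystem trim its own inode cache,
		 *    which may require IO for dirty inodes (see the XFS
		 *    example above). */
		freed += prune_sb_fs_inodes(sb, nr_to_scan);

		return freed;
	}

Inode allocation could then call the same routine directly when it fails
to allocate, which is the ordering benefit described above.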

> >> The first 5 patches are preparations for this, descriptive (I hope)
> >> comments are inside them.
> >>
> >> The general idea of this set is -- make the dentries subtrees be
> >> limited in size and shrink them as they hit the configured limit.
> > 
> > And what if the inode cache does not shrink with it?
> 
> Yet again - that's not a big deal. Once we kill the dentries, the inodes are
> no longer pinned in memory and the very first try_to_free_pages can free them.

See above - the inode cache does not shrink in proportion with the
dentry cache.

> >> Why subtrees? Because this lets having the [dentry -> group] reference
> >> without the reference count, letting the [dentry -> parent] one handle
> >> this.
> >>
> >> Why limited? For containers the answer is simple -- a container
> >> should not be allowed to consume too much of the host memory. For
> >> non-containerized systems the answer is -- to protect the kernel
> >> from the non-privileged attacks on the dcache memory like the 
> >> "while :; do mkdir x; cd x; done" one and similar.
> > 
> > Which will stop as soon as the path gets too long. 
> 
> No, it will *not*! Bash will start complaining that it won't be able to set
> the CWD env variable, but once you turn this into a C program...

Sounds like a bug in bash ;)

> > And if this is really a problem on your systems, quotas can prevent this from 
> > ever being an issue....
> 
> Disagree.
> 
> Let's take the minimal CentOS5.5 container. It contains ~30K files, but in this
> container there's no data like web server static pages/scripts, databases,
> devel tools, etc. Thus we cannot configure the quota for this container with
> a lower limit. I'd assume that 50K inodes is the minimum we should set (for
> the record - the default quota size for this in OpenVZ is 200000, but people most
> often increase it).
>
> Given that on x86_64 one dentry takes ~200 bytes and one ext4 inode takes ~1K, we
> give this container the ability to lock 50K * (200 + 1K) ~ 60M of RAM.
>
> As our experience shows, if you have a node with e.g. 2G of RAM you can easily host
> up to 20 containers with a LAMP stack (you can host more, but this will be notably slow).
> 
> Thus, trying to handle the issue with disk quotas, you are giving your containers
> the ability to lock up to 1.2GB of RAM with dcache + icache. This is way too much.

<sigh>

I never implied quotas were for limiting cache usage. I only
suggested they were the solution to your DOS example by preventing
unbounded numbers of inodes from being created by an unprivileged
user.

To me, it sounds like you overprovision your servers and then
have major troubles when everyone tries to use what you supplied
them with simultaneously. There is a simple solution to that. ;)
Otherwise, I think you need to directly limit the size of the inode
caches, not try to do it implicitly via 2nd and 3rd order side
effects of controlling the size of the dentry cache.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx