Hi everyone,

I would suggest re-submitting the first few locking improvements that are
independent of the per-container dentry limit. Bumping the seqlock when
there's no modification to the struct is unnecessary work; avoiding it looks
nice and we don't want it lost if it's a valid micro-optimization. And the
size of the patchset to discuss would shrink too ;).

On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> They aren't immediately reclaimable - they are all still pinned by
> the VFS inode (L2) cache, and will be dirtied by having to truncate
> away speculative allocation beyond EOF when the VFS inode cache
> frees them. So there is IO required on all of those inodes before
> they can be reclaim. That's why the caches have ended up with this
> size ratio, and that's from a long running, steady-state workload.
> Controlling the dentry cache size won't help reduce that inode cache
> size one bit on such workloads....

Certainly opening a flood of inodes, changing some attribute and writing one
page to disk while reusing the same dentry wouldn't have a nice effect
either, but from the container point of view it would still be better than
an unlimited number of simultaneously pinned inodes, which makes it far too
easy to DoS.

Maybe the next step would be some other logic to limit the number of dirty
inodes a container can keep open. Waiting on inode writeback, pagecache
writeback and shrinking during open(2) wouldn't even fail with an error, it
would just wait, so it would be more graceful than the effect of too many
dentries. That would likely be a lot more complex than a dentry limit
though... so if that is the next thing to expect, we should take it into
account from a complexity point of view.

Overall, -ENOMEM failures from d_alloc propagating out of open(2) aren't so
nice for apps, which is why I'm not so fond of container virt compared to a
virt where the guest manages its own memory: no DoS like this one can ever
materialize for the host, and it requires no added complexity on the host
side. The container approach will never be as reliable as a guest OS at
avoiding these issues, so maybe we shouldn't complain that this solution
isn't perfect for the inode cache when it clearly helps their usage.

> > > global lists and locks for LRU, shrinker and mob management is the
> > > opposite direction we are taking - we want to make the LRUs more
> > > fine-grained and more closely related to the MM structures,
> > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > ever) and operate per-node/-zone rather than globally, etc. It
> > > seems to me that this containerisation will make much of that work
> > > difficult to acheive effectively because it doesn't take any of this
> > > ongoing scalability work into account.
> >
> > Two things from my side on this:
> >
> > 1. Can you be more specific on this - which parts of VFS suffer from the
> > LRU being global?
>
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

The global vs per-zone/numa dentry lru question seems an interesting point.
We probably have two different points of view here, pushing development in
two different directions because of different production objectives and
priorities, and this won't be as easy to get an agreement on. Maybe I
remember wrong, but I seem to recall Nick was proposing to split the vfs
lrus per-zone/numa too, and Christoph was against it (or maybe it was the
other way around :). But it wasn't discussed in a container context and I
don't remember exactly what the cons were. A rough sketch of the per-node
idea follows below.
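To make the direction concrete, here's a deliberately simplified userspace
toy (not kernel code; names like toy_dentry and lru_shrink_node are made up
for the illustration, and all locking and lifetime rules are omitted) of
what splitting one global LRU into per-node LRUs would buy: reclaim aimed at
one node only walks that node's list.

#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 4

struct toy_dentry {
	int nid;                        /* node the object lives on */
	struct toy_dentry *prev, *next; /* LRU linkage */
};

static struct toy_dentry lru[MAX_NODES];   /* per-node list heads */
static unsigned long nr_on_lru[MAX_NODES];

static void lru_init(void)
{
	for (int n = 0; n < MAX_NODES; n++)
		lru[n].prev = lru[n].next = &lru[n];
}

/* Insert at the head of the LRU of the node the dentry belongs to. */
static void lru_add(struct toy_dentry *d)
{
	struct toy_dentry *head = &lru[d->nid];

	d->next = head->next;
	d->prev = head;
	head->next->prev = d;
	head->next = d;
	nr_on_lru[d->nid]++;
}

/*
 * Shrink only the LRU of @nid: when a single node is short on memory
 * we don't have to churn the lists (or take the locks) of every other
 * node, which is the point of the split.
 */
static unsigned long lru_shrink_node(int nid, unsigned long nr)
{
	unsigned long freed = 0;

	while (freed < nr && lru[nid].prev != &lru[nid]) {
		struct toy_dentry *victim = lru[nid].prev;  /* oldest */

		victim->prev->next = &lru[nid];
		lru[nid].prev = victim->prev;
		nr_on_lru[nid]--;
		free(victim);
		freed++;
	}
	return freed;
}

int main(void)
{
	lru_init();
	for (int i = 0; i < 1000; i++) {
		struct toy_dentry *d = calloc(1, sizeof(*d));

		if (!d)
			abort();
		d->nid = i % MAX_NODES;  /* pretend round-robin node placement */
		lru_add(d);
	}
	printf("freed %lu dentries from node 1\n", lru_shrink_node(1, 100));
	return 0;
}

The interesting property is that the shrink entry point takes a node id,
which is exactly the selective zone/node shrink I mention below.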
Global usually provides better lru behavior, and splitting arbitrarily among
zones/nodes tends to be a scalability boost, but if we only see it as a
lock-scalability improvement it becomes a tradeoff between better lru
information and better scalability, and then we could just as well split the
lru arbitrarily without regard to the actual zone/node sizes. It is much
better to split the lrus on zone/node boundaries when there is a real need
to shrink specific zones/nodes from a reclaim point of view, not just to
scale better when taking the lock.

We obviously use that zone/node split in the pagecache lru, and when we have
to shrink a single node it clearly helps with more than lock scalability, so
per-node lrus are certainly needed in NUMA setups with HARDWALL NUMA pins,
where they save a ton of CPU and avoid global lru churning as well. So maybe
a per-zone/node lru would provide similar benefits for the vfs caches, and
it would indeed be the right direction from the MM point of view (ignoring
this very issue of containers). Right now we do blind vfs shrinks when we
could do selective zone/node ones, as the vfs shrinker caller already has
the zone/node info. Maybe whoever was against it (regardless of this
container dentry limit discussion) should point out what the cons are.

> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.
>
> To me, it sounds like you overprovision your servers and then
> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)
>
> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

They want to limit the amount of simultaneously pinned kernel RAM
structures, while still allowing a huge number of files in the filesystem to
keep life simple during install etc... So you can untar a backup of whatever
size into the container regardless of quotas, and if only a part of the
unpacked data is used by the apps (the common case) it just works. Again, I
don't think the objective is perfect accounting, just something that happens
to work better; if one wants perfect accounting of the memory and bytes used
by the on-disk image, there are other types of virt available.

Yet another approach would be to account how much kernel data structure
memory each process keeps pinned and unfreeable and add that to the process
RAM in the oom killer decision. But that wouldn't be a hard per-container
limit, and it sounds way too CPU costly to account the vfs pinned RAM every
time somebody calls open() or chdir(): it would require counting too many
things toward the dentry root, along the lines of the sketch below.
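Just to illustrate what that per-process charging would have to do on every
path lookup, here is a rough userspace sketch (all names and the per-dentry
cost figure are made up, and the matching un-charge and the real pin/unpin
bookkeeping are ignored): an exact charge has to walk every ancestor up to
the root, because pinning a dentry pins all of its parents too.

#include <stdio.h>
#include <stddef.h>

struct toy_dentry {
	struct toy_dentry *parent;   /* NULL for the dentry root */
	unsigned long pinned_bytes;  /* what has been charged to openers */
};

/* What every open()/chdir() would have to do in this scheme. */
static unsigned long charge_pinned_path(struct toy_dentry *d,
					unsigned long per_dentry_cost)
{
	unsigned long charged = 0;

	for (; d; d = d->parent) {   /* walk all the way up to the root */
		d->pinned_bytes += per_dentry_cost;
		charged += per_dentry_cost;
	}
	return charged;              /* would be added to the task footprint */
}

int main(void)
{
	/* A path five components deep: root/a/b/c/file */
	struct toy_dentry root = { NULL, 0 };
	struct toy_dentry a = { &root, 0 }, b = { &a, 0 };
	struct toy_dentry c = { &b, 0 }, file = { &c, 0 };

	/* 192 is just a made-up per-dentry cost for the example. */
	printf("one open() charges %lu bytes across 5 dentries\n",
	       charge_pinned_path(&file, 192));
	return 0;
}

That walk on every open, plus the matching un-charge later, is the kind of
cost I don't think we want to pay compared to a simple per-container dentry
limit.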