On Sat, Jun 18, 2011 at 03:30:38PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> I would suggest re-submitting the first few locking improvements that
> are independent of the per-container dentry limit. Incrementing the
> seqlock when there's no modification to the struct is unnecessary, so
> the change looks nice and we don't want it lost if it's a valid
> micro-optimisation. And the patchset size to discuss will decrease
> too ;).
>
> On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> > They aren't immediately reclaimable - they are all still pinned by
> > the VFS inode (L2) cache, and will be dirtied by having to truncate
> > away speculative allocation beyond EOF when the VFS inode cache
> > frees them. So there is IO required on all of those inodes before
> > they can be reclaimed. That's why the caches have ended up with this
> > size ratio, and that's from a long running, steady-state workload.
> > Controlling the dentry cache size won't help reduce that inode cache
> > size one bit on such workloads....
>
> Certainly opening a flood of inodes, changing some attribute and
> writing 1 page to disk, by reusing the same dentry, wouldn't have a
> very nice effect, but from the container point of view it'd still be
> better than an unlimited number of simultaneously pinned inodes,
> which makes it far too easy to DoS.

Perhaps you haven't understood how the VFS cache reclaim works? The
dentry cache shrinker runs before the inode cache shrinker, and hence
we unpin inodes before trying to reclaim them. Hence you can't "DOS"
the system by reading dentries and using them to pin inodes....

> Maybe the next step would be to require some other logic to limit the
> number of dirty inodes that can be opened by a container.

You don't "open" dirty inodes. Inodes are clean until they are marked
dirty as a result of some other operation. Hence a limit on the number
of dirty inodes in a container can only be done via a limit on the
total number of inodes.

> And waiting on inode writeback and pagecache writeback and shrinkage
> during open(2),

What makes you think open(2) is the only consumer of inodes? e.g.
"ls -l" causes the dentry/inode cache to be populated via lstat(2).
Another example: NFS file handle lookup....
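Just to make that concrete, here's a trivial userspace sketch (mine,
purely illustrative) of what "ls -l" is doing under the covers: every
lstat(2) forces a path lookup, and each lookup instantiates a dentry
and brings its inode into the VFS caches without a single open(2)
being issued:

/*
 * Purely illustrative: populate the dentry/inode caches without ever
 * calling open(2). Each lstat(2) triggers a path lookup, which
 * instantiates a dentry and brings the corresponding inode into the
 * VFS caches - the same thing "ls -l" does for every directory entry.
 */
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096];
	struct dirent *de;
	struct stat st;
	DIR *d = opendir(dir);

	if (!d)
		return 1;

	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st) == 0)	/* lookup + cache, no open(2) */
			printf("%s: inode %llu\n", path,
			       (unsigned long long)st.st_ino);
	}
	closedir(d);
	return 0;
}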
> won't even -EFAIL but it'd just wait, so it'd be more graceful than
> the effect of too many dentries. That would likely be a lot more
> complex than a dentry limit though... so if that is the next thing to
> expect we should take that into account from a complexity point of
> view.
>
> Overall -ENOMEM failures with d_alloc returning -ENOMEM in open(2)
> aren't so nice for apps, which is why I'm not so fond of the container
> virt vs a virt where the container manages its own memory and no DoS
> issue like this one can ever materialise for the host, and it requires
> no added complexity to the host. The container approach won't ever be
> as reliable as a guest OS in avoiding these issues, so maybe we
> shouldn't complain that this solution isn't perfect for the inode
> cache, when clearly it will help their usage.

I'm not sure what point you are trying to get across here? What do
containers have to do with how applications handle errors that can
already occur?

> > > > global lists and locks for LRU, shrinker and mob management is
> > > > the opposite direction we are taking - we want to make the LRUs
> > > > more fine-grained and more closely related to the MM structures,
> > > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > > ever) and operating per-node/-zone rather than globally, etc. It
> > > > seems to me that this containerisation will make much of that
> > > > work difficult to achieve effectively because it doesn't take
> > > > any of this ongoing scalability work into account.
> > >
> > > Two things from my side on this:
> > >
> > > 1. Can you be more specific on this - which parts of VFS suffer
> > > from the LRU being global?
> >
> > Performance. It doesn't scale beyond a few CPUs before lock
> > contention becomes the limiting factor.
>
> The global vs per-zone/numa dentry lru question seems an interesting
> point. Probably here we have two different points of view that push
> development in two different directions because of different
> production objectives and priorities. This won't be as easy to get
> agreement on.
>
> Maybe I remember wrong, but I seem to recall Nick was proposing to
> split the vfs lrus per-zone/numa too, and Christoph was against it
> (or maybe it was the other way around :).

It was Nick and myself that differed in opinion, not Christoph ;)

> But it wasn't discussed in a container context and I don't remember
> exactly what the cons were. Global usually provides better lru
> behaviour, and splitting arbitrarily among zones/nodes tends to be a
> scalability boost, but if we only see it as a lock-scalability
> improvement it becomes a tradeoff between better lru info and better
> scalability, so then we could arbitrarily split the lru without
> regard to the actual zone/node size.

Which was my argument - that filesystems scale along different axes
than the VM, so tightly integrating the LRU implementation at a high
level with the MM architecture is not necessarily the right thing to
do in all cases. If we tie everything to the MM architecture, anything
that scales along different axes to the MM subsystem is stuck with a
nasty impedance mismatch.

Case in point: XFS scales its internal inode cache via per allocation
group structures, not per NUMA zone. It does this to provide optimal
IO scheduling when reclaiming inodes, because the cost of making bad
IO decisions is orders of magnitude worse than reclaiming an object
that the VM doesn't consider necessary for reclaim... IOWs, XFS
optimises inode reclaim for high IO throughput, as IO has a much
higher cost than spending CPU time scanning lists.

Hence filesystems often have a fundamentally different memory reclaim
scalability problem to the MM subsystem, but Nick considered that
problem irrelevant to the architecture of the VFS cache reclaim
subsystem. That was the basis of our disagreement....

> It is much better to split lrus on zone/node boundaries when there's
> a real need to shrink specific zones/nodes from a reclaim point of
> view, not just for better scalability when taking the lock.
>
> We obviously use that zone/node lru split in the pagecache lru, and
> clearly when we have to shrink a single node it helps with more than
> just lock scalability, so per-node lrus are certainly needed in NUMA
> setups with HARDWALL NUMA pins, as they save a ton of CPU and avoid
> global lru churning as well.

Which was Nick's argument - it should all be done according to how the
MM subsystem sees the world....

> So maybe a per zone/node lru would provide similar benefits for vfs
> caches and it would indeed be the right direction from an MM point of
> view (ignoring this very issue of containers). Right now we do blind
> vfs shrinks when we could do selective zone/node ones, as the vfs
> shrinker caller has the zone/node info already. Maybe whoever was
> against it (regardless of this container dentry limit discussion)
> should point out what the cons are.

Nick's implementation of the per-mm-zone LRUs was the biggest problem -
it tightly coupled the VFS cache LRUs directly to the struct zone in
the mm. This meant that any subsystem that wanted to use the same
per-node LRU + shrinker infrastructure needed to tie deeply into the
MM architecture. IOWs, it didn't provide any abstraction from the VM,
nor the necessary flexibility for subsystems to use their own LRUs or
object reclaim tracking infrastructure....
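To make the abstraction argument a bit more concrete, here's a rough
sketch of what a per-node VFS object LRU could look like when it isn't
welded to struct zone: per-node lists with per-node locks, so reclaim
can target one node without global lock contention or global LRU
churn. This is purely my illustration - the names are made up, it is
not Nick's code and not an existing kernel interface:

/*
 * Illustrative sketch only: an LRU split into per-node lists, each
 * with its own lock, so reclaim can scan a single node without
 * touching the others.
 */
#include <pthread.h>

#define MAX_NODES	8

struct lru_item {
	struct lru_item	*prev, *next;	/* linkage on its node's list */
	int		node;		/* node the object was allocated on */
};

struct node_lru {
	pthread_mutex_t	lock;		/* per-node lock: no global contention */
	struct lru_item	head;		/* circular list, head.next == MRU end */
	unsigned long	nr_items;
};

struct per_node_lru {
	struct node_lru	nodes[MAX_NODES];
};

static void lru_init(struct per_node_lru *lru)
{
	for (int n = 0; n < MAX_NODES; n++) {
		pthread_mutex_init(&lru->nodes[n].lock, NULL);
		lru->nodes[n].head.prev = &lru->nodes[n].head;
		lru->nodes[n].head.next = &lru->nodes[n].head;
		lru->nodes[n].nr_items = 0;
	}
}

/* Add an object to the LRU of the node it was allocated on. */
static void lru_add(struct per_node_lru *lru, struct lru_item *item)
{
	struct node_lru *nlru = &lru->nodes[item->node];

	pthread_mutex_lock(&nlru->lock);
	item->next = nlru->head.next;
	item->prev = &nlru->head;
	nlru->head.next->prev = item;
	nlru->head.next = item;
	nlru->nr_items++;
	pthread_mutex_unlock(&nlru->lock);
}

/*
 * Node-targeted reclaim: detach up to nr_to_scan of the oldest objects
 * from one node only. A single global LRU can't do this without
 * scanning (and locking) everything.
 */
static struct lru_item *lru_isolate_node(struct per_node_lru *lru,
					 int node, unsigned long nr_to_scan)
{
	struct node_lru *nlru = &lru->nodes[node];
	struct lru_item *batch = NULL;

	pthread_mutex_lock(&nlru->lock);
	while (nr_to_scan-- && nlru->head.prev != &nlru->head) {
		struct lru_item *old = nlru->head.prev;	/* LRU end */

		old->prev->next = &nlru->head;
		nlru->head.prev = old->prev;
		nlru->nr_items--;

		old->next = batch;		/* chain isolated objects */
		batch = old;
	}
	pthread_mutex_unlock(&nlru->lock);
	return batch;
}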
> > I never implied quotas were for limiting cache usage. I only
> > suggested they were the solution to your DOS example by preventing
> > unbound numbers of inodes from being created by an unprivileged
> > user.
> >
> > To me, it sounds like you overprovision your servers and then have
> > major troubles when everyone tries to use what you supplied them
> > with simultaneously. There is a simple solution to that. ;)
> > Otherwise, I think you need to directly limit the size of the inode
> > caches, not try to do it implicitly via 2nd and 3rd order side
> > effects of controlling the size of the dentry cache.
>
> They want to limit the amount of simultaneously pinned kernel RAM
> structures, while still allowing a huge number of files in the
> filesystem to make life simple during install

But we don't pin the memory in the vfs caches forever - it gets freed
when we run out of free memory. The caches only grow large when there
isn't any other selection pressure (i.e. application or page cache) to
cause the VFS caches to shrink. In reality, this is a generic problem
that people have been hitting for years, and it is not specific to
containerised configurations.

> etc... So you can untar whatever size of backup into the container
> regardless of quotas, but if only a part of the unpacked data (the
> common case) is used by the apps it just works.

Well, it does just work in most cases, containerised or not. It's when
you push the boundaries (effectively overcommit resources) that the
current cache reclaim algorithms fail, containerised or not. The point
I'm trying to get across is that this problem is not unique to
containerised systems, so a solution that is tailored to a specific
containerised system implementation does not solve the generic problem
that is the root cause.

> Again, I don't think the objective is perfect accounting, just
> something that happens to work better; if one wants perfect
> accounting of the memory and bytes utilised by the on-disk image
> there are other types of virt available.

IOWs, they want per-container resource limits. I'd suggest that any
solution along these lines needs to use existing infrastructure (i.e.
cgroups) to control the resource usage of a given container...

FYI, the way I am trying to solve this problem is as follows:

1. Encode the reclaim dependency in the VFS cache memory reclaim
   implementation.
	-> per-sb shrinker implementation
	-> per-sb LRU lists
	-> per-sb locking
	-> binds dentry, inode and filesystem inode cache reclaim together
	-> allows LRU scalability to be addressed independently at a
	   future time
	-> patches already out for review

2. Provide global cache limiting at inode/dentry allocation time (see
   the sketch after this list).
	-> calls per-sb shrinker to free inodes on the same sb
	-> can be done asynchronously
	-> no new locking/lifecycle issues
	-> no cross-sb reference/locking issues

3. Add cache size limiting to the cgroup infrastructure.
	-> just another level of abstraction on existing infrastructure
	-> ties in with existing resource limiting mechanisms in the
	   kernel
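To show the shape of step 2 - and only the shape, every name below is
hypothetical and this is not the patchset itself - the idea is that
the dentry/inode allocation path accounts the new object against a
per-sb limit and, instead of failing with -ENOMEM, kicks the per-sb
shrinker asynchronously and lets the allocation proceed:

/*
 * Hypothetical sketch only - none of these names exist in the kernel.
 * Allocation never fails because of the limit; exceeding it simply
 * triggers asynchronous reclaim confined to the same superblock.
 */
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the bits of a superblock this sketch cares about. */
struct sb_cache {
	atomic_long	nr_cached_objects;	/* dentries + inodes on this sb */
	long		cache_limit;		/* 0 == unlimited */
	atomic_bool	shrink_queued;		/* async shrink already pending? */
};

/* Placeholder: schedule background work that walks the per-sb LRU
 * lists and frees dentries/inodes belonging to this sb only. The
 * worker would clear shrink_queued when it finishes. */
static void queue_sb_shrink_work(struct sb_cache *sb)
{
	(void)sb;
}

/*
 * Called from the dentry/inode allocation path. Note it returns
 * nothing: going over the limit starts asynchronous reclaim on the
 * same sb, it does not make open(2) and friends fail with -ENOMEM.
 */
static void sb_cache_charge(struct sb_cache *sb)
{
	long nr = atomic_fetch_add(&sb->nr_cached_objects, 1) + 1;

	if (sb->cache_limit && nr > sb->cache_limit) {
		bool expected = false;

		/* Allow only one pending shrink per sb at a time. */
		if (atomic_compare_exchange_strong(&sb->shrink_queued,
						   &expected, true))
			queue_sb_shrink_work(sb);
	}
}

/* Called when the shrinker (or normal teardown) frees an object. */
static void sb_cache_uncharge(struct sb_cache *sb)
{
	atomic_fetch_sub(&sb->nr_cached_objects, 1);
}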
Basically, the concept of "mobs" ends up being subsumed by cgroups, but
only at the LRU level, with no hooks into the dentry cache hierarchy at
all. i.e. reclaim works just like it does now, but it is simply
container-aware.

We're already having to solve these issues for cgroup-aware dirty page
writeback (i.e. making the bdi-flusher infrastructure cgroup aware), so
this is not as big a leap as you might think. It also avoids the need
for a one-off configuration ABI just for controlling the dentry
cache....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx