On Sat, Jun 18, 2011 at 03:30:38PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> I would suggest re-submitting the first few locking improvements that
> are independent of the per-container dentry limit. Incrementing the
> seqlock when there's no modification to the struct is unnecessary, so
> the change looks nice and we don't want it lost if it's a valid
> micro-optimisation. And the patchset size to discuss will decrease
> too ;).
>
> On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> > They aren't immediately reclaimable - they are all still pinned by
> > the VFS inode (L2) cache, and will be dirtied by having to truncate
> > away speculative allocation beyond EOF when the VFS inode cache
> > frees them. So there is IO required on all of those inodes before
> > they can be reclaimed. That's why the caches have ended up with this
> > size ratio, and that's from a long running, steady-state workload.
> > Controlling the dentry cache size won't help reduce that inode cache
> > size one bit on such workloads....
>
> Certainly opening a flood of inodes, changing some attribute and
> writing 1 page to disk, by reusing the same dentry, wouldn't have a
> very nice effect, but from the container point of view it'd still be
> better than an unlimited number of simultaneously pinned inodes,
> which makes it far too easy to DoS.

Perhaps you haven't understood how the VFS cache reclaim works? The
dentry cache shrinker runs before the inode cache shrinker, and hence
we unpin inodes before trying to reclaim them. Hence you can't "DOS"
the system by reading dentries and using them to pin inodes....

> Maybe the next step would be to require some other logic to limit the
> number of dirty inodes that can be opened by a container.

You don't "open" dirty inodes. Inodes are clean until they are marked
dirty as a result of some other operation. Hence a limit on the number
of dirty inodes in a container can only be done via a limit on the
total number of inodes.

> And waiting on inode writeback and pagecache writeback and shrinkage
> during open(2),

What makes you think open(2) is the only consumer of inodes? e.g.
"ls -l" causes the dentry/inode cache to be populated via lstat(2).
Another example: NFS file handle lookup....
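Just to make that concrete, here's a trivial userspace sketch (mine,
purely illustrative) of what "ls -l" is doing under the covers: every
lstat(2) forces a path lookup, and each lookup instantiates a dentry
and brings its inode into the VFS caches without a single open(2)
being issued:

/*
 * Purely illustrative: populate the dentry/inode caches without ever
 * calling open(2). Each lstat(2) triggers a path lookup, which
 * instantiates a dentry and brings the corresponding inode into the
 * VFS caches - the same thing "ls -l" does for every directory entry.
 */
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096];
	struct dirent *de;
	struct stat st;
	DIR *d = opendir(dir);

	if (!d)
		return 1;

	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st) == 0)	/* lookup + cache, no open(2) */
			printf("%s: inode %llu\n", path,
			       (unsigned long long)st.st_ino);
	}
	closedir(d);
	return 0;
}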
> won't even -EFAIL but it'd just wait, so it'd be more graceful than
> the effect of too many dentries. That would likely be a lot more
> complex than a dentry limit though... so if that is the next thing to
> expect we should take that into account from a complexity point of
> view.
>
> Overall -ENOMEM failures with d_alloc returning -ENOMEM in open(2)
> aren't so nice for apps, which is why I'm not so fond of the container
> virt vs a virt where the container manages its own memory and no DoS
> issue like this one can ever materialise for the host, and it requires
> no added complexity to the host. The container approach won't ever be
> as reliable as a guest OS in avoiding these issues, so maybe we
> shouldn't complain that this solution isn't perfect for the inode
> cache, when clearly it will help their usage.

I'm not sure what point you are trying to get across here? What do
containers have to do with how applications handle errors that can
already occur?

> > > > global lists and locks for LRU, shrinker and mob management is
> > > > the opposite direction we are taking - we want to make the LRUs
> > > > more fine-grained and more closely related to the MM structures,
> > > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > > ever) and operating per-node/-zone rather than globally, etc. It
> > > > seems to me that this containerisation will make much of that
> > > > work difficult to achieve effectively because it doesn't take
> > > > any of this ongoing scalability work into account.
> > >
> > > Two things from my side on this:
> > >
> > > 1. Can you be more specific on this - which parts of VFS suffer
> > > from the LRU being global?
> >
> > Performance. It doesn't scale beyond a few CPUs before lock
> > contention becomes the limiting factor.
>
> The global vs per-zone/numa dentry lru question seems an interesting
> point. Probably here we have two different points of view that push
> development in two different directions because of different
> production objectives and priorities. This won't be as easy to get
> agreement on.
>
> Maybe I remember wrong, but I seem to recall Nick was proposing to
> split the vfs lrus per-zone/numa too, and Christoph was against it
> (or maybe it was the other way around :).

It was Nick and myself that differed in opinion, not Christoph ;)

> But it wasn't discussed in a container context and I don't remember
> exactly what the cons were. Global usually provides better lru
> behaviour, and splitting arbitrarily among zones/nodes tends to be a
> scalability boost, but if we only see it as a lock-scalability
> improvement it becomes a tradeoff between better lru info and better
> scalability, so then we could arbitrarily split the lru without
> regard to the actual zone/node size.

Which was my argument - that filesystems scale along different axes
than the VM, so tightly integrating the LRU implementation at a high
level with the MM architecture is not necessarily the right thing to
do in all cases. If we tie everything to the MM architecture, anything
that scales along different axes to the MM subsystem is stuck with a
nasty impedance mismatch.

Case in point: XFS scales its internal inode cache via per allocation
group structures, not per NUMA zone. It does this to provide optimal
IO scheduling when reclaiming inodes, because the cost of making bad
IO decisions is orders of magnitude worse than reclaiming an object
that the VM doesn't consider necessary for reclaim... IOWs, XFS
optimises inode reclaim for high IO throughput, as IO has a much
higher cost than spending CPU time scanning lists.

Hence filesystems often have a fundamentally different memory reclaim
scalability problem to the MM subsystem, but Nick considered that
problem irrelevant to the architecture of the VFS cache reclaim
subsystem. That was the basis of our disagreement....

> It is much better to split lrus on zone/node boundaries when there's
> a real need to shrink specific zones/nodes from a reclaim point of
> view, not just for better scalability when taking the lock.
>
> We obviously use that zone/node lru split in the pagecache lru, and
> clearly when we have to shrink a single node it helps with more than
> just lock scalability, so per-node lrus are certainly needed in NUMA
> setups with HARDWALL NUMA pins, as they save a ton of CPU and avoid
> global lru churning as well.

Which was Nick's argument - it should all be done according to how the
MM subsystem sees the world....

> So maybe a per zone/node lru would provide similar benefits for vfs
> caches and it would indeed be the right direction from an MM point of
> view (ignoring this very issue of containers). Right now we do blind
> vfs shrinks when we could do selective zone/node ones, as the vfs
> shrinker caller has the zone/node info already. Maybe whoever was
> against it (regardless of this container dentry limit discussion)
> should point out what the cons are.

Nick's implementation of the per-mm-zone LRUs was the biggest problem -
it tightly coupled the VFS cache LRUs directly to the struct zone in
the mm. This meant that any subsystem that wanted to use the same
per-node LRU + shrinker infrastructure needed to tie deeply into the
MM architecture. IOWs, it didn't provide any abstraction from the VM,
nor the necessary flexibility for subsystems to use their own LRUs or
object reclaim tracking infrastructure....
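To make the abstraction argument a bit more concrete, here's a rough
sketch of what a per-node VFS object LRU could look like when it isn't
welded to struct zone: per-node lists with per-node locks, so reclaim
can target one node without global lock contention or global LRU
churn. This is purely my illustration - the names are made up, it is
not Nick's code and not an existing kernel interface:

/*
 * Illustrative sketch only: an LRU split into per-node lists, each
 * with its own lock, so reclaim can scan a single node without
 * touching the others.
 */
#include <pthread.h>

#define MAX_NODES	8

struct lru_item {
	struct lru_item	*prev, *next;	/* linkage on its node's list */
	int		node;		/* node the object was allocated on */
};

struct node_lru {
	pthread_mutex_t	lock;		/* per-node lock: no global contention */
	struct lru_item	head;		/* circular list, head.next == MRU end */
	unsigned long	nr_items;
};

struct per_node_lru {
	struct node_lru	nodes[MAX_NODES];
};

static void lru_init(struct per_node_lru *lru)
{
	for (int n = 0; n < MAX_NODES; n++) {
		pthread_mutex_init(&lru->nodes[n].lock, NULL);
		lru->nodes[n].head.prev = &lru->nodes[n].head;
		lru->nodes[n].head.next = &lru->nodes[n].head;
		lru->nodes[n].nr_items = 0;
	}
}

/* Add an object to the LRU of the node it was allocated on. */
static void lru_add(struct per_node_lru *lru, struct lru_item *item)
{
	struct node_lru *nlru = &lru->nodes[item->node];

	pthread_mutex_lock(&nlru->lock);
	item->next = nlru->head.next;
	item->prev = &nlru->head;
	nlru->head.next->prev = item;
	nlru->head.next = item;
	nlru->nr_items++;
	pthread_mutex_unlock(&nlru->lock);
}

/*
 * Node-targeted reclaim: detach up to nr_to_scan of the oldest objects
 * from one node only. A single global LRU can't do this without
 * scanning (and locking) everything.
 */
static struct lru_item *lru_isolate_node(struct per_node_lru *lru,
					 int node, unsigned long nr_to_scan)
{
	struct node_lru *nlru = &lru->nodes[node];
	struct lru_item *batch = NULL;

	pthread_mutex_lock(&nlru->lock);
	while (nr_to_scan-- && nlru->head.prev != &nlru->head) {
		struct lru_item *old = nlru->head.prev;	/* LRU end */

		old->prev->next = &nlru->head;
		nlru->head.prev = old->prev;
		nlru->nr_items--;

		old->next = batch;		/* chain isolated objects */
		batch = old;
	}
	pthread_mutex_unlock(&nlru->lock);
	return batch;
}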
> > I never implied quotas were for limiting cache usage. I only
> > suggested they were the solution to your DOS example by preventing
> > unbound numbers of inodes from being created by an unprivileged
> > user.
> >
> > To me, it sounds like you overprovision your servers and then have
> > major troubles when everyone tries to use what you supplied them
> > with simultaneously. There is a simple solution to that. ;)
> > Otherwise, I think you need to directly limit the size of the inode
> > caches, not try to do it implicitly via 2nd and 3rd order side
> > effects of controlling the size of the dentry cache.
>
> They want to limit the amount of simultaneously pinned kernel RAM
> structures, while still allowing a huge number of files in the
> filesystem to make life simple during install

But we don't pin the memory in the vfs caches forever - it gets freed
when we run out of free memory. The caches only grow large when there
isn't any other selection pressure (i.e. application or page cache) to
cause the VFS caches to shrink. In reality, this is a generic problem
that people have been hitting for years, and it is not specific to
containerised configurations.

> etc... So you can untar whatever size of backup into the container
> regardless of quotas, but if only a part of the unpacked data (the
> common case) is used by the apps it just works.

Well, it does just work in most cases, containerised or not. It's when
you push the boundaries (effectively overcommit resources) that the
current cache reclaim algorithms fail, containerised or not. The point
I'm trying to get across is that this problem is not unique to
containerised systems, so a solution that is tailored to a specific
containerised system implementation does not solve the generic problem
that is the root cause.

> Again, I don't think the objective is perfect accounting, just
> something that happens to work better; if one wants perfect
> accounting of the memory and bytes utilised by the on-disk image
> there are other types of virt available.

IOWs, they want per-container resource limits. I'd suggest that any
solution along these lines needs to use existing infrastructure (i.e.
cgroups) to control the resource usage of a given container...

FYI, the way I am trying to solve this problem is as follows:

1. Encode the reclaim dependency in the VFS cache memory reclaim
   implementation.
	-> per-sb shrinker implementation
	-> per-sb LRU lists
	-> per-sb locking
	-> binds dentry, inode and filesystem inode cache reclaim together
	-> allows LRU scalability to be addressed independently at a
	   future time
	-> patches already out for review

2. Provide global cache limiting at inode/dentry allocation time (see
   the sketch after this list).
	-> calls per-sb shrinker to free inodes on the same sb
	-> can be done asynchronously
	-> no new locking/lifecycle issues
	-> no cross-sb reference/locking issues

3. Add cache size limiting to the cgroup infrastructure.
	-> just another level of abstraction on existing infrastructure
	-> ties in with existing resource limiting mechanisms in the
	   kernel
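To show the shape of step 2 - and only the shape, every name below is
hypothetical and this is not the patchset itself - the idea is that
the dentry/inode allocation path accounts the new object against a
per-sb limit and, instead of failing with -ENOMEM, kicks the per-sb
shrinker asynchronously and lets the allocation proceed:

/*
 * Hypothetical sketch only - none of these names exist in the kernel.
 * Allocation never fails because of the limit; exceeding it simply
 * triggers asynchronous reclaim confined to the same superblock.
 */
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the bits of a superblock this sketch cares about. */
struct sb_cache {
	atomic_long	nr_cached_objects;	/* dentries + inodes on this sb */
	long		cache_limit;		/* 0 == unlimited */
	atomic_bool	shrink_queued;		/* async shrink already pending? */
};

/* Placeholder: schedule background work that walks the per-sb LRU
 * lists and frees dentries/inodes belonging to this sb only. The
 * worker would clear shrink_queued when it finishes. */
static void queue_sb_shrink_work(struct sb_cache *sb)
{
	(void)sb;
}

/*
 * Called from the dentry/inode allocation path. Note it returns
 * nothing: going over the limit starts asynchronous reclaim on the
 * same sb, it does not make open(2) and friends fail with -ENOMEM.
 */
static void sb_cache_charge(struct sb_cache *sb)
{
	long nr = atomic_fetch_add(&sb->nr_cached_objects, 1) + 1;

	if (sb->cache_limit && nr > sb->cache_limit) {
		bool expected = false;

		/* Allow only one pending shrink per sb at a time. */
		if (atomic_compare_exchange_strong(&sb->shrink_queued,
						   &expected, true))
			queue_sb_shrink_work(sb);
	}
}

/* Called when the shrinker (or normal teardown) frees an object. */
static void sb_cache_uncharge(struct sb_cache *sb)
{
	atomic_fetch_sub(&sb->nr_cached_objects, 1);
}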
Basically, the concept of "mobs" ends up being subsumed by cgroups, but
only at the LRU level, with no hooks into the dentry cache hierarchy at
all. i.e. reclaim works just like it does now, but it is simply
container-aware.

We're already having to solve these issues for cgroup-aware dirty page
writeback (i.e. making the bdi-flusher infrastructure cgroup aware), so
this is not as big a leap as you might think. It also avoids the need
for a one-off configuration ABI just for controlling the dentry
cache....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx