Hi everyone,

I would suggest re-submitting the first few locking improvements that are
independent of the per-container dentry limit. Bumping the seqlock when
there's no modification to the struct is unnecessary work; avoiding it looks
nice and we don't want it lost if it's a valid micro-optimization. And the
size of the patchset to discuss would shrink too ;).

On Sat, May 07, 2011 at 10:01:08AM +1000, Dave Chinner wrote:
> They aren't immediately reclaimable - they are all still pinned by
> the VFS inode (L2) cache, and will be dirtied by having to truncate
> away speculative allocation beyond EOF when the VFS inode cache
> frees them. So there is IO required on all of those inodes before
> they can be reclaim. That's why the caches have ended up with this
> size ratio, and that's from a long running, steady-state workload.
> Controlling the dentry cache size won't help reduce that inode cache
> size one bit on such workloads....

Certainly opening a flood of inodes, changing some attribute and writing one
page to disk while reusing the same dentry wouldn't have a nice effect
either, but from the container point of view it would still be better than
an unlimited number of simultaneously pinned inodes, which makes it far too
easy to DoS.

Maybe the next step would be some other logic to limit the number of dirty
inodes a container can keep open. Waiting on inode writeback, pagecache
writeback and shrinking during open(2) wouldn't even fail with an error, it
would just wait, so it would be more graceful than the effect of too many
dentries. That would likely be a lot more complex than a dentry limit
though... so if that is the next thing to expect, we should take it into
account from a complexity point of view.

Overall, -ENOMEM failures from d_alloc propagating out of open(2) aren't so
nice for apps, which is why I'm not so fond of container virt compared to a
virt where the guest manages its own memory: no DoS like this one can ever
materialize for the host, and it requires no added complexity on the host
side. The container approach will never be as reliable as a guest OS at
avoiding these issues, so maybe we shouldn't complain that this solution
isn't perfect for the inode cache when it clearly helps their usage.

> > > global lists and locks for LRU, shrinker and mob management is the
> > > opposite direction we are taking - we want to make the LRUs more
> > > fine-grained and more closely related to the MM structures,
> > > shrinkers confined to per-sb context (no more lifecycle issues,
> > > ever) and operate per-node/-zone rather than globally, etc. It
> > > seems to me that this containerisation will make much of that work
> > > difficult to acheive effectively because it doesn't take any of this
> > > ongoing scalability work into account.
> >
> > Two things from my side on this:
> >
> > 1. Can you be more specific on this - which parts of VFS suffer from the
> > LRU being global?
>
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

The global vs per-zone/numa dentry lru question seems an interesting point.
We probably have two different points of view here, pushing development in
two different directions because of different production objectives and
priorities, and this won't be as easy to get an agreement on. Maybe I
remember wrong, but I seem to recall Nick was proposing to split the vfs
lrus per-zone/numa too, and Christoph was against it (or maybe it was the
other way around :). But it wasn't discussed in a container context and I
don't remember exactly what the cons were. A rough sketch of the per-node
idea follows below.
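To make the direction concrete, here's a deliberately simplified userspace
toy (not kernel code; names like toy_dentry and lru_shrink_node are made up
for the illustration, and all locking and lifetime rules are omitted) of
what splitting one global LRU into per-node LRUs would buy: reclaim aimed at
one node only walks that node's list.

#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 4

struct toy_dentry {
	int nid;                        /* node the object lives on */
	struct toy_dentry *prev, *next; /* LRU linkage */
};

static struct toy_dentry lru[MAX_NODES];   /* per-node list heads */
static unsigned long nr_on_lru[MAX_NODES];

static void lru_init(void)
{
	for (int n = 0; n < MAX_NODES; n++)
		lru[n].prev = lru[n].next = &lru[n];
}

/* Insert at the head of the LRU of the node the dentry belongs to. */
static void lru_add(struct toy_dentry *d)
{
	struct toy_dentry *head = &lru[d->nid];

	d->next = head->next;
	d->prev = head;
	head->next->prev = d;
	head->next = d;
	nr_on_lru[d->nid]++;
}

/*
 * Shrink only the LRU of @nid: when a single node is short on memory
 * we don't have to churn the lists (or take the locks) of every other
 * node, which is the point of the split.
 */
static unsigned long lru_shrink_node(int nid, unsigned long nr)
{
	unsigned long freed = 0;

	while (freed < nr && lru[nid].prev != &lru[nid]) {
		struct toy_dentry *victim = lru[nid].prev;  /* oldest */

		victim->prev->next = &lru[nid];
		lru[nid].prev = victim->prev;
		nr_on_lru[nid]--;
		free(victim);
		freed++;
	}
	return freed;
}

int main(void)
{
	lru_init();
	for (int i = 0; i < 1000; i++) {
		struct toy_dentry *d = calloc(1, sizeof(*d));

		if (!d)
			abort();
		d->nid = i % MAX_NODES;  /* pretend round-robin node placement */
		lru_add(d);
	}
	printf("freed %lu dentries from node 1\n", lru_shrink_node(1, 100));
	return 0;
}

The interesting property is that the shrink entry point takes a node id,
which is exactly the selective zone/node shrink I mention below.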
Global usually provides better lru behavior, and splitting arbitrarily among
zones/nodes tends to be a scalability boost, but if we only see it as a
lock-scalability improvement it becomes a tradeoff between better lru
information and better scalability, and then we could just as well split the
lru arbitrarily without regard to the actual zone/node sizes. It is much
better to split the lrus on zone/node boundaries when there is a real need
to shrink specific zones/nodes from a reclaim point of view, not just to
scale better when taking the lock.

We obviously use that zone/node split in the pagecache lru, and when we have
to shrink a single node it clearly helps with more than lock scalability, so
per-node lrus are certainly needed in NUMA setups with HARDWALL NUMA pins,
where they save a ton of CPU and avoid global lru churning as well. So maybe
a per-zone/node lru would provide similar benefits for the vfs caches, and
it would indeed be the right direction from the MM point of view (ignoring
this very issue of containers). Right now we do blind vfs shrinks when we
could do selective zone/node ones, as the vfs shrinker caller already has
the zone/node info. Maybe whoever was against it (regardless of this
container dentry limit discussion) should point out what the cons are.

> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.
>
> To me, it sounds like you overprovision your servers and then
> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)
>
> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

They want to limit the amount of simultaneously pinned kernel RAM
structures, while still allowing a huge number of files in the filesystem to
keep life simple during install etc... So you can untar a backup of whatever
size into the container regardless of quotas, and if only a part of the
unpacked data is used by the apps (the common case) it just works. Again, I
don't think the objective is perfect accounting, just something that happens
to work better; if one wants perfect accounting of the memory and bytes used
by the on-disk image, there are other types of virt available.

Yet another approach would be to account how much kernel data structure
memory each process keeps pinned and unfreeable and add that to the process
RAM in the oom killer decision. But that wouldn't be a hard per-container
limit, and it sounds way too CPU costly to account the vfs pinned RAM every
time somebody calls open() or chdir(): it would require counting too many
things toward the dentry root, along the lines of the sketch below.
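Just to illustrate what that per-process charging would have to do on every
path lookup, here is a rough userspace sketch (all names and the per-dentry
cost figure are made up, and the matching un-charge and the real pin/unpin
bookkeeping are ignored): an exact charge has to walk every ancestor up to
the root, because pinning a dentry pins all of its parents too.

#include <stdio.h>
#include <stddef.h>

struct toy_dentry {
	struct toy_dentry *parent;   /* NULL for the dentry root */
	unsigned long pinned_bytes;  /* what has been charged to openers */
};

/* What every open()/chdir() would have to do in this scheme. */
static unsigned long charge_pinned_path(struct toy_dentry *d,
					unsigned long per_dentry_cost)
{
	unsigned long charged = 0;

	for (; d; d = d->parent) {   /* walk all the way up to the root */
		d->pinned_bytes += per_dentry_cost;
		charged += per_dentry_cost;
	}
	return charged;              /* would be added to the task footprint */
}

int main(void)
{
	/* A path five components deep: root/a/b/c/file */
	struct toy_dentry root = { NULL, 0 };
	struct toy_dentry a = { &root, 0 }, b = { &a, 0 };
	struct toy_dentry c = { &b, 0 }, file = { &c, 0 };

	/* 192 is just a made-up per-dentry cost for the example. */
	printf("one open() charges %lu bytes across 5 dentries\n",
	       charge_pinned_path(&file, 192));
	return 0;
}

That walk on every open, plus the matching un-charge later, is the kind of
cost I don't think we want to pay compared to a simple per-container dentry
limit.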