Hi Glauber, Here's a working version of my patchset for generic LRU lists and NUMA-aware shrinkers. There are several parts to this patch set. The NUMA aware shrinkers are based on having a generic node-based LRU list implementation, and there are subsystems that need to be converted to use these lists as part of the process. There is also a long overdue change to the shrinker API to give it separate object count and object scan callbacks, getting rid of the magic and confusing "nr_to_scan = 0" semantics. First of all, the patch set is based on a current 3.7-rc7 tree with the current xfs dev tree merged into it [can be found at git://oss.sgi.com/xfs/xfs]. That's because there are lots of XFS changes in the patch set, and theres no way I'm going to write them a second time in a couple of weeks when the current dev tree is merged into 3.8-rc1.... So, where's what the patches do: [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu [PATCH 02/19] dentry: move to per-sb LRU locks [PATCH 03/19] dcache: remove dentries from LRU before putting on These three patches are preparation of the dcache for moving to the generic LRU list API. it basically gets rid of the global dentry LRU lock, and in doing so has to avoid several creative abuses of the lru list detection to allow dentries on shrink lists to be still magically be on the LRU list. The main change here is that now dentries on the shrink lists *must* have the DCACHE_SHRINK_LIST flag set and be entirely removed from the LRU before being disposed of. This is probably a good cleanup to do regardless of the rest of the patch set because it removes a couple of landmines in shrink_dentry_list() that took me a while to work out... [PATCH 04/19] mm: new shrinker API [PATCH 05/19] shrinker: convert superblock shrinkers to new API These introduce the count/scan shrinker API, and for testing purposes convert the superblock shrinker to use it before any other changes are made. This gives a clean separation of counting the number of objects in a cache for pressure calculations, and the act of scanning objects in an attempt to free memory. Indeed, the scan_objects() callback now returns the number of objects freed by the scan instead of having to try to work out whether any progress was made by comparing absolute counts. This is also more efficient as we don't have to count all the objects in a cache on every scan pass. It is now done once per shrink_slab() invocation to calculate how much to scan, and we get direct feedback on how much gets reclaimed in that pass. i.e. we get reliable, accurate feedback on shrinker progress. [PATCH 06/19] list: add a new LRU list type [PATCH 07/19] inode: convert inode lru list to generic lru list [PATCH 08/19] dcache: convert to use new lru list infrastructure These add the generic LRU list API and infrastructure and convert the inode and dentry caches to use it. This is still just a single global list per LRU at this point, so it's really only changing the where the LRU implemenation is rather than the fundamental algorithm. It does, however, introduce a new method of walking the LRU lists and building the dispose list of items for shrinkers, but because we are still dealing with a global list the algorithmic changes are minimal. [PATCH 09/19] list_lru: per-node list infrastructure This makes the generic LRU list much more scalable by changing it to a {list,lock,count} tuple per node. There are no external API changes to this changeover, so is transparent to current users. [PATCH 10/19] shrinker: add node awareness [PATCH 11/19] fs: convert inode and dentry shrinking to be node Adds a nodemask to the struct shrink_control for callers of shrink_slab to set appropriately for their reclaim context. This nodemask is then passed by the inode and dentry cache reclaim code to the generic LRU list code to implement node aware shrinking. What this doesn't do is convert the internal shrink_slab() algorithm to be node aware. I'm not sure what the best approach is here, but it strikes me that it should really be calculating and keeping track of scan counts and pressure on a per-node basis. The current code seems to work OK at the moment, though. [PATCH 12/19] xfs: convert buftarg LRU to generic code [PATCH 13/19] xfs: Node aware direct inode reclaim [PATCH 14/19] xfs: use generic AG walk for background inode reclaim [PATCH 15/19] xfs: convert dquot cache lru to list_lru These patches convert all the XFS LRUs and shrinkers to be node aware. This gets rid of a lot of hairy, special case code in the inode cache shrinker for avoiding concurrent shrinker contention and to throttle direct reclaim to prevent premature OOM conditions. Removing this code greatly simplifies inode cache reclaim whilst reducing overhead and improving performance. In all, it converts three separate caches and shrinkers to use the generic LRU lists and pass nodemasks around appropriately. This is how I've really tested the code - lots of interesting filesystem workloads that generate simultaneous slab and page cache pressure on VM's with and without fake_numa configs.... [PATCH 16/19] fs: convert fs shrinkers to new scan/count API [PATCH 17/19] drivers: convert shrinkers to new count/scan API [PATCH 18/19] shrinker: convert remaining shrinkers to count/scan [PATCH 19/19] shrinker: Kill old ->shrink API. These last three patches convert all the other shrinker implementations to the new count/scan API. The fs, android and dm shrinkers are pretty well behaved and are implemented as is expected for there intended purposes. The driver and staging code, however, is basically a bunch of hacks to try to do something resembling reclaim when a shrinker tells it there is memory pressure. Looking at all the driver and staging code is not an exercise I recommend if you value your eyes and/or your sanity. I haven't even tried to compile this on a CONFIG_SMP=n configuration, nor have I done extensive allmod style build tests and it's only been built and tested on x86-64. That said, apart from CONFIG_SMP=n, I don't see there being any major problems here. There's still a bunch of cleanup work needed. e.g. the LRU list walk/isolation code needs to use enums for the isolate callback return code, there needs to be a generic list_lru_for_each() style function for walking all the objects in the cache (which will allow the list_lru structures to be used for things like the per-sb inode list). Indeed, even the name "list_lru" is probably something that should be changed - I think the list has become more of a general per-node list than it's initial functionality as a scalable LRU list implementation and I can see uses for it outside of LRUs... Comments, thoughts and flames all welcome. Cheers, Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html