http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Update to vfs scalability patches:

- Lots of fixes, particularly RCU inode stuff
- Lots of cleanups and aesthetic improvements to the code, ifdef reduction etc.
- Use bit locks for inode and dentry hashes
- Small improvements to single-threaded performance
- Split inode LRU and writeback list locking
- Per-bdi inode writeback list locking
- Per-zone mm shrinker
- Per-zone dentry and inode LRU lists
- Several fixes brought in from -rt tree testing
- No global locks remain in any fastpaths (with the arguable exception of
  rename)

I have not included the store-free path walk patches in this posting. They
require a bit more work, and they will need to be reworked after the
->d_revalidate/->follow_mount changes that Al wants to do. I prefer to
concentrate on these locking patches first.

Autofs4 is sadly missing. It's a bit tricky; the patches have to be reworked.

Performance:

Last time I was testing on a 32-node Altix, which could be considered not a
sweet spot for Linux's performance targets (ie. improvements there may not
justify the complexity). So recently I've been testing with a tightly
interconnected 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on
this size of system.

*** Single-thread microbenchmark (simple syscall loops, lower is better):

Test            Difference at 95.0% confidence (50 runs)
open/close      -6.07% +/- 1.075%
creat/unlink    +27.83% +/- 0.522%

Open/close is a little faster, which should be due to one less atomic op in
the dput common case. Creat/unlink is significantly slower, which is due to
RCU freeing of inodes. We made a performance regression tradeoff of the same
magnitude when going to RCU-freed dentries and files as well. Inode RCU is
required to reduce inode hash lookup locking and improve lock ordering, and
also for store-free path walking.

*** Let's take a look at this creat/unlink regression more closely. If we
call rdtsc around the creat/unlink loop, and just run it once (so as to avoid
most of the RCU-induced problems):

vanilla: 5328 cycles
    vfs: 5960 cycles (+11.8%)

Not so bad when RCU is not being stressed.

*** 64 parallel git diff on 64 kernel trees, fully cached (avg of 5 runs):

        vanilla         vfs
real    0m4.911s        0m0.183s
user    0m1.920s        0m1.610s
sys     4m58.670s       0m5.770s

After the vfs patches there is a 26x increase in throughput, although
parallelism is limited by the test's spawning and exit phases. The sys time
shows closer to a 50x improvement. Vanilla is bottlenecked on dcache_lock.

*** Google sockets (http://marc.info/?l=linux-kernel&m=123215942507568&w=2):

        vanilla         vfs
real    1m 7.774s       0m 3.245s
user    0m19.230s       0m36.750s
sys     71m41.310s      2m47.320s

The do_exit path for the run took 24.755s vs 1.219s.

After the vfs patches there is a 20x increase in throughput for both the
total duration and the do_exit (teardown) time.

*** file-ops test (people.redhat.com/mingo/file-ops-test/file-ops-test.c)

Parallel open/close or creat/unlink in the same or different cwds within the
same ramfs mount.
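For reference, the open/close half of such a worker boils down to a tight
loop like the one below. This is only a minimal sketch for illustration, not
the actual file-ops-test.c source; the file name, iteration count and error
handling are placeholders.

/*
 * Minimal sketch of one file-ops style worker: N such processes run in
 * parallel, each with its cwd either shared or private, on a ramfs mount.
 * "testfile" and the iteration count are placeholders, not taken from the
 * real file-ops-test.c.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long i;

	for (i = 0; i < 1000000; i++) {
		/* open/close variant: stresses dentry/inode refcounting */
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
		/*
		 * The creat/unlink variant instead does:
		 *   fd = creat("testfile", 0644); close(fd); unlink("testfile");
		 * which additionally exercises inode allocation and (now RCU)
		 * freeing.
		 */
	}
	return 0;
}

Roughly speaking, with all workers sharing a cwd the contention is on the
common parent dentry, while with private cwds it is mostly the global locks
and cacheline bouncing that remain; that is what the table below separates
out.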
Relative throughput percentages are given at each parallelism point (higher
is better):

open/close      vanilla         vfs
same cwd
 1              100.0           119.1
 2               74.2           187.4
 4               38.4            40.9
 8               18.7            27.0
16                9.0            24.9
32                5.9            24.2
64                6.0            27.7
different cwd
 1              100.0           119.1
 2              133.0           238.6
 4               21.2           488.6
 8               19.7           932.6
16               18.8          1784.1
32               18.7          3469.5
64               19.0          2858.0

creat/unlink    vanilla         vfs
same cwd
 1              100.0            75.0
 2               44.1            41.8
 4               28.7            24.6
 8               16.5            14.2
16                8.7             8.9
32                5.5             7.8
64                5.9             7.4
different cwd
 1              100.0            75.0
 2               89.8           137.2
 4               20.1           267.4
 8               17.2           513.0
16               16.2           901.9
32               15.8          1724.0
64               17.3          1161.8

Note that at 64 we start using sibling threads on the CPUs, which makes the
results jump around a bit. The drop at 64 in the different-cwd cases seems to
be hitting an RCU or slab allocator issue (or maybe it is just the SMT).

The scalability regression I was seeing in the same-cwd tests is no longer
there (it is even improved now). It may still be present in some workloads
doing common-element path lookups. This could be solved by making d_count
atomic again, at the cost of more atomic ops in some cases, but scalability
would still be limited. So I prefer to do store-free path walking, which is
much more scalable.

In the different-cwd open/close case, the cost of bouncing cachelines over
the interconnect puts an absolute upper limit of 162K open/closes per second
over the entire machine in the vanilla kernel. After the vfs patches, it is
around 30M. On larger and less well connected machines, the vanilla limit
will only get lower, while the vfs case should continue to go up (assuming
the mm subsystem can keep up).

*** Reclaim

I have not done much reclaim testing yet. It should be more scalable and
lower latency due to a significant reduction in lru locks interfering with
other critical sections in the inode/dentry code, and because we have
per-zone locks. Per-zone LRUs mean that reclaim is targeted to the correct
zone, and that kswapd will operate on lists of node-local memory objects.
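To make the per-zone LRU idea a bit more concrete, the shape of it is roughly
as below. This is just a conceptual sketch with made-up names, not code from
the patch series: each zone gets its own lock and list, so reclaim for a
given zone (and kswapd, which works per-zone) only walks objects backed by
that zone's memory.

/*
 * Conceptual sketch only; the names here are illustrative and not taken
 * from the patches.  Instead of one global LRU under one global lock, each
 * memory zone carries its own list and lock, so unrelated zones do not
 * contend and reclaim stays node/zone-local.
 */
#include <linux/list.h>
#include <linux/spinlock.h>

struct zone_object_lru {
	spinlock_t	 lock;		/* per-zone, replaces a global lru lock    */
	struct list_head list;		/* dentries/inodes backed by this zone     */
	unsigned long	 nr_items;	/* reported to the per-zone shrinker       */
};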