On Feb 1, 2010, at 7:25 PM, Mark Moseley wrote:
On Mon, Feb 1, 2010 at 12:54 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:

On Jan 27, 2010, at 10:48 PM, Mark Moseley wrote:

On Wed, Jan 13, 2010 at 2:08 PM, Mark Moseley <moseleymark@xxxxxxxxx> wrote:

I'm seeing an issue similar to
http://www.spinics.net/lists/linux-nfs/msg09255.html in a heavy NFS
environment. The topology is all Debian Etch servers (8-core Dell 1950s)
talking to a variety of Netapp filers. In trying to diagnose high loads, and
esp high 'system' CPU usage in vmstat, using the 'perf' tool from the linux
distro, I can see that the "rpcauth_lookup_credcache" call is far and away
the top function in 'perf top'. I see similar results across ~80 servers of
the same type of service. On servers that have been up for a while,
rpcauth_lookup_credcache is usually ~40-50%; looking at a box rebooted about
an hour ago, rpcauth_lookup_credcache is around ~15-25%. Here's a box that's
been up for a while:

------------------------------------------------------------------------------
   PerfTop:  113265 irqs/sec  kernel:42.7% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

     samples    pcnt   RIP                kernel function
     _______    ____   ________________   ________________________

   359151.00 - 44.8% - 00000000003d2081 : rpcauth_lookup_credcache
    33414.00 -  4.2% - 000000000001b0ec : native_write_cr0
    27852.00 -  3.5% - 00000000003d252c : generic_match
    19254.00 -  2.4% - 0000000000092565 : sanitize_highpage
    18779.00 -  2.3% - 0000000000004610 : system_call
    12047.00 -  1.5% - 00000000000a137f : copy_user_highpage
    11736.00 -  1.5% - 00000000003f5137 : _spin_lock
    11066.00 -  1.4% - 00000000003f5420 : page_fault
     8981.00 -  1.1% - 000000000001b322 : native_flush_tlb_single
     8490.00 -  1.1% - 000000000006c98f : audit_filter_syscall
     7169.00 -  0.9% - 0000000000208e43 : __copy_to_user_ll
     6000.00 -  0.7% - 00000000000219c1 : kunmap_atomic
     5262.00 -  0.7% - 00000000001fae02 : glob_match
     4687.00 -  0.6% - 0000000000021acc : kmap_atomic_prot
     4404.00 -  0.5% - 0000000000008fb2 : read_tsc

I took the advice in the above thread and adjusted the
RPC_CREDCACHE_HASHBITS #define in include/linux/sunrpc/auth.h to 12 -- but
didn't modify anything else. After doing so, rpcauth_lookup_credcache drops
off the list (even when the top list is widened to 40 lines) and 'system'
CPU usage drops by quite a bit under the same workload. And even after a day
of running, it's still performing favourably, despite having the same
workload and uptime as RPC_CREDCACHE_HASHBITS=4 boxes that are still
struggling. Both patched and unpatched kernels are 2.6.32.3, both with grsec
and ipset.
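For reference, here's roughly what the pieces in question look like in a
2.6.32-era tree (paraphrased from memory rather than pasted, so check your
own source; the only line I actually changed was the HASHBITS value):

	/* include/linux/sunrpc/auth.h (approximate) */
	#define RPC_CREDCACHE_HASHBITS	4	/* bumped to 12 in my test */
	#define RPC_CREDCACHE_NR	(1 << RPC_CREDCACHE_HASHBITS)

	struct rpc_cred_cache {
		struct hlist_head	hashtable[RPC_CREDCACHE_NR];
		spinlock_t		lock;
	};

	/* net/sunrpc/auth.c: the lookup buckets purely on the uid */
	nr = hash_long(acred->uid, RPC_CREDCACHE_HASHBITS);

So going from HASHBITS=4 to HASHBITS=12 grows the table from 16 to 4096
buckets.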
Here's 'perf top' on a patched box:

------------------------------------------------------------------------------
   PerfTop:  116525 irqs/sec  kernel:27.0% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

     samples    pcnt   RIP                kernel function
     _______    ____   ________________   ________________________

    15844.00 -  7.0% - 0000000000019eb2 : native_write_cr0
    11479.00 -  5.0% - 00000000000934fd : sanitize_highpage
    11328.00 -  5.0% - 0000000000003d10 : system_call
     6578.00 -  2.9% - 00000000000a26d2 : copy_user_highpage
     6417.00 -  2.8% - 00000000003fdb80 : page_fault
     6237.00 -  2.7% - 00000000003fd897 : _spin_lock
     4732.00 -  2.1% - 000000000006d3b0 : audit_filter_syscall
     4504.00 -  2.0% - 000000000020cf59 : __copy_to_user_ll
     4309.00 -  1.9% - 000000000001a370 : native_flush_tlb_single
     3293.00 -  1.4% - 00000000001fefba : glob_match
     2911.00 -  1.3% - 00000000003fda25 : _spin_lock_irqsave
     2753.00 -  1.2% - 00000000000d30f1 : __d_lookup
     2500.00 -  1.1% - 00000000000200b8 : kunmap_atomic
     2418.00 -  1.1% - 0000000000008483 : read_tsc
     2387.00 -  1.0% - 0000000000089a7b : perf_poll

My question is: is it safe to make that change to RPC_CREDCACHE_HASHBITS, or
will that lead to some overflow somewhere else in the NFS/RPC stack? Looking
over the code in net/sunrpc/auth.c, I don't see any big red flags, but I
don't flatter myself into thinking I can debug kernel code, so I wanted to
pose the question here. Is it pretty safe to change RPC_CREDCACHE_HASHBITS
from 4 to 12? Or am I setting myself up for instability and/or security
issues? I'd rather be slow than hacked. Thanks!

I've read and reread the pertinent sections of code where
RPC_CREDCACHE_HASHBITS and RPC_CREDCACHE_NR (derived from
RPC_CREDCACHE_HASHBITS) are used, and it looks pretty safe. In lieu of a
full sysctl-controlled setting for RPC_CREDCACHE_HASHBITS, would it make
sense to set it to something bigger than 4 by default? I'd bet a lot of
other people in high-traffic environments with a large number of active
unix accounts are unknowingly affected by this; I only happened to notice by
playing with the kernel's perf tool. I could be wrong, but it doesn't look
like it would tie up an excessive amount of memory to have, say, 256 or 1024
or 4096 hash buckets in au_credcache (though it wouldn't surprise me if I
was way, way off about that). It seems (to a non-kernel guy) that the only
obvious operation that would suffer from more buckets is
rpcauth_prune_expired() in net/sunrpc/auth.c. I haven't tested this with
pre-2.6.32.x kernels, but since the default is 16 buckets now, or was even 8
as far back as 2.6.24.x, I'm guessing this pertains to all recent kernels.

I haven't looked at the RPC cred cache specifically, but the usual Linux
kernel practice is to size hash tables based on the size of the machine's
physical RAM. Smaller machines are likely to need fewer entries in the cred
cache, and will probably not want to take up the fixed address space for
4096 buckets.

4096 might be a bit much. Though since there doesn't seem to be a ceiling on
the number of entries, at least memory-wise the only difference in overhead
would be the extra struct hlist_head slots in the bucket array (at least
from a non-kernel-guy perspective), since the cache would still hold the
same total number of entries across the buckets, whether there are 16 or 256
or 4096 of them.
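To put rough numbers on that (my own back-of-envelope arithmetic, assuming
sizeof(struct hlist_head) is one pointer, i.e. 8 bytes on x86-64):

	HASHBITS =  4  ->    16 buckets ->  128 bytes per cred cache
	HASHBITS =  8  ->   256 buckets ->    2 KiB per cred cache
	HASHBITS = 12  ->  4096 buckets ->   32 KiB per cred cache

The rpc_cred entries themselves are allocated on demand either way, so the
bucket array looks like the only fixed cost that changes.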
The real test of your hash table size is whether the hash function
adequately spreads entries across the hash buckets for most workloads.
Helpful hint: you should test using real workloads (e.g. a snapshot of
credentials from a real client or server), not synthetic workloads you made
up.

In production, it works pretty nicely. Since it looked pretty safe, I've
been running it on one box in a pool of nine, all with identical
load-balanced workloads. The HASHBITS-hacked box consistently spends less
time in 'system' time than the other eight. The other boxes in that pool
show rpcauth_lookup_credcache at 30-50% in 'perf top' (except for right
after booting up; it takes a couple of hours before
rpcauth_lookup_credcache starts monopolizing the output). On the hacked box,
rpcauth_lookup_credcache never even shows up in the perf top 10 or 20. I
could also be abusing/misinterpreting 'perf top' output :)
That's evidence that it's working better, but you need to know whether there
are still buckets that contain a large number of entries while the others
contain only a few. I don't recall a mention of how many entries your
systems are caching, but even with a large hash table, if most of the
entries end up in just a few buckets, it still isn't working efficiently,
even though it might be faster.
Another way to look at it: this suggests we could get away with a small hash
table if the hash function can be improved. It would help us to know what
the specific problem is.
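For example (just a sketch to illustrate the idea, not a tested proposal):
as far as I can tell the lookup buckets purely on the uid via hash_long(),
so a hash that mixes in more of the credential, say the gid, might spread
entries differently:

	#include <linux/jhash.h>

	/* Hypothetical alternative bucket choice -- untested; jhash_2words()
	 * is simply a readily available mixer for two 32-bit values. */
	static inline unsigned int cred_hash(uid_t uid, gid_t gid)
	{
		return jhash_2words(uid, gid, 0) & (RPC_CREDCACHE_NR - 1);
	}

Whether anything like that would actually help depends on measuring the
current distribution first.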
You could hook up a simple printk that shows how many entries are in the
fullest and the emptiest buckets (triggered, for example, by an "echo m >
/proc/sysrq-trigger", or you could have the entry counts displayed in a
/proc file). If the ratio of those numbers approaches 1 when there's a large
number of entries in the cache, then you know for sure the hash function is
working properly for your workload.
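Something along these lines would do it (a rough sketch against the
2.6.32-era structures as I remember them; field names may differ in your
tree):

	/* Walk every bucket in a cred cache and report the emptiest and
	 * fullest chains, so the ratio described above can be eyeballed. */
	static void rpcauth_report_credcache(struct rpc_cred_cache *cache)
	{
		unsigned int i, count, min = UINT_MAX, max = 0, total = 0;
		struct rpc_cred *cred;
		struct hlist_node *pos;

		spin_lock(&cache->lock);
		for (i = 0; i < RPC_CREDCACHE_NR; i++) {
			count = 0;
			hlist_for_each_entry(cred, pos, &cache->hashtable[i], cr_hash)
				count++;
			if (count < min)
				min = count;
			if (count > max)
				max = count;
			total += count;
		}
		spin_unlock(&cache->lock);

		printk(KERN_INFO "credcache: %u entries, emptiest bucket %u, "
		       "fullest bucket %u\n", total, min, max);
	}

If min stays near zero while max keeps growing, the entries are clumping,
and a bigger table is only papering over the hash function.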
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com