On Tue, Feb 2, 2010 at 9:10 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> On Feb 1, 2010, at 7:25 PM, Mark Moseley wrote:
>>
>> On Mon, Feb 1, 2010 at 12:54 PM, Chuck Lever <chuck.lever@xxxxxxxxxx>
>> wrote:
>>>
>>> On Jan 27, 2010, at 10:48 PM, Mark Moseley wrote:
>>>>
>>>> On Wed, Jan 13, 2010 at 2:08 PM, Mark Moseley <moseleymark@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> I'm seeing an issue similar to
>>>>> http://www.spinics.net/lists/linux-nfs/msg09255.html in a heavy NFS
>>>>> environment. The topology is all Debian Etch servers (8-core Dell
>>>>> 1950s) talking to a variety of Netapp filers. In trying to diagnose
>>>>> high loads and esp high 'system' CPU usage in vmstat, using the 'perf'
>>>>> tool from the linux distro, I can see that the
>>>>> "rpcauth_lookup_credcache" call is far and away the top function in
>>>>> 'perf top'. I see similar results across ~80 servers of the same type
>>>>> of service. On servers that have been up for a while,
>>>>> rpcauth_lookup_credcache is usually ~40-50%; looking at a box rebooted
>>>>> about an hour ago, rpcauth_lookup_credcache is around ~15-25%.
>>>>> Here's a box that's been up for a while:
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>    PerfTop:  113265 irqs/sec  kernel:42.7% [100000 cycles],  (all, 8 CPUs)
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>              samples    pcnt         RIP          kernel function
>>>>>   ______     _______   _____   ________________   _______________
>>>>>
>>>>>            359151.00 - 44.8% - 00000000003d2081 : rpcauth_lookup_credcache
>>>>>             33414.00 -  4.2% - 000000000001b0ec : native_write_cr0
>>>>>             27852.00 -  3.5% - 00000000003d252c : generic_match
>>>>>             19254.00 -  2.4% - 0000000000092565 : sanitize_highpage
>>>>>             18779.00 -  2.3% - 0000000000004610 : system_call
>>>>>             12047.00 -  1.5% - 00000000000a137f : copy_user_highpage
>>>>>             11736.00 -  1.5% - 00000000003f5137 : _spin_lock
>>>>>             11066.00 -  1.4% - 00000000003f5420 : page_fault
>>>>>              8981.00 -  1.1% - 000000000001b322 : native_flush_tlb_single
>>>>>              8490.00 -  1.1% - 000000000006c98f : audit_filter_syscall
>>>>>              7169.00 -  0.9% - 0000000000208e43 : __copy_to_user_ll
>>>>>              6000.00 -  0.7% - 00000000000219c1 : kunmap_atomic
>>>>>              5262.00 -  0.7% - 00000000001fae02 : glob_match
>>>>>              4687.00 -  0.6% - 0000000000021acc : kmap_atomic_prot
>>>>>              4404.00 -  0.5% - 0000000000008fb2 : read_tsc
>>>>>
>>>>>
>>>>> I took the advice in the above thread and adjusted the
>>>>> RPC_CREDCACHE_HASHBITS #define in include/linux/sunrpc/auth.h to 12 --
>>>>> but didn't modify anything else. After doing so,
>>>>> rpcauth_lookup_credcache drops off the list (even when the top list is
>>>>> widened to 40 lines) and 'system' CPU usage drops by quite a bit,
>>>>> under the same workload. And even after a day of running, it's still
>>>>> performing favourably, despite having the same workload and uptime as
>>>>> RPC_CREDCACHE_HASHBITS=4 boxes that are still struggling. Both patched
>>>>> and unpatched kernels are 2.6.32.3, both with grsec and ipset.
>>>>> Here's 'perf top' of a patched box:
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>    PerfTop:  116525 irqs/sec  kernel:27.0% [100000 cycles],  (all, 8 CPUs)
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>              samples    pcnt         RIP          kernel function
>>>>>   ______     _______   _____   ________________   _______________
>>>>>
>>>>>             15844.00 -  7.0% - 0000000000019eb2 : native_write_cr0
>>>>>             11479.00 -  5.0% - 00000000000934fd : sanitize_highpage
>>>>>             11328.00 -  5.0% - 0000000000003d10 : system_call
>>>>>              6578.00 -  2.9% - 00000000000a26d2 : copy_user_highpage
>>>>>              6417.00 -  2.8% - 00000000003fdb80 : page_fault
>>>>>              6237.00 -  2.7% - 00000000003fd897 : _spin_lock
>>>>>              4732.00 -  2.1% - 000000000006d3b0 : audit_filter_syscall
>>>>>              4504.00 -  2.0% - 000000000020cf59 : __copy_to_user_ll
>>>>>              4309.00 -  1.9% - 000000000001a370 : native_flush_tlb_single
>>>>>              3293.00 -  1.4% - 00000000001fefba : glob_match
>>>>>              2911.00 -  1.3% - 00000000003fda25 : _spin_lock_irqsave
>>>>>              2753.00 -  1.2% - 00000000000d30f1 : __d_lookup
>>>>>              2500.00 -  1.1% - 00000000000200b8 : kunmap_atomic
>>>>>              2418.00 -  1.1% - 0000000000008483 : read_tsc
>>>>>              2387.00 -  1.0% - 0000000000089a7b : perf_poll
>>>>>
>>>>>
>>>>> My question is, is it safe to make that change to
>>>>> RPC_CREDCACHE_HASHBITS, or will that lead to some overflow somewhere
>>>>> else in the NFS/RPC stack? Looking over the code in net/sunrpc/auth.c,
>>>>> I don't see any big red flags, but I don't flatter myself into
>>>>> thinking I can debug kernel code, so I wanted to pose the question
>>>>> here. Is it pretty safe to change RPC_CREDCACHE_HASHBITS from 4 to 12?
>>>>> Or am I setting myself up for instability and/or security issues? I'd
>>>>> rather be slow than hacked.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>> I've read and reread the pertinent sections of code where
>>>> RPC_CREDCACHE_HASHBITS and RPC_CREDCACHE_NR (derived from
>>>> RPC_CREDCACHE_HASHBITS) are used, and it looks pretty safe.
>>>>
>>>> In lieu of a full sysctl-controlled setting to change
>>>> RPC_CREDCACHE_HASHBITS, would it make sense to set
>>>> RPC_CREDCACHE_HASHBITS to something bigger than 4 by default? I'd bet
>>>> a lot of other people in high-traffic environments with a large number
>>>> of active unix accounts are likely unknowingly affected by this. I
>>>> only happened to notice by playing with the kernel's perf tool.
>>>>
>>>> I could be wrong but it doesn't look like it'd tie up an excessive
>>>> amount of memory to have, say, 256 or 1024 or 4096 hash buckets in
>>>> au_credcache (though it wouldn't surprise me if I was way, way off
>>>> about that). It seems (to a non-kernel guy) that the only obvious
>>>> operation that would suffer due to more buckets would be
>>>> rpcauth_prune_expired() in net/sunrpc/auth.c. I haven't tested this
>>>> out with pre-2.6.32.x kernels, but since the default is either 16
>>>> buckets or even 8 way back in 2.6.24.x, I'm guessing that this
>>>> pertains to all recent kernels.
>>>
>>> I haven't looked at the RPC cred cache in specific, but the usual Linux
>>> kernel practice is to size hash tables based on the size of the machine's
>>> physical RAM. Smaller machines are likely to need fewer entries in the
>>> cred cache, and will probably not want to take up the fixed address
>>> space for 4096 buckets.
>>
>> 4096 might be a bit much. Though since there doesn't seem to be a
>> ceiling on the number of entries, at least memory-wise the only
>> difference in overhead would just be the size of the additional struct
>> "hlist_head" entries (at least from a non-kernel-guy perspective),
>> since it'd still have the same sum total of entries across the buckets
>> with 16 or 256 or 4096.
>>
>>> The real test of your hash table size is whether the hash function
>>> adequately spreads entries across the hash buckets, for most workloads.
>>> Helpful hint: you should test using real workloads (eg. a snapshot of
>>> credentials from a real client or server), not, for instance, synthetic
>>> workloads you made up.
>>
>> In production, it works pretty nicely. Since it looked pretty safe,
>> I've been running on 1 box in a pool of 9, all with identical
>> load-balanced workloads. The RPC_BITS-hacked box consistently spends
>> less 'system' time than the other 8. The other boxes in that
>> pool have 'perf top' stats with rpcauth_lookup_credcache in the area
>> of 30-50% (except for right after booting up; it takes a couple of
>> hours before rpcauth_lookup_credcache starts monopolizing the output).
>> On the RPC_BITS-hacked box, rpcauth_lookup_credcache never even shows
>> up in the perf top 10 or 20. I could also be abusing/misinterpreting
>> 'perf top' output :)
>
> That's evidence that it's working better, but you need to know if there are
> still any buckets that contain a large number of entries, while the others
> contain only a few. I don't recall a mention of how many entries your
> systems are caching, but even with a large hash table, if most of them end
> up in just a few buckets, it still isn't working efficiently, even though it
> might be faster.

I had actually meant (and forgotten) to ask in this thread if there
was a way to determine the bucket membership counts. I haven't been
able to find anything in /proc that looks promising, nor does it look
like it's updating any sort of counters. As far as numbers in buckets,
without a counter it's hard to tell, but at least in the hundreds,
probably into the thousands. Enough egregious directory walks by
end-users' scripts could push it even higher.

> Another way to look at it is that it shows we could get away with a small
> hash table if the hash function can be improved.
> It would help us to know what the specific problem is.
>
> You could hook up a simple printk that shows how many entries are in the
> fullest and the emptiest bucket (for example, when doing an "echo m >
> /proc/sysrq-trigger", or you could have the entry counts displayed in a
> /proc file). If the ratio of those numbers approaches 1 when there's a
> large number of entries in the cache, then you know for sure the hash
> function is working properly for your workload.

I don't rate my C remotely good enough to competently modify any
kernel code (beyond changing a constant) :) Do you know of any
examples that I could rip out and plug in here?