On Mon, Feb 1, 2010 at 12:54 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> On Jan 27, 2010, at 10:48 PM, Mark Moseley wrote:
>>
>> On Wed, Jan 13, 2010 at 2:08 PM, Mark Moseley <moseleymark@xxxxxxxxx> wrote:
>>>
>>> I'm seeing an issue similar to
>>> http://www.spinics.net/lists/linux-nfs/msg09255.html in a heavy NFS
>>> environment. The topology is all Debian Etch servers (8-core Dell
>>> 1950s) talking to a variety of Netapp filers. In trying to diagnose
>>> high loads and esp high 'system' CPU usage in vmstat, using the 'perf'
>>> tool from the linux distro, I can see that the
>>> "rpcauth_lookup_credcache" call is far and away the top function in
>>> 'perf top'. I see similar results across ~80 servers of the same type
>>> of service. On servers that have been up for a while,
>>> rpcauth_lookup_credcache is usually ~40-50%; looking at a box rebooted
>>> about an hour ago, rpcauth_lookup_credcache is around ~15-25%. Here's
>>> a box that's been up for a while:
>>>
>>> ------------------------------------------------------------------------------
>>>    PerfTop:  113265 irqs/sec  kernel:42.7% [100000 cycles],  (all, 8 CPUs)
>>> ------------------------------------------------------------------------------
>>>
>>>      samples    pcnt               RIP   kernel function
>>>      _______   _____  ________________   ________________________
>>>
>>>    359151.00 - 44.8% - 00000000003d2081 : rpcauth_lookup_credcache
>>>     33414.00 -  4.2% - 000000000001b0ec : native_write_cr0
>>>     27852.00 -  3.5% - 00000000003d252c : generic_match
>>>     19254.00 -  2.4% - 0000000000092565 : sanitize_highpage
>>>     18779.00 -  2.3% - 0000000000004610 : system_call
>>>     12047.00 -  1.5% - 00000000000a137f : copy_user_highpage
>>>     11736.00 -  1.5% - 00000000003f5137 : _spin_lock
>>>     11066.00 -  1.4% - 00000000003f5420 : page_fault
>>>      8981.00 -  1.1% - 000000000001b322 : native_flush_tlb_single
>>>      8490.00 -  1.1% - 000000000006c98f : audit_filter_syscall
>>>      7169.00 -  0.9% - 0000000000208e43 : __copy_to_user_ll
>>>      6000.00 -  0.7% - 00000000000219c1 : kunmap_atomic
>>>      5262.00 -  0.7% - 00000000001fae02 : glob_match
>>>      4687.00 -  0.6% - 0000000000021acc : kmap_atomic_prot
>>>      4404.00 -  0.5% - 0000000000008fb2 : read_tsc
>>>
>>> I took the advice in the above thread and adjusted the
>>> RPC_CREDCACHE_HASHBITS #define in include/linux/sunrpc/auth.h to 12 --
>>> but didn't modify anything else. After doing so,
>>> rpcauth_lookup_credcache drops off the list (even when the top list is
>>> widened to 40 lines) and 'system' CPU usage drops by quite a bit,
>>> under the same workload. And even after a day of running, it's still
>>> performing favourably, despite having the same workload and uptime as
>>> RPC_CREDCACHE_HASHBITS=4 boxes that are still struggling. Both patched
>>> and unpatched kernels are 2.6.32.3, both with grsec and ipset.
>>> Here's 'perf top' of a patched box:
>>>
>>> ------------------------------------------------------------------------------
>>>    PerfTop:  116525 irqs/sec  kernel:27.0% [100000 cycles],  (all, 8 CPUs)
>>> ------------------------------------------------------------------------------
>>>
>>>      samples    pcnt               RIP   kernel function
>>>      _______   _____  ________________   ________________________
>>>
>>>     15844.00 -  7.0% - 0000000000019eb2 : native_write_cr0
>>>     11479.00 -  5.0% - 00000000000934fd : sanitize_highpage
>>>     11328.00 -  5.0% - 0000000000003d10 : system_call
>>>      6578.00 -  2.9% - 00000000000a26d2 : copy_user_highpage
>>>      6417.00 -  2.8% - 00000000003fdb80 : page_fault
>>>      6237.00 -  2.7% - 00000000003fd897 : _spin_lock
>>>      4732.00 -  2.1% - 000000000006d3b0 : audit_filter_syscall
>>>      4504.00 -  2.0% - 000000000020cf59 : __copy_to_user_ll
>>>      4309.00 -  1.9% - 000000000001a370 : native_flush_tlb_single
>>>      3293.00 -  1.4% - 00000000001fefba : glob_match
>>>      2911.00 -  1.3% - 00000000003fda25 : _spin_lock_irqsave
>>>      2753.00 -  1.2% - 00000000000d30f1 : __d_lookup
>>>      2500.00 -  1.1% - 00000000000200b8 : kunmap_atomic
>>>      2418.00 -  1.1% - 0000000000008483 : read_tsc
>>>      2387.00 -  1.0% - 0000000000089a7b : perf_poll
>>>
>>> My question is, is it safe to make that change to
>>> RPC_CREDCACHE_HASHBITS, or will that lead to some overflow somewhere
>>> else in the NFS/RPC stack? Looking over the code in net/sunrpc/auth.c,
>>> I don't see any big red flags, but I don't flatter myself into
>>> thinking I can debug kernel code, so I wanted to pose the question
>>> here. Is it pretty safe to change RPC_CREDCACHE_HASHBITS from 4 to 12?
>>> Or am I setting myself up for instability and/or security issues? I'd
>>> rather be slow than hacked.
>>>
>>> Thanks!
>>
>> I've read and reread the pertinent sections of code where
>> RPC_CREDCACHE_HASHBITS and RPC_CREDCACHE_NR (derived from
>> RPC_CREDCACHE_HASHBITS) are used, and it looks pretty safe.
>>
>> In lieu of a full sysctl-controlled setting to change
>> RPC_CREDCACHE_HASHBITS, would it make sense to set
>> RPC_CREDCACHE_HASHBITS to something bigger than 4 by default? I'd bet
>> a lot of other people in high-traffic environments with a large number
>> of active unix accounts are likely unknowingly affected by this. I
>> only happened to notice by playing with the kernel's perf tool.
>>
>> I could be wrong, but it doesn't look like it'd tie up an excessive
>> amount of memory to have, say, 256 or 1024 or 4096 hash buckets in
>> au_credcache (though it wouldn't surprise me if I was way, way off
>> about that). It seems (to a non-kernel guy) that the only obvious
>> operation that would suffer due to more buckets would be
>> rpcauth_prune_expired() in net/sunrpc/auth.c. I haven't tested this
>> out with pre-2.6.32.x kernels, but since the default is either 16
>> buckets, or even 8 way back in 2.6.24.x, I'm guessing that this
>> pertains to all recent kernels.
>
> I haven't looked at the RPC cred cache specifically, but the usual Linux
> kernel practice is to size hash tables based on the size of the machine's
> physical RAM. Smaller machines are likely to need fewer entries in the cred
> cache, and will probably not want to take up the fixed address space for
> 4096 buckets. 4096 might be a bit much.
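
It may well be. For reference, the entirety of my "patch" is bumping that
one #define in include/linux/sunrpc/auth.h from 4 to 12 and leaving
everything else alone. I'm typing this from memory rather than pasting the
real diff, so treat it as a rough sketch:

    --- include/linux/sunrpc/auth.h	(against 2.6.32.3)
    -#define RPC_CREDCACHE_HASHBITS	4
    +#define RPC_CREDCACHE_HASHBITS	12
     #define RPC_CREDCACHE_NR	(1 << RPC_CREDCACHE_HASHBITS)

If I'm reading auth.h right, that takes RPC_CREDCACHE_NR from 16 to 4096
buckets per cred cache, with no other knobs touched.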
Though since there doesn't seem to be a ceiling on the number of cached
entries, the extra memory should just be the additional struct hlist_head
per bucket (at least from a non-kernel-guy perspective): the same total
number of entries ends up spread across the buckets whether there are 16,
256, or 4096 of them. I've put some rough numbers at the bottom of this
mail.

> The real test of your hash table size is whether the hash function
> adequately spreads entries across the hash buckets, for most workloads.
> Helpful hint: you should test using real workloads (e.g. a snapshot of
> credentials from a real client or server), not, for instance, synthetic
> workloads you made up.

In production it works pretty nicely. Since it looked pretty safe, I've
been running the change on one box in a pool of nine, all with identical
load-balanced workloads. The RPC_CREDCACHE_HASHBITS-patched box
consistently spends less 'system' CPU time than the other eight. The
other boxes in that pool show rpcauth_lookup_credcache at around 30-50%
in 'perf top' (except right after boot; it takes a couple of hours before
rpcauth_lookup_credcache starts monopolizing the output). On the patched
box, rpcauth_lookup_credcache never even shows up in the top 10 or 20. I
could also be abusing/misinterpreting 'perf top' output :)

> If the current hash table is small (did you say it was only four buckets?)
> then the existing hash function probably hasn't been really exercised
> appropriately to see if it actually works well on a large hash table.

In new kernels, it's 4 bits, i.e. 16 buckets. In older kernels (look at
2.6.24.x), it looks like it was 8 buckets total. (My reading of how the
bucket index gets picked is at the bottom of this mail, in case that
helps.)

> If the hash function is working adequately, a 256 bucket hash table (or
> even smaller) is probably adequate even for a few thousand entries.

For the next kernel I roll, I'll try 8 bits and report the results back to
this thread.

>> Let me know too if this would be better addressed on the kernel list.
>> I'm just assuming since it's nfs-related that this would be the spot
>> for it, but I don't know if purely RPC-related things would end up
>> here too. Thanks!
>
> I think this is the correct mailing list for this topic.

Cool, good to know.
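
P.S. To put rough numbers on the memory question: if I'm reading
include/linux/sunrpc/auth.h right, the bucket array is just a
struct hlist_head hashtable[RPC_CREDCACHE_NR] inside each rpc_cred_cache,
and an hlist_head is a single pointer (so 4 or 8 bytes), which would make
the fixed per-cache cost something like:

    HASHBITS   buckets   bucket array (32-bit / 64-bit)
           4        16          64 B  /   128 B
           8       256           1 KB /     2 KB
          12      4096          16 KB /    32 KB

times however many cred caches get created (one per rpc_auth, as far as I
can tell), plus whatever the entries themselves cost, which is the same
either way. Happy to be corrected if I'm miscounting.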
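
P.P.S. On whether the hash function spreads entries well: from my
(non-kernel-guy, possibly wrong) reading of net/sunrpc/auth.c in 2.6.32.3,
rpcauth_lookup_credcache() picks a bucket with hash_long() over the uid
and then walks that bucket's chain calling the flavour's crmatch op.
Paraphrasing from memory, the relevant bit looks roughly like:

	struct rpc_cred_cache *cache = auth->au_credcache;
	struct hlist_node *pos;
	struct rpc_cred *entry;
	unsigned int nr;

	/* bucket index: hash_long() folds the uid down to
	 * RPC_CREDCACHE_HASHBITS bits */
	nr = hash_long(acred->uid, RPC_CREDCACHE_HASHBITS);

	rcu_read_lock();
	hlist_for_each_entry_rcu(entry, pos, &cache->hashtable[nr], cr_hash) {
		if (!entry->cr_ops->crmatch(acred, entry, flags))
			continue;
		/* ... found a matching cred, take a reference, etc. ... */
	}
	rcu_read_unlock();

If that's right, then even if hash_long() spreads a few thousand active
uids perfectly evenly, 16 buckets still means chains a couple hundred
entries deep (and a crmatch call per entry), versus single-digit chains at
4096 buckets -- which would line up with generic_match also showing up in
the unpatched 'perf top' output above.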