On Feb 1, 2010, at 7:25 PM, Mark Moseley wrote:
On Mon, Feb 1, 2010 at 12:54 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:

On Jan 27, 2010, at 10:48 PM, Mark Moseley wrote:

On Wed, Jan 13, 2010 at 2:08 PM, Mark Moseley <moseleymark@xxxxxxxxx> wrote:

I'm seeing an issue similar to
http://www.spinics.net/lists/linux-nfs/msg09255.html in a heavy NFS
environment. The topology is all Debian Etch servers (8-core Dell 1950s)
talking to a variety of Netapp filers. In trying to diagnose high loads, and
esp high 'system' CPU usage in vmstat, using the 'perf' tool from the linux
distro, I can see that the "rpcauth_lookup_credcache" call is far and away
the top function in 'perf top'. I see similar results across ~80 servers of
the same type of service. On servers that have been up for a while,
rpcauth_lookup_credcache is usually ~40-50%; looking at a box rebooted about
an hour ago, rpcauth_lookup_credcache is around ~15-25%. Here's a box that's
been up for a while:

------------------------------------------------------------------------------
   PerfTop:  113265 irqs/sec  kernel:42.7% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

     samples    pcnt   RIP                kernel function
     _______    ____   ________________   ________________________

   359151.00 - 44.8% - 00000000003d2081 : rpcauth_lookup_credcache
    33414.00 -  4.2% - 000000000001b0ec : native_write_cr0
    27852.00 -  3.5% - 00000000003d252c : generic_match
    19254.00 -  2.4% - 0000000000092565 : sanitize_highpage
    18779.00 -  2.3% - 0000000000004610 : system_call
    12047.00 -  1.5% - 00000000000a137f : copy_user_highpage
    11736.00 -  1.5% - 00000000003f5137 : _spin_lock
    11066.00 -  1.4% - 00000000003f5420 : page_fault
     8981.00 -  1.1% - 000000000001b322 : native_flush_tlb_single
     8490.00 -  1.1% - 000000000006c98f : audit_filter_syscall
     7169.00 -  0.9% - 0000000000208e43 : __copy_to_user_ll
     6000.00 -  0.7% - 00000000000219c1 : kunmap_atomic
     5262.00 -  0.7% - 00000000001fae02 : glob_match
     4687.00 -  0.6% - 0000000000021acc : kmap_atomic_prot
     4404.00 -  0.5% - 0000000000008fb2 : read_tsc

I took the advice in the above thread and adjusted the
RPC_CREDCACHE_HASHBITS #define in include/linux/sunrpc/auth.h to 12 -- but
didn't modify anything else. After doing so, rpcauth_lookup_credcache drops
off the list (even when the top list is widened to 40 lines) and 'system'
CPU usage drops by quite a bit under the same workload. And even after a day
of running, it's still performing favourably, despite having the same
workload and uptime as RPC_CREDCACHE_HASHBITS=4 boxes that are still
struggling. Both patched and unpatched kernels are 2.6.32.3, both with grsec
and ipset.
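For reference, here's roughly what the pieces in question look like in a
2.6.32-era tree (paraphrased from memory rather than pasted, so check your
own source; the only line I actually changed was the HASHBITS value):

	/* include/linux/sunrpc/auth.h (approximate) */
	#define RPC_CREDCACHE_HASHBITS	4	/* bumped to 12 in my test */
	#define RPC_CREDCACHE_NR	(1 << RPC_CREDCACHE_HASHBITS)

	struct rpc_cred_cache {
		struct hlist_head	hashtable[RPC_CREDCACHE_NR];
		spinlock_t		lock;
	};

	/* net/sunrpc/auth.c: the lookup buckets purely on the uid */
	nr = hash_long(acred->uid, RPC_CREDCACHE_HASHBITS);

So going from HASHBITS=4 to HASHBITS=12 grows the table from 16 to 4096
buckets.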
Here's 'perf top' on a patched box:

------------------------------------------------------------------------------
   PerfTop:  116525 irqs/sec  kernel:27.0% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

     samples    pcnt   RIP                kernel function
     _______    ____   ________________   ________________________

    15844.00 -  7.0% - 0000000000019eb2 : native_write_cr0
    11479.00 -  5.0% - 00000000000934fd : sanitize_highpage
    11328.00 -  5.0% - 0000000000003d10 : system_call
     6578.00 -  2.9% - 00000000000a26d2 : copy_user_highpage
     6417.00 -  2.8% - 00000000003fdb80 : page_fault
     6237.00 -  2.7% - 00000000003fd897 : _spin_lock
     4732.00 -  2.1% - 000000000006d3b0 : audit_filter_syscall
     4504.00 -  2.0% - 000000000020cf59 : __copy_to_user_ll
     4309.00 -  1.9% - 000000000001a370 : native_flush_tlb_single
     3293.00 -  1.4% - 00000000001fefba : glob_match
     2911.00 -  1.3% - 00000000003fda25 : _spin_lock_irqsave
     2753.00 -  1.2% - 00000000000d30f1 : __d_lookup
     2500.00 -  1.1% - 00000000000200b8 : kunmap_atomic
     2418.00 -  1.1% - 0000000000008483 : read_tsc
     2387.00 -  1.0% - 0000000000089a7b : perf_poll

My question is: is it safe to make that change to RPC_CREDCACHE_HASHBITS, or
will that lead to some overflow somewhere else in the NFS/RPC stack? Looking
over the code in net/sunrpc/auth.c, I don't see any big red flags, but I
don't flatter myself into thinking I can debug kernel code, so I wanted to
pose the question here. Is it pretty safe to change RPC_CREDCACHE_HASHBITS
from 4 to 12? Or am I setting myself up for instability and/or security
issues? I'd rather be slow than hacked. Thanks!

I've read and reread the pertinent sections of code where
RPC_CREDCACHE_HASHBITS and RPC_CREDCACHE_NR (derived from
RPC_CREDCACHE_HASHBITS) are used, and it looks pretty safe. In lieu of a
full sysctl-controlled setting for RPC_CREDCACHE_HASHBITS, would it make
sense to set it to something bigger than 4 by default? I'd bet a lot of
other people in high-traffic environments with a large number of active
unix accounts are unknowingly affected by this; I only happened to notice by
playing with the kernel's perf tool. I could be wrong, but it doesn't look
like it would tie up an excessive amount of memory to have, say, 256 or 1024
or 4096 hash buckets in au_credcache (though it wouldn't surprise me if I
was way, way off about that). It seems (to a non-kernel guy) that the only
obvious operation that would suffer from more buckets is
rpcauth_prune_expired() in net/sunrpc/auth.c. I haven't tested this with
pre-2.6.32.x kernels, but since the default is 16 buckets now, or was even 8
as far back as 2.6.24.x, I'm guessing this pertains to all recent kernels.

I haven't looked at the RPC cred cache specifically, but the usual Linux
kernel practice is to size hash tables based on the size of the machine's
physical RAM. Smaller machines are likely to need fewer entries in the cred
cache, and will probably not want to take up the fixed address space for
4096 buckets.

4096 might be a bit much. Though since there doesn't seem to be a ceiling on
the number of entries, at least memory-wise the only difference in overhead
would be the extra struct hlist_head slots in the bucket array (at least
from a non-kernel-guy perspective), since the cache would still hold the
same total number of entries across the buckets, whether there are 16 or 256
or 4096 of them.
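To put rough numbers on that (my own back-of-envelope arithmetic, assuming
sizeof(struct hlist_head) is one pointer, i.e. 8 bytes on x86-64):

	HASHBITS =  4  ->    16 buckets ->  128 bytes per cred cache
	HASHBITS =  8  ->   256 buckets ->    2 KiB per cred cache
	HASHBITS = 12  ->  4096 buckets ->   32 KiB per cred cache

The rpc_cred entries themselves are allocated on demand either way, so the
bucket array looks like the only fixed cost that changes.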
The real test of your hash table size is whether the hash function
adequately spreads entries across the hash buckets for most workloads.
Helpful hint: you should test using real workloads (e.g. a snapshot of
credentials from a real client or server), not synthetic workloads you made
up.

In production, it works pretty nicely. Since it looked pretty safe, I've
been running it on one box in a pool of nine, all with identical
load-balanced workloads. The HASHBITS-hacked box consistently spends less
time in 'system' time than the other eight. The other boxes in that pool
show rpcauth_lookup_credcache at 30-50% in 'perf top' (except for right
after booting up; it takes a couple of hours before
rpcauth_lookup_credcache starts monopolizing the output). On the hacked box,
rpcauth_lookup_credcache never even shows up in the perf top 10 or 20. I
could also be abusing/misinterpreting 'perf top' output :)
That's evidence that it's working better, but you need to know whether there
are still buckets that contain a large number of entries while the others
contain only a few. I don't recall a mention of how many entries your
systems are caching, but even with a large hash table, if most of the
entries end up in just a few buckets, it still isn't working efficiently,
even though it might be faster.
Another way to look at it: this suggests we could get away with a small hash
table if the hash function can be improved. It would help us to know what
the specific problem is.
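For example (just a sketch to illustrate the idea, not a tested proposal):
as far as I can tell the lookup buckets purely on the uid via hash_long(),
so a hash that mixes in more of the credential, say the gid, might spread
entries differently:

	#include <linux/jhash.h>

	/* Hypothetical alternative bucket choice -- untested; jhash_2words()
	 * is simply a readily available mixer for two 32-bit values. */
	static inline unsigned int cred_hash(uid_t uid, gid_t gid)
	{
		return jhash_2words(uid, gid, 0) & (RPC_CREDCACHE_NR - 1);
	}

Whether anything like that would actually help depends on measuring the
current distribution first.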
You could hook up a simple printk that shows how many entries are in the
fullest and the emptiest buckets (triggered, for example, by an "echo m >
/proc/sysrq-trigger", or you could have the entry counts displayed in a
/proc file). If the ratio of those numbers approaches 1 when there's a large
number of entries in the cache, then you know for sure the hash function is
working properly for your workload.
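Something along these lines would do it (a rough sketch against the
2.6.32-era structures as I remember them; field names may differ in your
tree):

	/* Walk every bucket in a cred cache and report the emptiest and
	 * fullest chains, so the ratio described above can be eyeballed. */
	static void rpcauth_report_credcache(struct rpc_cred_cache *cache)
	{
		unsigned int i, count, min = UINT_MAX, max = 0, total = 0;
		struct rpc_cred *cred;
		struct hlist_node *pos;

		spin_lock(&cache->lock);
		for (i = 0; i < RPC_CREDCACHE_NR; i++) {
			count = 0;
			hlist_for_each_entry(cred, pos, &cache->hashtable[i], cr_hash)
				count++;
			if (count < min)
				min = count;
			if (count > max)
				max = count;
			total += count;
		}
		spin_unlock(&cache->lock);

		printk(KERN_INFO "credcache: %u entries, emptiest bucket %u, "
		       "fullest bucket %u\n", total, min, max);
	}

If min stays near zero while max keeps growing, the entries are clumping,
and a bigger table is only papering over the hash function.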
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com