On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
>
> > -----Original Message-----
> > From: linux-nfs-owner@xxxxxxxxxxxxxxx
> > [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] On Behalf Of J. Bruce Fields
> > Sent: Wednesday, June 11, 2008 5:55 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; linux-kernel@xxxxxxxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> >
> > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > I will try and get it patched and retested, but it may be a
> > > day or two before I can get back the information due to
> > > production jobs now running.  Once they finish up, I will get
> > > back with the info.
> >
> > Understood.
>
> I was able to get my big user to cooperate and let me in to be able to
> get the information that you were needing.  The full output from the
> /proc/slab_allocators file is at
> http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 .  The 16
> thread case is very interesting.  Also, there is a small txt file in
> the directory that has some rpc errors, but I imagine the way that I
> am running the box (oversubscribed threads) has more to do with the
> rpc errors than anything else.  For those of you wanting the gist of
> the story, the size-4096 slab has the following very large allocation:
>
> size-4096: 2 sys_init_module+0x140b/0x1980
> size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> size-4096: 1 seq_read+0x1d9/0x2e0
> size-4096: 1 slabstats_open+0x2b/0x80
> size-4096: 5 vc_allocate+0x167/0x190
> size-4096: 3 input_allocate_device+0x12/0x80
> size-4096: 1 hid_add_field+0x122/0x290
> size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> size-4096: 1846825 __alloc_skb+0x7d/0x170
> size-4096: 3 alloc_netdev+0x33/0xa0
> size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> size-4096: 5 devinet_sysctl_register+0x28/0x110
> size-4096: 1 pidmap_init+0x15/0x60
> size-4096: 1 netlink_proto_init+0x44/0x190
> size-4096: 1 ip_rt_init+0xfd/0x2f0
> size-4096: 1 cipso_v4_init+0x13/0x70
> size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
>
> The big one seems to be the __alloc_skb.  (This is with 16 threads,
> and it says that we are using up somewhere between 12 and 14 GB of
> memory; about 2 to 3 gig of that is disk cache.)  If I were to put
> any more threads out there, the server would become almost
> unresponsive (it was bad enough as it was).
>
> At the same time, I also noticed this:
>
> skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
>
> Don't know for sure if that is meaningful or not....

OK, so, starting at net/core/skbuff.c, this means that this memory was
allocated by __alloc_skb() calls with something nonzero in the third
("fclone") argument.  The only such caller is alloc_skb_fclone().
Callers of alloc_skb_fclone() include:

	sk_stream_alloc_skb:
		do_tcp_sendpages
		tcp_sendmsg
	tcp_fragment
	tso_fragment
	tcp_mtu_probe
	tcp_send_fin
	tcp_connect
	buf_acquire: lots of callers in tipc code (whatever that is).
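For reference, the path looks roughly like the sketch below (simplified
from memory, not the exact source of net/core/skbuff.c and
include/linux/skbuff.h for this kernel).  It would also explain why the
size-4096 and skbuff_fclone_cache counts track each other: the fclone
cache allocation is the skb head, and the size-4096 allocation is the
matching ~4k data buffer.

struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
			    int fclone, int node)
{
	struct kmem_cache *cache;
	struct sk_buff *skb;
	u8 *data;

	/* fclone skbs take their head from skbuff_fclone_cache... */
	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);

	/* ...and the data buffer is a separate kmalloc; for requests
	 * around 4k that comes out of the size-4096 cache, attributed
	 * back to __alloc_skb by the caller-tracking kmalloc variant. */
	size = SKB_DATA_ALIGN(size);
	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
					 gfp_mask, node);
	/* ... */
}

static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
					       gfp_t priority)
{
	/* the only caller that passes a nonzero fclone argument */
	return __alloc_skb(size, priority, 1, -1);
}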
So unless you're using tipc, or you have something in userspace going
haywire (perhaps netstat would help rule that out?), I suppose there's
something wrong with knfsd's tcp code.  Which makes sense, I guess.

I'd think this sort of allocation would be limited by the number of
sockets times the size of the send and receive buffers.
svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
sockets to (nrthreads+3)*20.  (You aren't hitting the "too many open
connections" printk there, are you?)  The total buffer size should be
bounded by something like 4 megs.

--b.

>
> > > Thanks everyone for looking at this, by the way!
> >
> > And thanks for your persistence.
> >
> > --b.
>
> Anytime.  This is the part of the job that is fun (except for my
> users...).  Anyone can watch a system run, it's dealing with the
> unknown that makes it interesting.

OK!  I'm a bit stuck, so this will take some more work....

--b.

> Norman Weathers
>
> > > >
> > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > index 06236e4..b379e31 100644
> > > > --- a/mm/slab.c
> > > > +++ b/mm/slab.c
> > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > >  	 * above the next power of two: caches with object sizes just above a
> > > >  	 * power of two have a significant amount of internal fragmentation.
> > > >  	 */
> > > > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > >  						2 * sizeof(unsigned long long)))
> > > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > >
> > >
> > > Norman Weathers
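As an aside on that quoted mm/slab.c hunk: the effect of the threshold
check is easy to see with a few lines of userspace C.  This is only a
sketch, not kernel code; fls() is reimplemented here with the kernel
helper's semantics, and REDZONE_ALIGN is assumed to be 8, as on a
typical x86_64 build.

#include <stdio.h>

/* Same semantics as the kernel's fls(): 1-based index of the highest
 * set bit, 0 for x == 0. */
static int fls(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	const unsigned int redzone_align = 8;	/* assumed REDZONE_ALIGN */
	const unsigned int pad = redzone_align + 2 * sizeof(unsigned long long);
	const unsigned int sizes[] = { 2048, 4096, 8192 };

	for (int i = 0; i < 3; i++) {
		unsigned int size = sizes[i];
		int before = size < 4096 ||
			     fls(size - 1) == fls(size - 1 + pad);
		int after  = size < 8192 ||
			     fls(size - 1) == fls(size - 1 + pad);

		printf("size-%u: tracked before patch: %s, after: %s\n",
		       size, before ? "yes" : "no", after ? "yes" : "no");
	}
	return 0;
}

Before the patch, size == 4096 fails both halves of the test (fls(4095)
is 12, fls(4095 + 24) is 13), so the size-4096 cache never got
SLAB_STORE_USER and never showed up in /proc/slab_allocators; with the
threshold raised to 8192 it does, which is presumably what made the
__alloc_skb entry above visible.  Sizes of 8192 and up are still
skipped, hence the "and larger?" in the subject line.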