I have a set of machines which each act solely as an NFS server exporting a single 60TB XFS filesystem. These machines have suffered from infrequent mysterious crashes since new, but recently a new workload has raised the frequency from roughly once a month to less than a day between crashes, enabling me to chase the problem harder.

Below is a log, captured by netconsole as the machine was going down hard. I think the crucial line is:

[93662.200355] Thread overran stack, or stack corrupted

Based on that I compiled a kernel patched to set THREAD_ORDER in arch/x86/include/asm/page_64_types.h to 2 (i.e. 16k kernel stacks instead of the default 8k). A machine running that kernel has been sitting at load average 130 (128 nfsd threads) for 60 hours now without any problems; without the patch, the same load would bring it down in less than a day.

I'm fairly convinced that I've fixed my problem, but I guess it's worth posting here as the call trace might enable the devs to find and fix the stack hogs. (And then I won't have to run patched kernels in the future.)

More information that might help:

Architecture is x86_64.
Storage is Fibre Channel attached and the filesystem is hosted on an LVM block device that concatenates four partitions, so block access goes via a stack of LVM, multipath and QLogic drivers.
Network is Intel 10G Ethernet (ixgbe driver).
Kernel is 2.6.32 with Debian patches (both kernels).

Any other information needed, just let me know.

Cheers,
Simon.
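For anyone checking the arithmetic behind the patch: on x86_64 the kernel stack size is PAGE_SIZE << THREAD_ORDER. The sketch below just evaluates that expression, assuming 4 KiB pages and a stock THREAD_ORDER of 1; the function name thread_size is made up for illustration and is not a kernel symbol.

```python
# Minimal sketch of the stack-size arithmetic, assuming 4 KiB pages (x86_64).
PAGE_SIZE = 4096  # bytes; assumption, matches the usual x86_64 base page size


def thread_size(thread_order: int, page_size: int = PAGE_SIZE) -> int:
    """Return the kernel stack size in bytes: PAGE_SIZE << THREAD_ORDER."""
    return page_size << thread_order


if __name__ == "__main__":
    # Order 1 is the assumed stock value, order 2 is the patched value.
    for order in (1, 2):
        print(f"THREAD_ORDER={order}: {thread_size(order) // 1024}k stack")
```

Each increment of THREAD_ORDER doubles the per-thread stack, so the patched kernel spends an extra 8k per thread, which is small next to the memory of a dedicated file server even with 128 nfsd threads.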
[93662.195788] BUG: scheduling while atomic: nfsd/3686/0xffff8800
[93662.195842] Modules linked in: ioatdma netconsole configfs cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 8021q garp stp nfsd nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext3 jbd mbcache fuse dm_round_robin dm_multipath scsi_dh autofs4 ohci_hcd sd_mod crc_t10dif usbhid hid snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd ipmi_si ixgbe soundcore psmouse ipmi_msghandler ehci_hcd dca snd_page_alloc uhci_hcd hpilo evdev serio_raw container mdio bnx2 usbcore pcspkr nls_base power_meter qla2xxx scsi_transport_fc scsi_tgt processor button xfs exportfs dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal fan thermal_sys cciss scsi_mod
[93662.196418] Pid: 3686, comm: nfsd Not tainted 2.6.32-bpo.5-amd64 #1
[93662.196758] Call Trace:
[93662.196799] [<ffffffff812fa0a9>] ? schedule+0xce/0x7da
[93662.196837] [<ffffffff81176ae4>] ? elv_insert+0xad/0x260
[93662.196871] [<ffffffff812fabeb>] ? schedule_timeout+0x2e/0xdd
[93662.196915] [<ffffffffa0058af7>] ? dm_unplug_all+0x3b/0x4c [dm_mod]
[93662.196953] [<ffffffff812fb4cf>] ? __down_common+0x8d/0xde
[93662.196992] [<ffffffff81068533>] ? down+0x27/0x38
[93662.197053] [<ffffffffa00e2f58>] ? _xfs_buf_find+0x162/0x1e0 [xfs]
[93662.197107] [<ffffffffa00e3030>] ? xfs_buf_get_flags+0x5a/0x13b [xfs]
[93662.197162] [<ffffffffa00e3123>] ? xfs_buf_read_flags+0x12/0x7a [xfs]
[93662.197220] [<ffffffffa00da819>] ? xfs_trans_read_buf+0x189/0x27e [xfs]
[93662.197272] [<ffffffffa00a1c09>] ? xfs_read_agf+0x5a/0x149 [xfs]
[93662.197322] [<ffffffffa00a1d1a>] ? xfs_alloc_read_agf+0x22/0xa4 [xfs]
[93662.197374] [<ffffffffa00a38cf>] ? xfs_alloc_fix_freelist+0x11b/0x3dd [xfs]
[93662.197427] [<ffffffffa00a3d57>] ? xfs_alloc_vextent+0x10e/0x3e3 [xfs]
[93662.197479] [<ffffffffa00aea39>] ? xfs_bmap_btalloc+0x54f/0x732 [xfs]
[93662.197537] [<ffffffffa00b0f4f>] ? xfs_bmapi+0x876/0x104d [xfs]
[93662.197594] [<ffffffffa00c6e67>] ? xfs_iext_get_ext+0x34/0x5a [xfs]
[93662.197652] [<ffffffffa00cbf3d>] ? xfs_iomap_write_allocate+0x245/0x387 [xfs]
[93662.197725] [<ffffffffa00cc9b2>] ? xfs_iomap+0x213/0x285 [xfs]
[93662.197763] [<ffffffff8119b5e8>] ? swiotlb_map_sg_attrs+0xeb/0x107
[93662.197817] [<ffffffffa00e012c>] ? xfs_map_blocks+0x25/0x2c [xfs]
[93662.197855] [<ffffffff81191960>] ? radix_tree_delete+0xbf/0x1ba
[93662.197908] [<ffffffffa00e0d53>] ? xfs_page_state_convert+0x299/0x565 [xfs]
[93662.197950] [<ffffffffa005a3bb>] ? dm_table_any_congested+0x66/0xe6 [dm_mod]
[93662.198010] [<ffffffffa00e10b7>] ? xfs_vm_releasepage+0x98/0xa5 [xfs]
[93662.198065] [<ffffffffa00e129a>] ? xfs_vm_writepage+0xb0/0xe6 [xfs]
[93662.198105] [<ffffffff810bdfd5>] ? shrink_page_list+0x375/0x623
[93662.198140] [<ffffffff810be9b8>] ? shrink_list+0x45c/0x767
[93662.198192] [<ffffffffa00b5249>] ? xfs_btree_lookup_get_block+0x9d/0xac [xfs]
[93662.198262] [<ffffffffa00b275f>] ? xfs_bmbt_init_key_from_rec+0xc/0x14 [xfs]
[93662.198314] [<ffffffffa00b2e1c>] ? xfs_lookup_get_search_key+0x29/0x3c [xfs]
[93662.198349] [<ffffffff810bef43>] ? shrink_zone+0x280/0x342
[93662.198381] [<ffffffff810c000a>] ? try_to_free_pages+0x232/0x38e
[93662.198413] [<ffffffff810bcfff>] ? isolate_pages_global+0x0/0x20f
[93662.198450] [<ffffffff810ba098>] ? __alloc_pages_nodemask+0x3cd/0x5f5
[93662.198486] [<ffffffff810e6535>] ? new_slab+0x42/0x1ca
[93662.198516] [<ffffffff810e68ad>] ? __slab_alloc+0x1f0/0x39b
[93662.198560] [<ffffffffa00df806>] ? kmem_zone_alloc+0x5e/0xa4 [xfs]
[93662.198605] [<ffffffffa00df806>] ? kmem_zone_alloc+0x5e/0xa4 [xfs]
[93662.198637] [<ffffffff810e6d88>] ? kmem_cache_alloc+0x7f/0xf0
[93662.198697] [<ffffffffa00df806>] ? kmem_zone_alloc+0x5e/0xa4 [xfs]
[93662.198746] [<ffffffffa00df85a>] ? kmem_zone_zalloc+0xe/0x2e [xfs]
[93662.198792] [<ffffffffa00d9979>] ? _xfs_trans_alloc+0x29/0x64 [xfs]
[93662.198842] [<ffffffffa00d9bb4>] ? xfs_trans_alloc+0x95/0xa1 [xfs]
[93662.198888] [<ffffffffa00d9d57>] ? xfs_trans_unlocked_item+0x20/0x3a [xfs]
[93662.198931] [<ffffffffa009ab39>] ? xfs_qm_dqattach+0x32/0x3b [xfs]
[93662.198978] [<ffffffffa00cbdab>] ? xfs_iomap_write_allocate+0xb3/0x387 [xfs]
[93662.199031] [<ffffffffa00cc9b2>] ? xfs_iomap+0x213/0x285 [xfs]
[93662.199076] [<ffffffffa00e012c>] ? xfs_map_blocks+0x25/0x2c [xfs]
[93662.199122] [<ffffffffa00cca0f>] ? xfs_iomap+0x270/0x285 [xfs]
[93662.199169] [<ffffffffa00e0d53>] ? xfs_page_state_convert+0x299/0x565 [xfs]
[93662.199218] [<ffffffffa00e129a>] ? xfs_vm_writepage+0xb0/0xe6 [xfs]
[93662.199251] [<ffffffff810ba2ca>] ? __writepage+0xa/0x25
[93662.199283] [<ffffffff810ba951>] ? write_cache_pages+0x20b/0x327
[93662.199314] [<ffffffff810ba2c0>] ? __writepage+0x0/0x25
[93662.199347] [<ffffffff810b4925>] ? __filemap_fdatawrite_range+0x4b/0x54
[93662.199381] [<ffffffff810b4954>] ? filemap_write_and_wait_range+0x26/0x52
[93662.199426] [<ffffffffa00e6d82>] ? xfs_write+0x63b/0x6ea [xfs]
[93662.199458] [<ffffffff812fb3f5>] ? down_read+0x9/0x19
[93662.199503] [<ffffffffa00c6c36>] ? xfs_iget+0x401/0x45b [xfs]
[93662.199550] [<ffffffffa00e36ad>] ? xfs_file_aio_write+0x0/0x5d [xfs]
[93662.199584] [<ffffffff810ee511>] ? do_sync_readv_writev+0xc0/0x107
[93662.199630] [<ffffffffa00d9d57>] ? xfs_trans_unlocked_item+0x20/0x3a [xfs]
[93662.199666] [<ffffffff81064d16>] ? autoremove_wake_function+0x0/0x2e
[93662.199698] [<ffffffff810ee39d>] ? rw_copy_check_uvector+0x6d/0xe4
[93662.199733] [<ffffffff810eebad>] ? do_readv_writev+0xb2/0x115
[93662.199788] [<ffffffffa0387f5c>] ? nfsd_setuser_and_check_port+0x62/0x7c [nfsd]
[93662.199843] [<ffffffffa038987d>] ? nfsd_vfs_write+0x11a/0x329 [nfsd]
[93662.199880] [<ffffffffa038a022>] ? nfsd_open+0x137/0x16c [nfsd]
[93662.199915] [<ffffffffa038a34b>] ? nfsd_write+0xc5/0xe2 [nfsd]
[93662.199952] [<ffffffffa0390784>] ? nfsd3_proc_write+0xc7/0xe5 [nfsd]
[93662.199987] [<ffffffffa0385329>] ? nfsd_dispatch+0xdd/0x1b9 [nfsd]
[93662.200027] [<ffffffffa02c2513>] ? svc_process+0x403/0x627 [sunrpc]
[93662.200067] [<ffffffffa0385772>] ? nfsd+0x0/0x12e [nfsd]
[93662.200100] [<ffffffffa0385857>] ? nfsd+0xe5/0x12e [nfsd]
[93662.200130] [<ffffffff81064a49>] ? kthread+0x79/0x81
[93662.200162] [<ffffffff81011baa>] ? child_rip+0xa/0x20
[93662.200191] [<ffffffff810649d0>] ? kthread+0x0/0x81
[93662.200220] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[93662.200256] BUG: unable to handle kernel paging request at 000000006eab11a0
[93662.200294] IP: [<ffffffff8103fd9b>] update_curr+0xf9/0x147
[93662.200331] PGD 0
[93662.200355] Thread overran stack, or stack corrupted
[93662.200383] Oops: 0000 [#1] SMP
[93662.200414] last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:07:00.1/irq
[93662.200461] CPU 1
[93662.200486] Modules linked in: ioatdma netconsole configfs cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 8021q garp stp nfsd nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext3 jbd mbcache fuse dm_round_robin dm_multipath scsi_dh autofs4 ohci_hcd sd_mod crc_t10dif usbhid hid snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd ipmi_si ixgbe soundcore psmouse ipmi_msghandler ehci_hcd dca snd_page_alloc uhci_hcd hpilo evdev serio_raw container mdio bnx2 usbcore pcspkr nls_base power_meter qla2xxx scsi_transport_fc scsi_tgt processor button xfs exportfs dm_mirror dm_region_hash dm_log dm_snapshot dm_mod
[93662.201420] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[93662.201888] [<ffffffff812fb4cf>] ? __down_common+0x8d/0xde
[93662.208496] [<ffffffff81068533>] ? down+0x27/0x38
[93662.208539] [<ffffffffa00e2f58>] ? _xfs_buf_find+0x162/0x1e0 [xfs]
[93662.208584] [<ffffffffa00e3030>] ? xfs_buf_get_flags+0x5a/0x13b [xfs]
[93662.208629] [<ffffffffa00e3123>] ? xfs_buf_read_flags+0x12/0x7a [xfs]
[93662.208675] [<ffffffffa00da819>] ? xfs_trans_read_buf+0x189/0x27e [xfs]
[93662.208720] [<ffffffffa00a1c09>] ? xfs_read_agf+0x5a/0x149 [xfs]
[93662.208762] [<ffffffffa00a1d1a>] ? xfs_alloc_read_agf+0x22/0xa4 [xfs]
[93662.208806] [<ffffffffa00a38cf>] ? xfs_alloc_fix_freelist+0x11b/0x3dd [xfs]
[93662.208851] [<ffffffffa00a3d57>] ? xfs_alloc_vextent+0x10e/0x3e3 [xfs]
[93662.208896] [<ffffffffa00aea39>] ? xfs_bmap_btalloc+0x54f/0x732 [xfs]
[93662.208945] [<ffffffffa00b0f4f>] ? xfs_bmapi+0x876/0x104d [xfs]
[93662.208995] [<ffffffffa00c6e67>] ? xfs_iext_get_ext+0x34/0x5a [xfs]
[93662.209042] [<ffffffffa00cbf3d>] ? xfs_iomap_write_allocate+0x245/0x387 [xfs]
[93662.209108] [<ffffffffa00cc9b2>] ? xfs_iomap+0x213/0x285 [xfs]
[93662.209424] [<ffffffff810bdfd5>] ? shrink_page_list+0x375/0x623
[93662.209757] [<ffffffff810ba098>] ? __alloc_pages_nodemask+0x3cd/0x5f5
[93662.210078] [<ffffffffa00d9979>] ? _xfs_trans_alloc+0x29/0x64 [xfs]
[93662.210575] [<ffffffff810ba951>] ? write_cache_pages+0x20b/0x327
[93662.211100] [<ffffffffa038987d>] ? nfsd_vfs_write+0x11a/0x329 [nfsd] 48
[93662.211857] CR2: 000000006eab11a0
[93662.212189] Kernel panic - not syncing: Fatal exception in interrupt
[93662.212497] [<ffffffff8104e387>] ? release_console_sem+0x17e/0x1af
[93662.212818] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[93662.213328] [<ffffffffa0058af7>] ? dm_unplug_all+0x3b/0x4c [dm_mod]
[93662.213841] [<ffffffffa00a1c09>] ? xfs_read_agf+0x5a/0x149 [xfs]
[93662.214384] [<ffffffffa00cbf3d>] ? xfs_iomap_write_allocate+0x245/0x387 [xfs]
[93662.214831] [<ffffffffa005a3bb>] ? dm_table_any_congested+0x66/0xe6 [dm_mod]
[93662.215447] [<ffffffff810bef43>] ? shrink_zone+0x280/0x342
[93662.215703] [<ffffffff810e6535>] ? new_slab+0x42/0x1ca
[93662.216138] [<ffffffffa00df85a>] ? kmem_zone_zalloc+0xe/0x2e [xfs]
[93662.216448] [<ffffffffa009ab39>] ? xfs_qm_dqattach+0x32/0x3b [xfs]
[93662.217403] [<ffffffff812fb3f5>] ? down_read+0x9/0x19
[93662.217698] [<ffffffffa00d9d57>] ? xfs_trans_unlocked_item+0x20/0x3a [xfs]
[93662.218208] [<ffffffffa038a34b>] ? nfsd_write+0xc5/0xe2 [nfsd]
[93662.218482] [<ffffffffa0385772>] ? nfsd+0x0/0x12e [nfsd]
[93662.218732] [<ffffffff810649d0>] ? kthread+0x0/0x81
[93662.225424] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
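To get a rough feel for which layers contribute frames to a trace like the one above, the entries can be tallied per module. This is only a sketch: the names FRAME_RE, parse_frames and frames_per_module are made up for illustration, the regular expression assumes the "[time] [<addr>] ? symbol+0xoff/0xsize [module]" layout seen in this log, and SAMPLE is four lines copied from the trace. Entries marked "?" are return addresses the unwinder found on the stack but could not verify, so the counts say how many entries each module has, not how many bytes of stack it used.

```python
import re
from collections import Counter

# One trace entry: "[time] [<address>] ? symbol+0xoff/0xsize [module]".
# The trailing "[module]" is absent for symbols built into the kernel image.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+"                 # printk timestamp
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?"       # address, optional "?" marker
    r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"  # symbol+offset/size
    r"(?:\s+\[([A-Za-z_]\w*)\])?"         # optional module name
)

# Four entries copied from the log above (whitespace-separated on purpose,
# so the parser is exercised on flattened text as well).
SAMPLE = (
    "[93662.197053] [<ffffffffa00e2f58>] ? _xfs_buf_find+0x162/0x1e0 [xfs] "
    "[93662.197107] [<ffffffffa00e3030>] ? xfs_buf_get_flags+0x5a/0x13b [xfs] "
    "[93662.198105] [<ffffffff810bdfd5>] ? shrink_page_list+0x375/0x623 "
    "[93662.199843] [<ffffffffa038987d>] ? nfsd_vfs_write+0x11a/0x329 [nfsd]"
)


def parse_frames(log_text):
    """Return (symbol, module) pairs; built-in symbols get module 'kernel'."""
    return [(m.group(1), m.group(2) or "kernel")
            for m in FRAME_RE.finditer(log_text)]


def frames_per_module(log_text):
    """Count trace entries per module."""
    return Counter(module for _, module in parse_frames(log_text))


if __name__ == "__main__":
    for module, count in frames_per_module(SAMPLE).most_common():
        print(f"{module}: {count}")
```

Feeding the whole captured log to frames_per_module instead of SAMPLE gives the per-module breakdown for the full trace; actual per-function stack usage would need something like the kernel's stack tracer rather than a log tally.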