Dave,

One other point I forgot to mention: the parent thread will wait for 5
minutes, then lower the thread priority (from -2 back to 20) and set a
global variable to signal the threads to exit. The blocked thread
responded well, exited from the D state, and fclose() completed with no
error. This makes me wonder whether some XFS threads and my application
thread might be in a deadlock.

Thanks
Norman

-----Original Message-----
From: xfs-bounces@xxxxxxxxxxx [mailto:xfs-bounces@xxxxxxxxxxx] On Behalf Of Dave Chinner
Sent: Tuesday, February 12, 2013 12:23 PM
To: Cheung, Norman
Cc: linux-xfs@xxxxxxxxxxx
Subject: Re: Hung in D state during fclose

On Tue, Feb 12, 2013 at 04:39:48PM +0000, Cheung, Norman wrote:
> > It's just as mangled. Write them to a file, make sure it is
> > formatted correctly, and attach it to the email.
> [NLC] Attached in a file, sorry for the trouble. Also pasted the
> trace again below; hopefully it will come through better.

The file and the paste came through OK.

> > kernel version 3.0.13-0.27
>
> What distribution is that from?
> [NLC] SUSE

Yeah, looks to be a SLES kernel - have you talked to your SuSE support
rep about this?

[NLC] In the process of making contact.

> 1. the disk writing thread hung in fclose
>
> Tigris_IMC.exe  D 0000000000000000     0  4197   4100 0x00000000
>  ffff881f3db921c0 0000000000000086 0000000000000000 ffff881f42eb8b80
>  ffff880861419fd8 ffff880861419fd8 ffff880861419fd8 ffff881f3db921c0
>  0000000000080000 0000000000000000 00000000000401e0 00000000061805c1
> Call Trace:
>  [<ffffffff810d89ed>] ? zone_statistics+0x9d/0xa0
>  [<ffffffffa0402682>] ? xfs_iomap_write_delay+0x172/0x2b0 [xfs]
>  [<ffffffff813c7e35>] ? rwsem_down_failed_common+0xc5/0x150
>  [<ffffffff811f32a3>] ? call_rwsem_down_write_failed+0x13/0x20
>  [<ffffffff813c74ec>] ? down_write+0x1c/0x1d
>  [<ffffffffa03fba8e>] ? xfs_ilock+0x7e/0xa0 [xfs]
>  [<ffffffffa041b64b>] ? __xfs_get_blocks+0x1db/0x3d0 [xfs]
>  [<ffffffff81103340>] ? kmem_cache_alloc+0x100/0x130
>  [<ffffffff8113fa2e>] ? alloc_page_buffers+0x6e/0xe0
>  [<ffffffff81141cdf>] ? __block_write_begin+0x1cf/0x4d0
>  [<ffffffffa041b850>] ? xfs_get_blocks_direct+0x10/0x10 [xfs]
>  [<ffffffffa041b850>] ? xfs_get_blocks_direct+0x10/0x10 [xfs]
>  [<ffffffff8114226b>] ? block_write_begin+0x4b/0xa0
>  [<ffffffffa041b8fb>] ? xfs_vm_write_begin+0x3b/0x70 [xfs]
>  [<ffffffff810c0258>] ? generic_file_buffered_write+0xf8/0x250
>  [<ffffffffa04207b5>] ? xfs_file_buffered_aio_write+0xc5/0x130 [xfs]
>  [<ffffffffa042099c>] ? xfs_file_aio_write+0x17c/0x2a0 [xfs]
>  [<ffffffff81115b28>] ? do_sync_write+0xb8/0xf0
>  [<ffffffff8119daa4>] ? security_file_permission+0x24/0xc0
>  [<ffffffff8111630a>] ? vfs_write+0xaa/0x190
>  [<ffffffff81116657>] ? sys_write+0x47/0x90
>  [<ffffffff813ce412>] ? system_call_fastpath+0x16/0x1b

So that is doing a write() from fclose, and it's waiting on the inode
XFS_ILOCK_EXCL.

/me wishes that all distros compiled their kernels with frame pointers
enabled so that analysing stack traces is better than "I'm guessing
that the real stack trace is...."

[NLC] The last entry, zone_statistics, is called only with NUMA
enabled; I wonder if I can work around this by turning off NUMA.

> 2. flush from another partition
>
> flush-8:48      D 0000000000000000     0  4217      2 0x00000000
>  ffff883fc053f580 0000000000000046 ffff881f40f348f0 ffff881f40e2aa80
>  ffff883fabb83fd8 ffff883fabb83fd8 ffff883fabb83fd8 ffff883fc053f580
>  ffff883fc27654c0 ffff881f40dfc040 0000000000000001 ffffffff810656f9
> Call Trace:
>  [<ffffffff810656f9>] ? __queue_work+0xc9/0x390
>  [<ffffffff811e3e3f>] ? cfq_insert_request+0xaf/0x4f0
>  [<ffffffff81065a06>] ? queue_work_on+0x16/0x20
>  [<ffffffff813c69cd>] ? schedule_timeout+0x1dd/0x240
>  [<ffffffffa041a762>] ? kmem_zone_zalloc+0x32/0x50 [xfs]
>  [<ffffffff813c7559>] ? __down+0x6c/0x99
>  [<ffffffff81070377>] ? down+0x37/0x40
>  [<ffffffffa041d59d>] ? xfs_buf_lock+0x1d/0x40 [xfs]
>  [<ffffffffa041d6a3>] ? _xfs_buf_find+0xe3/0x210 [xfs]
>  [<ffffffffa041dcb4>] ? xfs_buf_get+0x64/0x150 [xfs]
>  [<ffffffffa041dfb2>] ? xfs_buf_read+0x12/0xa0 [xfs]
>  [<ffffffffa04151af>] ? xfs_trans_read_buf+0x1bf/0x2f0 [xfs]
>  [<ffffffffa03d06c0>] ? xfs_read_agf+0x60/0x1b0 [xfs]
>  [<ffffffffa03cf3b7>] ? xfs_alloc_update+0x17/0x20 [xfs]
>  [<ffffffffa03d0841>] ? xfs_alloc_read_agf+0x31/0xd0 [xfs]
>  [<ffffffffa03d2083>] ? xfs_alloc_fix_freelist+0x433/0x4a0 [xfs]
>  [<ffffffff810d89ed>] ? zone_statistics+0x9d/0xa0
>  [<ffffffffa03d23a4>] ? xfs_alloc_vextent+0x184/0x4a0 [xfs]
>  [<ffffffffa03dc348>] ? xfs_bmap_btalloc+0x2d8/0x6d0 [xfs]
>  [<ffffffffa03e0efd>] ? xfs_bmapi+0x9bd/0x11a0 [xfs]
>  [<ffffffffa03d9bbc>] ? xfs_bmap_search_multi_extents+0xac/0x120 [xfs]
>  [<ffffffffa040293c>] ? xfs_iomap_write_allocate+0x17c/0x330 [xfs]
>  [<ffffffffa041b20f>] ? xfs_map_blocks+0x19f/0x1b0 [xfs]
>  [<ffffffffa041c20e>] ? xfs_vm_writepage+0x19e/0x470 [xfs]
>  [<ffffffff810c97ba>] ? __writepage+0xa/0x30
>  [<ffffffff810c9c4d>] ? write_cache_pages+0x1cd/0x3d0
>  [<ffffffff810c97b0>] ? bdi_set_max_ratio+0x90/0x90
>  [<ffffffff810c9e93>] ? generic_writepages+0x43/0x70
>  [<ffffffff81139330>] ? writeback_single_inode+0x160/0x300
>  [<ffffffff811397d4>] ? writeback_sb_inodes+0x104/0x1a0
>  [<ffffffff81139cfd>] ? writeback_inodes_wb+0x8d/0x140
>  [<ffffffff8113a05b>] ? wb_writeback+0x2ab/0x310
>  [<ffffffff813cedee>] ? apic_timer_interrupt+0xe/0x20
>  [<ffffffff8113a10e>] ? wb_check_old_data_flush+0x4e/0xa0
>  [<ffffffff8113a28b>] ? wb_do_writeback+0x12b/0x160
>  [<ffffffff8113a332>] ? bdi_writeback_thread+0x72/0x150
>  [<ffffffff8113a2c0>] ? wb_do_writeback+0x160/0x160
>  [<ffffffff8106b06e>] ? kthread+0x7e/0x90
>  [<ffffffff813cf544>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff8106aff0>] ? kthread_worker_fn+0x1a0/0x1a0
>  [<ffffffff813cf540>] ? gs_change+0x13/0x13

That's the writeback thread waiting on an AGF buffer to be unlocked.
IOWs, there are probably multiple concurrent allocations to the same
AG. But this thread will be holding the XFS_ILOCK_EXCL lock that the
other thread is waiting on.

Which thread is holding the AGF buffer is anyone's guess - it could be
waiting on IO completion, which would indicate a problem in the storage
layers below XFS. The output in dmesg from sysrq-w (echo w >
/proc/sysrq-trigger) might help indicate other blocked threads that
could be holding the AGF lock.

[NLC] Only 2 threads were in the D state. I will need to wait for the
next hang to take another stack trace. Is there any workaround to
reduce the frequency of these hangs? What about reducing
xfssyncd_centisecs? Or other knobs? See the sketch below for what I
have in mind.
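[NLC] To be concrete, this is roughly what I intend to run at the next
hang - a rough sketch only: the output file name is my own choice, and
whether lowering xfssyncd_centisecs actually helps is exactly what I am
asking above:

    #!/bin/sh
    # Dump the stacks of all blocked (D state) tasks into the kernel
    # log, then save the log before the ring buffer overwrites them.
    echo w > /proc/sysrq-trigger
    dmesg > /tmp/blocked-tasks.$(date +%Y%m%d-%H%M%S)

    # Current XFS sync interval, in centiseconds (default 3000 = 30s).
    cat /proc/sys/fs/xfs/xfssyncd_centisecs

    # Experimentally lower it to 10s - a knob to try, not a known fix.
    echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs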
.....

> > RAID layout (hardware and/or software)
>
> Hardware RAID 0, 2 disks per RAID.

What RAID controller? SAS or SATA drives? Stripe chunk/segment size?
Any BBWC?

[NLC] There are 2 RAID controllers in the system, and it has hung on
disks from both of them. One is a SuperMicro 2208 and the other is an
LSI 9265-8i; I think both use the same chipset.
[NLC] 2 SAS disks on each RAID 0 - 15K RPM.
[NLC] I am not sure of the segment size (strip size / no. of disks?),
but the strip size is 512K (see the dump below).
[NLC] No BBWC installed.
[NLC] My sunit=0 and swidth=0, and sectsz=512. Would it help to set
this to the stripe size?

Thanks,
Norman

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 1 (Target Id: 1)
Name                  :
RAID Level            : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                  : 271.945 GB
Parity Size           : 0
State                 : Optimal
Strip Size            : 512 KB
Number Of Drives      : 2
Span Depth            : 1
Default Cache Policy  : WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy  : WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy : Read/Write
Current Access Policy : Read/Write
Disk Cache Policy     : Enabled
Encryption Type       : None
Bad Blocks Exist      : No
PI type               : No PI
Is VD Cached          : No
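[NLC] For reference, my reading is that the geometry above (512 KB
strip per disk, 2 drives) would translate into something like the
following - only a sketch of what I would try, assuming the 512 KB
figure really is the per-disk strip; the device and mount point names
are just examples:

    # Alignment can be overridden at mount time; sunit/swidth are in
    # 512-byte sectors: 512 KB = 1024 sectors, x2 drives = 2048.
    mount -o sunit=1024,swidth=2048 /dev/sdd1 /data

    # A freshly made filesystem could instead be aligned at mkfs time:
    mkfs.xfs -d su=512k,sw=2 /dev/sdd1

    # Verify the geometry the filesystem actually reports:
    xfs_info /data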
Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs