BUG: soft lockup - CPU#0 stuck for 67s! [kworker/0:5:29244] / xfs_trans_committed_bulk

Hello,

I have found reports of other hangs with the XFS filesystem, but none matching this particular problem.

Our file, NIS, and web server runs fine for some months, but then it starts hanging.

Feb 10 11:48:17 lin71 kernel: [8794161.252204] BUG: soft lockup - CPU#0 stuck for 67s! [kworker/0:5:29244]
Feb 10 11:48:17 lin71 kernel: [8794161.252240] Modules linked in: md4 hmac nls_utf8 cifs btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 parport_pc ppdev lp parport nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc xfs ext2 loop snd_pcm snd_timer i2c_i801 sg sr_mod tpm_tis ghes cdrom ioatdma i2c_core i7core_edac snd tpm soundcore snd_page_alloc edac_core processor tpm_bios dca hed evdev joydev pcspkr psmouse thermal_sys serio_raw button ext3 jbd mbcache sd_mod crc_t10dif usbhid hid dm_mod usb_storage uas ata_generic uhci_hcd ata_piix libata ehci_hcd e1000e 3w_sas scsi_mod usbcore [last unloaded: i2c_dev]
Feb 10 11:48:17 lin71 kernel: [8794161.252288] CPU 0
Feb 10 11:48:17 lin71 kernel: [8794161.252289] Modules linked in: md4 hmac nls_utf8 cifs btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 parport_pc ppdev lp parport nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc xfs ext2 loop snd_pcm snd_timer i2c_i801 sg sr_mod tpm_tis ghes cdrom ioatdma i2c_core i7core_edac snd tpm soundcore snd_page_alloc edac_core processor tpm_bios dca hed evdev joydev pcspkr psmouse thermal_sys serio_raw button ext3 jbd mbcache sd_mod crc_t10dif usbhid hid dm_mod usb_storage uas ata_generic uhci_hcd ata_piix libata ehci_hcd e1000e 3w_sas scsi_mod usbcore [last unloaded: i2c_dev]
Feb 10 11:48:17 lin71 kernel: [8794161.252327]
Feb 10 11:48:17 lin71 kernel: [8794161.252329] Pid: 29244, comm: kworker/0:5 Not tainted 2.6.39-bpo.2-amd64 #1 Supermicro X8DT6/X8DT6
Feb 10 11:48:17 lin71 kernel: [8794161.252333] RIP: 0010:[<ffffffffa03573c3>] [<ffffffffa03573c3>] xfs_trans_ail_update_bulk+0x1cc/0x1e0 [xfs]
Feb 10 11:48:17 lin71 kernel: [8794161.252354] RSP: 0018:ffff88014d553bc0 EFLAGS: 00000202
Feb 10 11:48:17 lin71 kernel: [8794161.252356] RAX: ffff88020faf9df8 RBX: 0000000000000001 RCX: 00000013001024b4
Feb 10 11:48:17 lin71 kernel: [8794161.252359] RDX: ffff88020faf9d20 RSI: 0000000000000013 RDI: ffff8801129589c0
Feb 10 11:48:17 lin71 kernel: [8794161.252361] RBP: ffff88011541ac48 R08: 0000000000000002 R09: dead000000200200
Feb 10 11:48:17 lin71 kernel: [8794161.252363] R10: dead000000100100 R11: ffff8801bbc58840 R12: ffffffff81339d4e
Feb 10 11:48:17 lin71 kernel: [8794161.252365] R13: ffff88023479d000 R14: dead000000100100 R15: ffffffff810ec5eb
Feb 10 11:48:17 lin71 kernel: [8794161.252368] FS:  0000000000000000(0000) GS:ffff88023f200000(0000) knlGS:0000000000000000
Feb 10 11:48:17 lin71 kernel: [8794161.252370] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 10 11:48:17 lin71 kernel: [8794161.252373] CR2: 00007fae00364260 CR3: 0000000001603000 CR4: 00000000000006f0
Feb 10 11:48:17 lin71 kernel: [8794161.252375] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 10 11:48:17 lin71 kernel: [8794161.252377] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 10 11:48:17 lin71 kernel: [8794161.252380] Process kworker/0:5 (pid: 29244, threadinfo ffff88014d552000, task ffff8802333ad7e0)
Feb 10 11:48:17 lin71 kernel: [8794161.252382] Stack:
Feb 10 11:48:17 lin71 kernel: [8794161.252405]  ffff8801129589c0 ffff8801129589f0 0000000000000000 ffffffff8103a4d2
Feb 10 11:48:17 lin71 kernel: [8794161.252409]  0000000000000013 ffff8801e66c8698 0000000000000283 001024b300000001
Feb 10 11:48:17 lin71 kernel: [8794161.252412]  ffff88011541ac48 ffff88011541ac48 0000000000000000 ffff88011541ac48
Feb 10 11:48:17 lin71 kernel: [8794161.252416] Call Trace:
Feb 10 11:48:17 lin71 kernel: [8794161.252444]  [<ffffffff8103a4d2>] ? __wake_up+0x35/0x46
Feb 10 11:48:17 lin71 kernel: [8794161.252457]  [<ffffffffa03560bf>] ? xfs_trans_committed_bulk+0xc5/0x13f [xfs]
Feb 10 11:48:17 lin71 kernel: [8794161.252471]  [<ffffffffa034d2fc>] ? xlog_cil_committed+0x24/0xc2 [xfs]
Feb 10 11:48:17 lin71 kernel: [8794161.252484]  [<ffffffffa034a232>] ? xlog_state_do_callback+0x13a/0x228 [xfs]
Feb 10 11:48:17 lin71 kernel: [8794161.252496]  [<ffffffffa035f66e>] ? xfs_buf_relse+0x12/0x12 [xfs]
Feb 10 11:48:17 lin71 kernel: [8794161.252501]  [<ffffffff81059f67>] ? process_one_work+0x1d1/0x2ee
Feb 10 11:48:17 lin71 kernel: [8794161.252504]  [<ffffffff8105bec7>] ? worker_thread+0x12d/0x247
Feb 10 11:48:17 lin71 kernel: [8794161.252507]  [<ffffffff8105bd9a>] ? manage_workers+0x177/0x177
Feb 10 11:48:17 lin71 kernel: [8794161.252509]  [<ffffffff8105bd9a>] ? manage_workers+0x177/0x177
Feb 10 11:48:17 lin71 kernel: [8794161.252513]  [<ffffffff8105ef65>] ? kthread+0x7a/0x82
Feb 10 11:48:17 lin71 kernel: [8794161.252518]  [<ffffffff8133a4a4>] ? kernel_thread_helper+0x4/0x10
Feb 10 11:48:17 lin71 kernel: [8794161.252521]  [<ffffffff8105eeeb>] ? kthread_worker_fn+0x147/0x147
Feb 10 11:48:17 lin71 kernel: [8794161.252524]  [<ffffffff8133a4a0>] ? gs_change+0x13/0x13
...

The full error message is here: http://dump.fangornsrealm.eu/error.txt

The scenario I have identified is as follows:
- The fileserver is synced to its mirror server with rsync, pushing to an rsync daemon.
- The memory fills up with caches (inode cache, XFS cache).
- After the sync, the memory manager frees the slab memory.
- This is when the hang happens; a small monitoring sketch to confirm this follows below.
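
To check whether the slab reclaim really coincides with the lockups, I plan to sample the kernel's memory counters around the sync and line the timestamps up with the kernel log. A rough sketch of what I have in mind (the fields are the ones /proc/meminfo exposes on these kernels; the one-second interval and output format are arbitrary):

#!/usr/bin/env python
# Sample cache/slab counters from /proc/meminfo once per second so the
# timestamps can be lined up with the soft-lockup messages in the syslog.
import time

FIELDS = ("MemFree", "Cached", "Slab", "SReclaimable")

def read_meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.split()[0])  # value is in kB
    return values

while True:
    snap = read_meminfo()
    stamp = time.strftime("%b %d %H:%M:%S")
    print(stamp + " " + " ".join("%s=%dkB" % (k, snap.get(k, 0)) for k in FIELDS))
    time.sleep(1)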

At least, that is what I have pieced together from the evidence I have. Some of the sync scripts also delete whole directory trees containing hundreds of thousands of hard links, and I have found mailing list messages saying that this workload can cause problems.
But the hangs also happen during the day, when none of these scripts are running.
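
For what it is worth, the kind of tree the scripts remove could be imitated with something like the following (only a simplified sketch; the scratch path and the sizes are made up, the real trees hold hundreds of thousands of links):

#!/usr/bin/env python
# Build a directory tree full of hard links to a single file and then
# delete it, roughly imitating what the sync scripts do on the XFS volume.
import os
import shutil

BASE = "/srv/xfstest/hardlink-tree"   # assumed scratch path on the XFS filesystem
DIRS = 100                            # made-up sizes; the real trees are far larger
LINKS_PER_DIR = 1000

os.makedirs(BASE)
source = os.path.join(BASE, "source.dat")
open(source, "w").close()

for d in range(DIRS):
    subdir = os.path.join(BASE, "dir%04d" % d)
    os.mkdir(subdir)
    for i in range(LINKS_PER_DIR):
        os.link(source, os.path.join(subdir, "link%06d" % i))

# The mass unlink at the end is where the inode/slab pressure shows up.
shutil.rmtree(BASE)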


Here is the significant part of the atop log: http://dump.fangornsrealm.eu/atop.txt

The machine is a server running Debian Squeeze. The problem is the same with the standard Debian kernels:
squeeze:           linux-image-2.6.32-5-amd64
squeeze-backports: linux-image-2.6.39-bpo.2-amd64

Some system information:
http://dump.fangornsrealm.eu/system_info.txt
http://dump.fangornsrealm.eu/modules_lin71.txt
http://dump.fangornsrealm.eu/psaux_lin71.txt


As already written, the sync process works flawlessly for many days, even weeks. The problem is that I cannot just reboot this machine whenever I want; the whole department, worldwide, depends on it. I know the memory is a little small for filesystems this big, but I don't think more memory would solve this gradual degradation over time.

Alexander Schwarzkopf

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

