Re: XFS and USB Hang on 2.6.35.13

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 30 Jun 2011 22:19:18 +1000

On Thu, Jun 30, 2011 at 04:57:42PM +0530, Amit Sahrawat wrote:
> Hi All,
> I encountered a hang on XFS during unplug.
> *Test Case:*
> #!/bin/sh
> index=0
> while [ "$?" == 0 ]
> do
>         index=$(($index+1))
>         sync
>         cp /mnt/1KB.txt /tmp/"$index".test
> done
> Where /mnt - mount point for vfat and /tmp mount point for XFS, both can be
> XFS also.
> 
> During this operation, unplug the USB. I am getting HANG almost everytime I
> unplug.

Well, that's no surprise. The unplug appears to be losing IOs in
progress.

> *Kernel Version:* 2.6.35.13 (extremely sorry, I know next question will be
> why am I not using TOT kernel - I tried but my PC does not boot up with the
> latest one)
> 
> *Target=ARM*
> *Logs Using Kernel Hung Task Feature*
> # sh test.sh
> usb 2-1: USB disconnect, address 2
> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
> sd 0:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 00 40 08 01 00 00 02 00
> end_request: I/O error, dev sda, sector 4196353
> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
> sd 0:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 00 40 08 08 00 00 18 00
> end_request: I/O error, dev sda, sector 4196360
> end_request: I/O error, dev sda, sector 6293645
> Device sda3, XFS metadata write error block 0x1 in sda3
> xfs_force_shutdown(sda3,0x1) called from line 1031 of file
> fs/xfs/xfs_buf_item.c.  Return address = 0xc0507b1c

So the device was unplugged, there was a disconnect error, a few IO
errors and then a shutdown.

> 
> *INFO: task khubd:*33 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> khubd         D c06c261c     0    33      2 0x00000000
> Backtrace:
> [<c06c2210>] (schedule+0x0/0x500) from [<c0523f4c>]
> (_xfs_log_force+0x230/0x284)

You need to turn off line wrapping for stuff you paste into email.
The cleaned up (i.e. relevant part) trace is:

[<c06c2210>] (schedule+0x0/0x500)
[<c0523d1c>] (_xfs_log_force+0x0/0x284)
[<c052417c>] (xfs_log_force+0x0/0x38)
[<c0544e94>] (xfs_sync_data+0x0/0x58)
[<c0544f20>] (xfs_quiesce_data+0x0/0x80)
[<c05421e4>] (xfs_fs_sync_fs+0x0/0xe0)
[<c048fa74>] (__sync_filesystem+0x0/0xa0)
[<c048fb88>] (sync_filesystem+0x0/0x60)
[<c0499104>] (fsync_bdev+0x0/0x44)
[<c056c680>] (invalidate_partition+0x0/0x3c)
[<c04b88e0>] (del_gendisk+0x0/0x140)
[<c05c78a0>] (sd_remove+0x0/0x84)
[<c05b27f4>] (__device_release_driver+0x0/0xac)
[<c05b2954>] (device_release_driver+0x0/0x30)
[<c05b1ddc>] (bus_remove_device+0x0/0x8c)
[<c05b02d8>] (device_del+0x0/0x170)
[<c05c4d5c>] (__scsi_remove_device+0x0/0x90)
[<c05c23bc>] (scsi_forget_host+0x0/0x6c)
[<c05bc38c>] (scsi_remove_host+0x0/0x104)
[<c0612f94>] (quiesce_and_remove_host+0x0/0x9c)
[<c06130b4>] (usb_stor_disconnect+0x0/0x28)
[<c0601614>] (usb_unbind_interface+0x0/0xdc)
[<c05b27f4>] (__device_release_driver+0x0/0xac)
[<c05b2954>] (device_release_driver+0x0/0x30)
[<c05b1ddc>] (bus_remove_device+0x0/0x8c)
[<c05b02d8>] (device_del+0x0/0x170)
[<c05ff06c>] (usb_disable_device+0x0/0xf8)
[<c05fa8e0>] (usb_disconnect+0x0/0xf4)
[<c05fabd8>] (hub_thread+0x0/0xd78)
[<c041e61c>] (kthread+0x0/0x8c)

Well, that just looks utterly braindamaged to me.

We just had the device containing the filesystem removed from the
system, so the error handling routine ends up trying to sync the
filesystem to the device that doesn't exist anymore. WTF?

Anyway, that's not the cause of the hang, but just an example of
someone not thinking through what their error handling actually
does...

> *INFO: task xfslogd/1*:40 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfslogd/1     D c06c261c     0    40      2 0x00000000
> Backtrace:

[<c06c2210>] (schedule+0x0/0x500)
[<c0523d1c>] (_xfs_log_force+0x0/0x284)
[<c0523fa0>] (xfs_log_force_umount+0x0/0x1dc)
[<c05374cc>] (xfs_do_force_shutdown+0x0/0x164)
[<c05079d0>] (xfs_buf_iodone_callbacks+0x0/0x184)
[<c053b888>] (xfs_buf_iodone_work+0x0/0x7c)
[<c041a768>] (worker_thread+0x0/0x1e4)
[<c041e61c>] (kthread+0x0/0x8c)

That's where the shutdown has hung - only the xfslogd can complete
the IO that will allow the log force to complete, and that is not
occurring because it is waiting for the log force to complete before
it can complete the IO that will complete the log force...

AFAICT, this problem doesn't exist in TOT - the conversion of the
xfslogd workqueue to CMWQ allows processing of other xfslogd
workqueue events to continue even though this one has gone to sleep.

You probably need to change the shutdown type to
SHUTDOWN_LOG_IO_ERROR to prevent a log flush from occurring in this
shutdown context.

>  r7:00000013 r6:c040c04c r5:c041e61c r4:dbc2decc
> *INFO: task sync:*164 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> sync          D c06c261c     0   164    136 0x00000000
> Backtrace:
[<c06c2210>] (schedule+0x0/0x500)
[<c06c295c>] (schedule_timeout+0x0/0x200)
[<c06c2000>] (wait_for_common+0x0/0x164)
[<c06c21f4>] (wait_for_completion+0x0/0x1c)
[<c053a6f8>] (xfs_buf_iowait+0x0/0x5c)
[<c053ba50>] (xfs_flush_buftarg+0x0/0x180)
[<c0544f20>] (xfs_quiesce_data+0x0/0x80)
[<c05421e4>] (xfs_fs_sync_fs+0x0/0xe0)
[<c048fa74>] (__sync_filesystem+0x0/0xa0)
[<c048fb14>] (sync_one_sb+0x0/0x30)
[<c0470c2c>] (iterate_supers+0x0/0xb8)
[<c048fa10>] (sync_filesystems+0x0/0x2c)
[<c048fb44>] (sys_sync+0x0/0x44)

And that one is probably stuck waiting for the xfslogd to complete
the IO.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs