Re: xfs umount with i/o error hang/memory corruption

Bob Mastors <bob.mastors@xxxxxxxxxxxxx> · Fri, 4 Apr 2014 15:40:38 -0600

On Fri, Apr 4, 2014 at 3:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:

On Fri, Apr 04, 2014 at 12:15:23PM -0600, Bob Mastors wrote:

> Greetings,

>

> I am new to xfs and am running into a problem

> and would appreciate any guidance on how to proceed.

>

> After an i/o error from the block device that xfs is using,

> an umount results in a message like:

> [  370.636473] XFS (sdx): Log I/O Error Detected.  Shutting down filesystem

> [  370.644073] XFS (h           ���h"h          ���H#h          ���bsg):

> Please umount the filesystem and rectify the problem(s)

> Note the garbage on the previous line which suggests memory corruption.

> About half the time I get the garbled log message. About half the time

> umount hangs.

I got an email about this last night with a different trigger - thin

provisioning failing log IO in the unmount path. I know what the

problem is, I just don't have a fix for it yet.

To confirm it's the same problem, can you post the entirity of the

dmesg where the error occurs?
Here is the console output from putting a BUG() call  in xfs_do_force_shutdown
after the "Please umount ..." message.
Let me know if want something else.

I don't know how much help I can be on this, but I can run more tests
or try out fixes.

Thanks,
Bob

[  111.004665] XFS (sdb): Mounting Filesystem
[  111.099442] XFS (sdb): Ending clean mount
[  129.548644] XFS (sdb): metadata I/O error: block 0x19022 ("xlog_iodone") error 19 numblks 64
[  129.618739] XFS (sdb): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa04c8001
[  129.699768] XFS (sdb): Log I/O Error Detected.  Shutting down filesystem
[  129.699924] XFS (sdb): Unable to update superblock counters. Freespace may not be correct on next mount.
[  129.699927] XFS (sdb): xfs_log_force: error 5 returned.
[  129.716031] XFS (P専/): Please umount the filesystem and rectify the problem(s)
[  129.718411] ------------[ cut here ]------------
[  129.719120] Kernel BUG at ffffffffa04726ac [verbose debug info unavailable]
[  129.720016] invalid opcode: 0000 [#1] SMP
[  129.720016] Modules linked in: xfs(F) libcrc32c ib_srpt tcm_qla2xxx qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc scsi_tgt iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod configfs ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs_acl nfs lockd fscac\
he sunrpc snd_intel8x0 snd_ac97_codec parport_pc parport ac97_bus snd_pcm psmouse snd_timer serio_raw snd joydev evbug mac_hid pcspkr i2c_piix4 soundcore hid_generic usbhid hid e1000 ahci libahci
[  129.720016] CPU: 1 PID: 234 Comm: kworker/1:1H Tainted: GF            3.14.0+ #1
[  129.720016] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[  129.720016] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  129.720016] task: ffff88002f598000 ti: ffff88002fecc000 task.ti: ffff88002fecc000
[  129.720016] RIP: 0010:[<ffffffffa04726ac>]  [<ffffffffa04726ac>] xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.720016] RSP: 0018:ffff88002fecdd68  EFLAGS: 00010296
[  129.720016] RAX: 0000000000000043 RBX: 0000000000000002 RCX: 0000000000000000
[  129.720016] RDX: ffff88003fd0efb0 RSI: ffff88003fd0d3d8 RDI: ffff88003fd0d3d8
[  129.720016] RBP: ffff88002fecdd98 R08: 0000000000000000 R09: 0000000000000000
[  129.720016] R10: 00000000000001cf R11: 00000000000001ce R12: ffff880028865000
[  129.720016] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[  129.720016] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[  129.720016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  129.720016] CR2: 00000000006105e8 CR3: 0000000001c0e000 CR4: 00000000000006e0
[  129.720016] Stack:
[  129.720016]  ffffffffa04c8001 0000000000000000 ffff880030140300 ffff88003026dc00

[  129.720016]  ffff88002fa13980 0000000000000000 ffff88002fecddc8 ffffffffa04c8001
[  129.720016]  ffff88003653cf48 ffff88002fa13a20 ffff88002fa13980 ffffe8ffffd01e00
[  129.720016] Call Trace:

[  129.720016]  [<ffffffffa04c8001>] ? xlog_iodone+0x91/0xa0 [xfs]
[  129.720016]  [<ffffffffa04c8001>] xlog_iodone+0x91/0xa0 [xfs]
[  129.720016]  [<ffffffffa046ac3e>] xfs_buf_iodone_work+0x5e/0xc0 [xfs]
[  129.720016]  [<ffffffff8106ad04>] process_one_work+0x174/0x4a0
[  129.720016]  [<ffffffff8106c0ab>] worker_thread+0x11b/0x370
[  129.720016]  [<ffffffff8106bf90>] ? manage_workers.isra.20+0x2d0/0x2d0
[  129.720016]  [<ffffffff81072c29>] kthread+0xc9/0xe0
[  129.720016]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.720016]  [<ffffffff816d6aec>] ret_from_fork+0x7c/0xb0
[  129.720016]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.720016] Code: d0 45 85 ed 0f 85 a5 00 00 00 f6 c3 20 66 90 75 7e 83 e3 10 0f 84 b5 00 00 00 48 c7 c6 18 8c 4e a0 4c 89 e7 31 c0 e8 54 92 00 00 <0f> 0b 66 90 48 8b 45 08 49 89 d1 41 89 c8 48 c7 c2 d0 4f 4e a0
[  129.720016] RIP  [<ffffffffa04726ac>] xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.720016]  RSP <ffff88002fecdd68>
[  129.777164] ---[ end trace d335180e3c345156 ]---
[  129.777953] BUG: unable to handle kernel paging request at ffffffffffffffd8
[  129.779087] IP: [<ffffffff81072c80>] kthread_data+0x10/0x20
[  129.779960] PGD 1c0f067 PUD 1c11067 PMD 0
[  129.780900] Oops: 0000 [#2] SMP
[  129.781150] Modules linked in: xfs(F) libcrc32c ib_srpt tcm_qla2xxx qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc scsi_tgt iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod configfs ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs_acl nfs lockd fscac\
he sunrpc snd_intel8x0 snd_ac97_codec parport_pc parport ac97_bus snd_pcm psmouse snd_timer serio_raw snd joydev evbug mac_hid pcspkr i2c_piix4 soundcore hid_generic usbhid hid e1000 ahci libahci
[  129.781150] CPU: 1 PID: 234 Comm: kworker/1:1H Tainted: GF     D      3.14.0+ #1
[  129.781150] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[  129.781150] task: ffff88002f598000 ti: ffff88002fecc000 task.ti: ffff88002fecc000
[  129.781150] RIP: 0010:[<ffffffff81072c80>]  [<ffffffff81072c80>] kthread_data+0x10/0x20
[  129.781150] RSP: 0018:ffff88002fecd9f8  EFLAGS: 00010046
[  129.781150] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81e75380
[  129.781150] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88002f598000
[  129.781150] RBP: ffff88002fecd9f8 R08: 0000000000000000 R09: 0000000000000164
[  129.781150] R10: ffffffff8104c8b4 R11: 0000000000000000 R12: 0000000000000001
[  129.781150] R13: ffff88002f5984e0 R14: ffff88003d9f8000 R15: ffff88002f5983d8
[  129.781150] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[  129.781150] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  129.781150] CR2: 0000000000000028 CR3: 0000000001c0e000 CR4: 00000000000006e0
[  129.781150] Stack:
[  129.781150]  ffff88002fecda18 ffffffff8106c526 ffff88003fd13140 0000000000000001
[  129.781150]  ffff88002fecda98 ffffffff816ca7cf 0000000000000001 ffff88002f598000

[  129.781150]  ffff88002f598000 0000000000013140 ffff88002fecdfd8 0000000000013140
[  129.781150] Call Trace:
[  129.781150]  [<ffffffff8106c526>] wq_worker_sleeping+0x16/0xa0

[  129.781150]  [<ffffffff816ca7cf>] __schedule+0x56f/0x7c0
[  129.781150]  [<ffffffff816caaf9>] schedule+0x29/0x70
[  129.781150]  [<ffffffff81052075>] do_exit+0x6e5/0xa70

[  129.781150]  [<ffffffff816c2303>] ? printk+0x4d/0x4f
[  129.781150]  [<ffffffff816cf0e1>] oops_end+0xa1/0xf0
[  129.781150]  [<ffffffff81006208>] die+0x58/0x90
[  129.781150]  [<ffffffff816ce93b>] do_trap+0xcb/0x170
[  129.781150]  [<ffffffff8100308d>] do_invalid_op+0xad/0x100
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.781150]  [<ffffffff816c2303>] ? printk+0x4d/0x4f
[  129.781150]  [<ffffffff816d8088>] invalid_op+0x18/0x20
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.781150]  [<ffffffffa04c8001>] ? xlog_iodone+0x91/0xa0 [xfs]
[  129.781150]  [<ffffffffa04c8001>] xlog_iodone+0x91/0xa0 [xfs]
[  129.781150]  [<ffffffffa046ac3e>] xfs_buf_iodone_work+0x5e/0xc0 [xfs]
[  129.781150]  [<ffffffff8106ad04>] process_one_work+0x174/0x4a0
[  129.781150]  [<ffffffff8106c0ab>] worker_thread+0x11b/0x370
[  129.781150]  [<ffffffff8106bf90>] ? manage_workers.isra.20+0x2d0/0x2d0
[  129.781150]  [<ffffffff81072c29>] kthread+0xc9/0xe0
[  129.781150]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.781150]  [<ffffffff816d6aec>] ret_from_fork+0x7c/0xb0
[  129.781150]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.781150] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 88 04 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  129.781150] RIP  [<ffffffff81072c80>] kthread_data+0x10/0x20
[  129.781150]  RSP <ffff88002fecd9f8>
[  129.781150] CR2: ffffffffffffffd8
[  129.781150] ---[ end trace d335180e3c345157 ]---
[  129.781150] Fixing recursive fault but reboot is needed!

In essence, the log IO failure is triggering a shutdown, and as part

of the shutdown process it wakes anyone waiting on a log force.

The log quiesce code that waits for log completion during unmount

uses a log force to ensure the log is idle before tearing down all

the log structures and finishing the unmount. Unfortunatey, the log

force the unmount blocks on is woken prematurely by the shutdown,

and hence it runs before the log IO processing is completed. Hence

the use after free.

> And then I get this kind of error and the system is unresponsive:

> Message from syslogd@debian at Apr  4 09:27:40 ...

>  kernel:[  680.080022] BUG: soft lockup - CPU#2 stuck for 22s! [umount:2849]

>

> The problem appears to be similar to this issue:

> http://www.spinics.net/lists/linux-xfs/msg00061.html

Similar symptoms, but not the same bug.

> The problem is triggered by doing an iscsi logout which causes

> the block device to return i/o errors to xfs.

> Steps to reproduce the problem are below.

Seems similar to the thinp ENOSPC issue I mentioned above - data IO

errors occur, then you do an unmount, which causes a log IO error

writing the superblock, and then this happens....

Cheers,

Dave.

--

Dave Chinner

david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs