Re: XFS umount with IO errors seems to lead to memory corruption

Hi Dave,
thank you for your comments.

The test that I am doing is unmounting XFS while its underlying
block device returns intermittent IO errors. The block device in
question is a custom device-mapper target; it returns -ECANCELED in
this case. Should I return some other errno instead?
The exact same test works fine with ext4: its unmount finishes, the
system seems to continue functioning normally, and kmemleak is also
happy.
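
For what it's worth, a similar intermittent-error setup can also be
built from the in-tree dm-flakey target instead of my custom one. A
rough sketch, where the device path and the up/down intervals are
arbitrary examples:

SECTORS=$(blockdev --getsz /dev/vde)
# pass I/O through for 5 seconds, then fail all I/O for 5 seconds, repeatedly
dmsetup create FLAKY --table "0 $SECTORS flakey /dev/vde 0 5 5"
mount -o noatime,sync /dev/mapper/FLAKY /export/v122_nas/

As far as I understand, dm-flakey fails all I/O during the "down"
intervals, which should be close to what my target does (modulo the
errno).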

When doing a simpler reproduction with the "error" device-mapper
target, umount gets stuck and never returns, while the kernel keeps printing:
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
XFS (dm-0): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks")
error 5 numblks 1
...

What is the expected behavior of umount in this case (when there are
IO errors during it)?

The reproduction that I do is as follows:

root@vc-00-00-1075-dev:~# cat /etc/xfs.protofile
dummy                   : bootfilename, not used, backward compatibility
0 0                             : numbers of blocks and inodes, not used, backward compatibility
d--777 0 0              : set 777 perms for the root dir
$
$

root@vc-00-00-1075-dev:~# mkfs.xfs -f /dev/vde -p /etc/xfs.protofile
meta-data=/dev/vde               isize=256    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

root@vc-00-00-1075-dev:~# blockdev --getsz /dev/vde
41943040

root@vc-00-00-1075-dev:~# dmsetup create VDE --table "0 41943040 linear /dev/vde 0"

root@vc-00-00-1075-dev:~# mount -o noatime,sync /dev/mapper/VDE /export/v122_nas/

copy some files to the mountpoint, then:

dmsetup reload VDE --table "0 41943040 error"
dmsetup resume VDE

umount /export/v122_nas

The umount never returns, and /proc shows:
root@vc-00-00-1075-dev:~# cat /proc/2684/stack
[<ffffffffa033ac6a>] xfs_ail_push_all_sync+0x9a/0xd0 [xfs]
[<ffffffffa0330123>] xfs_unmountfs+0x63/0x160 [xfs]
[<ffffffffa02ee265>] xfs_fs_put_super+0x25/0x60 [xfs]
[<ffffffff8118fd12>] generic_shutdown_super+0x62/0xf0
[<ffffffff8118fdd0>] kill_block_super+0x30/0x80
[<ffffffff811903dc>] deactivate_locked_super+0x3c/0x90
[<ffffffff81190d7e>] deactivate_super+0x4e/0x70
[<ffffffff811ad086>] mntput_no_expire+0x106/0x160
[<ffffffff811ae760>] sys_umount+0xa0/0xe0
[<ffffffff816ab919>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

And after some time, a hung task warning shows up:
INFO: task kworker/2:1:39 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/2:1     D ffffffff8180cf00     0    39      2 0x00000000
 ffff88007c54db38 0000000000000046 000000027d003700 ffff88007fd03fc0
 ffff88007c54dfd8 ffff88007c54dfd8 ffff88007c54dfd8 0000000000013e40
 ffff88007c9e9710 ffff88007c4bdc40 00000000000000b8 7fffffffffffffff
Call Trace:
 [<ffffffff816a1b99>] schedule+0x29/0x70
 [<ffffffff816a02d5>] schedule_timeout+0x1e5/0x250
 [<ffffffffa02f3987>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
 [<ffffffff816798e6>] ? kmemleak_alloc+0x26/0x50
 [<ffffffff816a0f1b>] __down_common+0xa0/0xf0
 [<ffffffffa032f37c>] ? xfs_getsb+0x3c/0x70 [xfs]
 [<ffffffff816a0fde>] __down+0x1d/0x1f
 [<ffffffff81084591>] down+0x41/0x50
 [<ffffffffa02dcd44>] xfs_buf_lock+0x44/0x110 [xfs]
 [<ffffffffa032f37c>] xfs_getsb+0x3c/0x70 [xfs]
 [<ffffffffa033b4bc>] xfs_trans_getsb+0x4c/0x140 [xfs]
 [<ffffffffa032f06e>] xfs_mod_sb+0x4e/0xc0 [xfs]
 [<ffffffffa02e3b24>] xfs_fs_log_dummy+0x54/0x90 [xfs]
 [<ffffffffa0335bf8>] xfs_log_worker+0x48/0x50 [xfs]
 [<ffffffff81077a11>] process_one_work+0x141/0x4a0
 [<ffffffff810789e8>] worker_thread+0x168/0x410
 [<ffffffff81078880>] ? manage_workers+0x120/0x120
 [<ffffffff8107df10>] kthread+0xc0/0xd0
 [<ffffffff813a3ea4>] ? acpi_get_child+0x47/0x4d
 [<ffffffff813a3fb7>] ? acpi_platform_notify.part.0+0xbb/0xda
 [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff816ab86c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0


[3] contains some of the information that the wiki asks for, captured
right before doing the unmount in the original test. I have also
re-run the original test with slub_debug=FZ, but it did not show any
warnings (perhaps I am missing something on how to enable the
red-zone debugging).
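
For reference, my understanding of how to verify that the red-zone
debugging is actually active (assuming a SLUB kernel with
CONFIG_SLUB_DEBUG=y, and slub_debug=FZ passed on the kernel command
line at boot; the kmalloc-64 cache below is just an example):

grep slub_debug /proc/cmdline
# F = sanity checks, Z = red zoning
cat /sys/kernel/slab/kmalloc-64/red_zone   # prints 1 when red-zoning is enabled for this cache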

Thanks,
Alex.

[3]
root@vc-00-00-1075-dev:/mnt/work/alex# uname -a
Linux vc-00-00-1075-dev 3.8.13-557-generic #1382000791 SMP Thu Oct 17 11:22:20 IST 2013 x86_64 x86_64 x86_64 GNU/Linux

This is a manually compiled 3.8.13 kernel, in which:
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
are set, but with no code changes.
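
For reference, the kmemleak results mentioned above can be read
through its standard debugfs interface, roughly:

mount -t debugfs none /sys/kernel/debug    # if not already mounted
echo scan > /sys/kernel/debug/kmemleak     # trigger an immediate scan
cat /sys/kernel/debug/kmemleak             # list suspected leaks; empty output when clean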

root@vc-00-00-1075-dev:/mnt/work/alex# xfs_repair -V
xfs_repair version 3.1.7

number of CPUs: 4

root@vc-00-00-1075-dev:/mnt/work/alex# cat /proc/meminfo
MemTotal:        2050940 kB
MemFree:          996376 kB
Buffers:           32836 kB
Cached:           389208 kB
SwapCached:            0 kB
Active:           490232 kB
Inactive:         239988 kB
Active(anon):     311336 kB
Inactive(anon):     1336 kB
Active(file):     178896 kB
Inactive(file):   238652 kB
Unevictable:        4632 kB
Mlocked:            4632 kB
SwapTotal:        522236 kB
SwapFree:         522236 kB
Dirty:                96 kB
Writeback:             0 kB
AnonPages:        312920 kB
Mapped:            46656 kB
Shmem:              1456 kB
Slab:             224640 kB
SReclaimable:      30040 kB
SUnreclaim:       194600 kB
KernelStack:        3632 kB
PageTables:        14584 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1547704 kB
Committed_AS:    1667900 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       30860 kB
VmallocChunk:   34359648832 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      153588 kB
DirectMap2M:     1943552 kB


root@vc-00-00-1075-dev:/mnt/work/alex# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=1016840k,nr_inodes=254210,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,relatime,size=410188k,mode=755 0 0
/dev/disk/by-uuid/c2a6ff28-cf03-4785-9ac8-60dd4ed7d3af / ext4
rw,relatime,errors=remount-ro,data=ordered 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=10256k 0 0
none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
rpc_pipefs /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.16.0.12:/export2/work /mnt/work nfs4
rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.35,local_lock=none,addr=172.16.0.12
0 0
172.16.0.12:/export2/share /mnt/share nfs4
rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.35,local_lock=none,addr=172.16.0.12
0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
cgroup_root /sys/fs/cgroup tmpfs rw,relatime 0 0
blkio /sys/fs/cgroup/blkio cgroup
rw,relatime,perf_event,blkio,freezer,devices,cpuacct,cpu,cpuset 0 0
/dev/mapper/vpart-1 /mnt/bootdir ext4 rw,relatime,(null),debug,data=ordered 0 0
/dev/dm-20 /btrfs/pool-00000007 btrfs
rw,noatime,nodatasum,nodatacow,notreelog,nospace_cache,clear_cache 0 0
/dev/dm-22 /export/v122_nas xfs rw,sync,noatime,wsync,attr2,inode64,noquota 0 0

root@vc-00-00-1075-dev:/mnt/work/alex# cat /proc/partitions
major minor  #blocks  name

 254        0   52428800 vda
 254       16   52428800 vdb
 254       32   20971520 vdc
 254       48   20971520 vdd
 254       64   20971520 vde
 254       80     512000 vdf
 254       96     512000 vdg
 254      112     512000 vdh
 254      128     512000 vdi
 254      144     512000 vdj
 254      160    1048576 vdk
 254      176    1048576 vdl
 254      192     312320 vdm
 254      208     312320 vdn
   8        0    5242880 sda
   8        1    4717568 sda1
   8        2          1 sda2
   8        5     522240 sda5
  11        0    1048575 sr0
 253        0 140737488355327 dm-0
 253        1     512000 dm-1
 253        2     512000 dm-2
 253        3     512000 dm-3
   9        1     511988 md1
 253        4     511988 dm-4
 253        5    1048576 dm-5
 253        7     312320 dm-7
 253        6   20971520 dm-6
 253        8   52428800 dm-8
 253        9   20971520 dm-9
 253       10    1048576 dm-10
 253       11     312320 dm-11
 253       12   52428800 dm-12
   9        5   18872320 md5
   9        2    1048564 md2
   9        3     312308 md3
   9        4   50329600 md4
 253       13   18872320 dm-13
 253       14     312308 dm-14
 253       15    1048564 dm-15
 253       16   50329600 dm-16
 253       17   49283072 dm-17
 253       18   49283072 dm-18
 253       19   49283072 dm-19
 253       20   49283072 dm-20
 253       21  127926272 dm-21
 253       22  127926272 dm-22

root@vc-00-00-1075-dev:/mnt/work/alex# xfs_info /export/v122_nas/
meta-data=/dev/dm-22             isize=256    agcount=4, agsize=7995392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=31981568, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=15616, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0






On Fri, Nov 22, 2013 at 12:07 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> [cc'd the correct xfs list. Please use xfs@xxxxxxxxxxx in future.]
>
> On Thu, Nov 21, 2013 at 08:04:36PM +0200, Alex Lyakas wrote:
>> Greetings,
>> I am using stock XFS from kernel 3.8.13, compiled with kmemleak
>> enabled. I am testing a particular scenario, in which the underlying
>> block device returns IO errors during XFS umount. Almost in all cases
>> this results in kernel crashes in various places, and sometimes
>> kmemleak complains, and sometimes CPU soft lockup happens. It always
>> happens after XFS messages like:
>
> What testing are you doing?
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
>>  kernel: [  600.190509] XFS (dm-22): metadata I/O error: block
>> 0x7600030 ("xlog_iodone") error 125 numblks 64
>>  kernel: [  600.192267] XFS (dm-22): xfs_do_force_shutdown(0x2) called
>> from line 1115 of file
>> /mnt/compile/linux-stable/source/fs/xfs/xfs_log.c.  Return address =
>> 0xffffffffa05cffa1
>>  kernel: [  600.192319] XFS (dm-22): Log I/O Error Detected.  Shutting
>> down filesystem
>>  kernel: [  600.192392] XFS (dm-22): Unable to update superblock
>> counters. Freespace may not be correct on next mount.
>>  kernel: [  600.192398] XFS (dm-22): xfs_log_force: error 5 returned.
>>  kernel: [  600.193687] XFS (声.Z): Please umount the filesystem and
>> rectify the problem(s)
>>
>> you can see here the garbage that XFS prints instead of the block device name.
>> In [1] and [2] I am attaching more kernel log from two such crashes.
>
> So, something is corrupting memory and stamping all over the XFS
> structures.
>
> What's error 125?
>
> #define ECANCELED       125     /* Operation Canceled */
>
> I can't find a driver that actually returns that error to
> filesystems, which....
>
>
>>  kernel: [  600.227881] Modules linked in: xfs raid1 xfrm_user
>> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 scst_vdisk(O)
>> iscsi_scst(O) scst(O) dm_zcache(O) dm_btrfs(O) btrfs(O) libcrc32c
>> dm_iostat(O)
>
> .... given you have a bunch of out-of-tree modules loaded (and some which
> are experimental) suggests that you have a problem with your
> storage...
>
> So, something is corrupting memory across a large number of
> subsystems, and the trigger is some custom code to run error
> injection. Can you reproduce the problem with something like
> dm-faulty?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx



