Re: [PATCH] xfs: test inode allocation state mismatch corruption

Zorro Lang <zlang@xxxxxxxxxx> · Thu, 29 Mar 2018 11:46:22 +0800

On Wed, Mar 28, 2018 at 09:24:37AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 28, 2018 at 10:06:31PM +0800, Zorro Lang wrote:
> > There's a situation where the directory structure and the inobt
> > think the inode is free, but the inode on disk thinks it is still
> > in use. XFS should detect it and prevent the kernel from oopsing
> > on lookup.
> > 
> > Signed-off-by: Zorro Lang <zlang@xxxxxxxxxx>
> > ---
> > 
> > Hi,
> > 
> > There's a weird issue:
> > 
> > When I run this case on an upstream general kernel (4.16-rc6 without
> > XFS_WARN/XFS_DEBUG config), it triggers a soft lockup bug[1],
> > and the case blocks there. But if I use Dave's patch:
> > (https://marc.info/?l=linux-xfs&m=152161877728015&w=2)
> > the test passes. I don't know if this soft lockup bug is what
> > Dave tried to fix in his patch too?
> > 
> > If I test on an upstream kernel with XFS_WARN, I don't hit this
> > soft lockup issue, just the issue below, as expected:
> > XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c
> > 
> > When I test on a RHEL-7 debug kernel (with XFS_WARN), it triggers the
> > soft lockup bug again.
> > 
> > Thanks,
> > Zorro
> > 
> > [1]
> > [  455.751099] watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [umount:2631]
> > [  455.781145] Modules linked in: sunrpc coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate hpilo intel_rapl_perf wmi ipmi_si iTCO_wdt hpwdt iTCO_vendor_support ipmi_devintf sg ipmi_msghandler acpi_power_meter ioatdma pcspkr shpchp i2c_i801 pcc_cpufreq dca lpc_ich ip_tables xfs libcrc32c uas usb_storage sd_mod tg3 hwmon mgag200 xhci_pci ptp crc32c_intel serio_raw xhci_hcd hpsa ttm pps_core scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod dax ipv6 crc_ccitt autofs4
> > [  456.029470] CPU: 12 PID: 2631 Comm: umount Tainted: G             L   4.16.0-rc6+ #3
> > [  456.058306] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
> > [  456.081804] RIP: 0010:fsnotify_unmount_inodes+0xcc/0x100
> > [  456.099735] RSP: 0018:ffffc900074b3e50 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
> > [  456.127922] RAX: 0000000000000000 RBX: ffff88045cecd178 RCX: 000000000000001b
> > [  456.154306] RDX: 0000000000000001 RSI: ffffc900074b3d30 RDI: ffff88045cecd200
> > [  456.180539] RBP: 0000000000000000 R08: 000000000000000f R09: ffffc900074b3db8
> > [  456.206731] R10: 000000000000035c R11: 0000000000000018 R12: ffff880465c1cd88
> > [  456.232869] R13: ffff880465c1c800 R14: ffff880465c1cd80 R15: 0000000000000000
> > [  456.259048] FS:  00007f698e06b880(0000) GS:ffff88046f500000(0000) knlGS:0000000000000000
> > [  456.292396] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  456.314274] CR2: 000055ae574a4628 CR3: 00000004699d6002 CR4: 00000000001606e0
> > [  456.340388] Call Trace:
> > [  456.345439]  generic_shutdown_super+0x32/0x110
> > [  456.359532]  kill_block_super+0x21/0x50
> > [  456.370883]  deactivate_locked_super+0x3f/0x70
> > [  456.384883]  cleanup_mnt+0x3b/0x70
> > [  456.394269]  task_work_run+0x92/0xb0
> > [  456.404408]  exit_to_usermode_loop+0x6c/0x99
> > [  456.417663]  do_syscall_64+0xf5/0x130
> > [  456.428266]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  456.445027] RIP: 0033:0x7f698d2ddb87
> > [  456.455141] RSP: 002b:00007fffb980d058 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > [  456.483339] RAX: 0000000000000000 RBX: 000055ae5749c080 RCX: 00007f698d2ddb87
> > [  456.509478] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055ae574a3460
> > [  456.535573] RBP: 000055ae574a3460 R08: 000055ae574a3480 R09: 0000000000000000
> > [  456.561797] R10: 00007fffb980cae0 R11: 0000000000000246 R12: 00007f698de58d58
> > [  456.588281] R13: 0000000000000000 R14: 000055ae5749c270 R15: 000055ae5749c080
> > [  456.614425] Code: 8d 98 e0 fe ff ff 74 2c 48 8d bb 88 00 00 00 e8 5b fa 52 00 f6 83 a0 00 00 00 38 75 0e 8b 83 58 01 00 00 85 c0 0f 85 74 ff ff ff <c6> 83 88 00 00 00 00 eb c1 41 c6 85 80 05 00 00 00 48 85 ed 74

Any idea whether this is what https://marc.info/?l=linux-xfs&m=152161877728015&w=2 tries to fix?

> > 
> > 
> > 
> >  tests/xfs/444     | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/444.out |   2 +
> >  tests/xfs/group   |   1 +
> >  3 files changed, 129 insertions(+)
> >  create mode 100755 tests/xfs/444
> >  create mode 100644 tests/xfs/444.out
> > 
> > diff --git a/tests/xfs/444 b/tests/xfs/444
> > new file mode 100755
> > index 00000000..58848f4f
> > --- /dev/null
> > +++ b/tests/xfs/444
> > @@ -0,0 +1,126 @@
> > +#! /bin/bash
> > +# FS QA Test 444
> > +#
> > +# Test a corruption where the directory structure and the inobt think the inode
> > +# is free, but the inode on disk thinks it is still in use.
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2018 YOUR NAME HERE.  All Rights Reserved.
> 
> Nice patch Mr. HERE.

Ah, I always forget to change this in the V1 patch...

> 
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1	# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs xfs
> > +_supported_os Linux
> > +_require_scratch_nocheck
> > +_require_no_xfs_bug_on_assert
> > +
> > +_filter_dmesg()
> > +{
> > +	local warn1="Internal error xfs_trans_cancel.*fs/xfs/xfs_trans\.c.*"
> > +	local warn2="WARNING:.*fs/xfs/xfs_message\.c:.*assfail.*"
> > +
> > +	sed -e "s#$warn1#Intentional error in xfs_trans_cancel#" \
> > +	    -e "s#$warn2#Intentional warnings in assfail#"
> > +}
> > +# If the expected behavior is a kernel warning, disable the dmesg check; needs more checking!
> > +#_disable_dmesg_check
> 
> Why is this commented out?  Can it go away?

Yeah, it should be removed.

> 
> > +
> > +# Use crc=0, because this crash is only possible on v4 XFS or v5 XFS mounted
> > +# with the ikeep mount option. For all other V5 XFS, this problem cannot
> > +# occur because we don't read inodes we are allocating from disk - we simply
> > +# overwrite them with the new inode information.
> > +_scratch_mkfs_xfs -m crc=0 >> $seqres.full 2>&1
> > +blksz=$(_scratch_xfs_get_sb_field blocksize)
> > +agcount=$(_scratch_xfs_get_sb_field agcount)
> > +
> > +_scratch_mount
> > +# Create a directory for later allocation in the same AG (AG 0, because this
> > +# is an empty XFS for now)
> > +mkdir $SCRATCH_MNT/dir
> > +
> > +# Allocate 1 block for testfile
> > +$XFS_IO_PROG -fc "pwrite 0 $blksz" -c fsync $SCRATCH_MNT/dir/testfile >> $seqres.full
> > +_scratch_unmount
> > +
> > +# We only have one file in one directory (generally in AG 0), so only one
> > +# AG has free inodes (XFS allocates inodes in chunks of 64). The freecount
> > +# of the AG that holds testfile should therefore not be 0.
> > +for ((agi=0; agi<agcount; agi++)); do
> > +	freecount=$(_scratch_xfs_get_metadata_field freecount "agi $agi")
> > +	if [ "$freecount" != "0" ]; then
> > +		break
> > +	fi
> > +done
> > +# Make sure we found the AG that contains the testfile
> > +if [ $agi -ge $agcount ]; then
> > +	_fail "Can't find which AG contains testfile"
> > +fi
> 
> Can't we figure out which AG the testfile inode is in from the inode
> number directly?

Sure, thanks for telling me how to do that :)
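
As a rough sketch of the arithmetic behind that (not what the patch does
today): the AG number is the inode number shifted right by
agblklog + inopblog, both of which are superblock fields. The helper name
and the example values below are made up for illustration; the xfs_db
"convert inode <ino> agno" command shown further down in this mail gives
the same answer from the real superblock.

```shell
# Hypothetical helper, not part of fstests: map an inode number to its AG
# using only shift arithmetic.
# Assumption: agblklog and inopblog have been read from the superblock,
# e.g. with _scratch_xfs_get_sb_field as the test already does for blocksize.
ino_to_agno()
{
	local ino=$1 agblklog=$2 inopblog=$3

	# The AG number is whatever is left above the per-AG inode bits.
	echo $((ino >> (agblklog + inopblog)))
}

# Example values only: agblklog=16 and inopblog=4 are made up here.
ino_to_agno 1028 16 4
```

With that, the freecount loop over every AGI could probably go away, since
the test would know the AG of testfile up front.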

> 
> > +# Because we only allocate 1 block for testfile, and it is the only data
> > +# block we use, the inobt has a single level. So ${agi}->root->recs[1]
> > +# should be the only record, and it points to the chunk which contains
> > +# testfile's inode.
> > +# An example of an inode record is as below:
> > +#   recs[1] = [startino,freecount,free] 1:[1024,59,0xffffffffffffffe0]
> > +freecount=$(_scratch_xfs_get_metadata_field "recs[1].freecount" \
> > +					    "agi $agi" "addr root")
> > +fmask=$(_scratch_xfs_get_metadata_field "recs[1].free" "agi $agi" "addr root")
> > +
> > +# fmask shift right 1 bit, and freecount++, to mark testfile inode as free in
> > +# inobt. (But the inode itself isn't freed, it still has allocated block)
> > +freecount="$((freecount + 1))"
> > +fmask="$((fmask / 2))"
> 
> TBH I was expecting this to find testfile's inode number, set
> freecount=1, and then reset the freemask so that testfile is the only
> free inode in the chunk, thereby forcing(?) the next allocation to end
> up with testfile's inode and reproduce the crash.  Not sure why we're
> shifting right by one bit?
> 
> tldr: I'm confused :)

Hmmm... I'm a little confused here. Do you mean this:
  # stat -c %i /mnt/test/dir/testfile 
  1028
  # umount $dev
  # xfs_db -x $dev
  xfs_db> inode 1028
  xfs_db> convert inode 1028 agno
  0x0 (0)
  xfs_db> agi 0
  xfs_db> addr root
  xfs_db> p
  magic = 0x49414254
  level = 0
  numrecs = 1
  leftsib = null
  rightsib = null
  recs[1] = [startino,freecount,free] 1:[1024,59,0xffffffffffffffe0]
  xfs_db> write recs[1].startino 1028
  recs[1].startino = 1028
  xfs_db> write recs[1].freecount 1
  recs[1].freecount = 1
  xfs_db> write recs[1].free 1
  recs[1].free = 0x1
  xfs_db> q

But after mounting this XFS again and trying `touch /mnt/test/dir/newfile`,
I got this warning:

[47420.479191] XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_ialloc.c, line: 1156
[47420.520226] ------------[ cut here ]------------
[47420.543399] WARNING: CPU: 13 PID: 2267 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
....
[47421.791340] XFS (dm-2): Internal error XFS_WANT_CORRUPTED_GOTO at line 1156 of file fs/xfs/libxfs/xfs_ialloc.c.  Caller xfs_dialloc_ag+0x6e/0x360 [xfs]
....

Hmm... I'm confused.
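
For reference, a sketch of the mask arithmetic I'd assume for the "only
testfile is free" case: bit N of recs[1].free stands for inode
startino + N, so with startino left at 1024 the bit for inode 1028 is
bit 4. The helper name is hypothetical, and this is untested against the
kernel; other counters (e.g. the AGI freecount) may also need to agree.

```shell
# Hypothetical helper: build a free mask with exactly one bit set, the bit
# for inode $2 inside the 64-inode chunk that starts at inode $1.
single_free_mask()
{
	local startino=$1 ino=$2
	local bit=$((ino - startino))

	# A chunk record covers 64 inodes; reject anything outside it.
	if [ "$bit" -lt 0 ] || [ "$bit" -gt 63 ]; then
		echo "inode $ino is not in the chunk at $startino" >&2
		return 1
	fi
	printf '0x%x\n' $((1 << bit))
}

# Values taken from the xfs_db session above: startino 1024, testfile 1028.
single_free_mask 1024 1028
```

That would mean writing the printed value to recs[1].free and 1 to
recs[1].freecount, without touching recs[1].startino.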

> 
> > +_scratch_xfs_set_metadata_field "recs[1].freecount" "$freecount" \
> > +				"agi $agi" "addr root" >/dev/null
> > +_scratch_xfs_set_metadata_field "recs[1].free" "$fmask" \
> > +				"agi $agi" "addr root" >/dev/null
> > +
> > +# Mount again and create a new inode to cover the inode we just 'freed' in the inobt
> > +_scratch_mount
> > +$XFS_IO_PROG -fc "pwrite 0 $blksz" -c fsync $SCRATCH_MNT/dir/newfile 2>&1 | \
> > +	grep -i "Structure needs cleaning" | _filter_scratch
> 
> How often does this fail to allocate the inode we've messed with?

Every time in my tests.

Thanks,
Zorro.

> 
> --D
> 
> > +
> > +# filter the intentional internal errors
> > +_check_dmesg _filter_dmesg
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/xfs/444.out b/tests/xfs/444.out
> > new file mode 100644
> > index 00000000..2daaf2fc
> > --- /dev/null
> > +++ b/tests/xfs/444.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 444
> > +SCRATCH_MNT/dir/newfile: Structure needs cleaning
> > diff --git a/tests/xfs/group b/tests/xfs/group
> > index e2397fe6..831f2cfa 100644
> > --- a/tests/xfs/group
> > +++ b/tests/xfs/group
> > @@ -441,3 +441,4 @@
> >  441 auto quick clone quota
> >  442 auto stress clone quota
> >  443 auto quick ioctl fsr
> > +444 auto quick
> > -- 
> > 2.14.3
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe fstests" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html