Re: [PATCH 1/1] xfs: check for COW overflows in i_delayed_blks

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Fri, 7 Jun 2019 11:19:26 -0700

On Fri, Jun 07, 2019 at 12:23:48PM -0400, Brian Foster wrote:
> On Fri, Jun 07, 2019 at 08:53:09AM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 07, 2019 at 07:24:33AM -0400, Brian Foster wrote:
> > > On Thu, Jun 06, 2019 at 01:02:49PM -0700, Darrick J. Wong wrote:
> > > > On Thu, Jun 06, 2019 at 03:14:50PM -0400, Brian Foster wrote:
> > > > > On Tue, Jun 04, 2019 at 02:17:10PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > 
> > > > > > With the new copy on write functionality it's possible to reserve so
> > > > > > much COW space for a file that we end up overflowing i_delayed_blks.
> > > > > > The only user-visible effect of this is to cause totally wrong i_blocks
> > > > > > output in stat, so check for that.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > ---
> > > > > >  tests/xfs/907     |  223 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  tests/xfs/907.out |    8 ++
> > > > > >  tests/xfs/group   |    1 
> > > > > >  3 files changed, 232 insertions(+)
> > > > > >  create mode 100755 tests/xfs/907
> > > > > >  create mode 100644 tests/xfs/907.out
> > > > > > 
> > > > > > 
> > > > > > diff --git a/tests/xfs/907 b/tests/xfs/907
> > > > > > new file mode 100755
> > > > > > index 00000000..d85f12da
> > > > > > --- /dev/null
> > > > > > +++ b/tests/xfs/907
> > > > > > @@ -0,0 +1,223 @@
> > > > > > +#! /bin/bash
> > > > > > +# SPDX-License-Identifier: GPL-2.0+
> > > > > > +# Copyright (c) 2019 Oracle, Inc.  All Rights Reserved.
> > > > > > +#
> > > > > > +# FS QA Test No. 907
> > > > > > +#
> > > > > > +# Try to overflow i_delayed_blks by setting the largest cowextsize hint
> > > > > > +# possible, creating a sparse file with a single byte every cowextsize bytes,
> > > > > > +# reflinking it, and retouching every written byte to see if we can create
> > > > > > +# enough speculative COW reservations to overflow i_delayed_blks.
> > > > > > +#
> > > > > > +seq=`basename $0`
> > > > > > +seqres=$RESULT_DIR/$seq
> > > > > > +echo "QA output created by $seq"
> > > > > > +
> > > > > > +here=`pwd`
> > > > > > +tmp=/tmp/$$
> > > > > > +status=1	# failure is the default!
> > > > > > +trap "_cleanup; exit \$status" 0 1 2 3 7 15
> > > > > > +
> > > > > > +_cleanup()
> > > > > > +{
> > > > > > +	cd /
> > > > > > +	test -n "$loop_mount" && $UMOUNT_PROG $loop_mount > /dev/null 2>&1
> > > > > > +	test -n "$loop_dev" && _destroy_loop_device $loop_dev
> > > > > > +	rm -rf $tmp.*
> > > > > > +}
> > > > > > +
> > > > > > +# get standard environment, filters and checks
> > > > > > +. ./common/rc
> > > > > > +. ./common/reflink
> > > > > > +. ./common/filter
> > > > > > +
> > > > > > +# real QA test starts here
> > > > > > +_supported_os Linux
> > > > > > +_supported_fs xfs
> > > > > > +_require_scratch_reflink
> > > > > > +_require_cp_reflink
> > > > > > +_require_loop
> > > > > > +_require_xfs_debug	# needed for xfs_bmap -c
> > > > > > +
> > > > > > +MAXEXTLEN=2097151	# cowextsize can't be more than MAXEXTLEN
> > > > > > +
> > > > > > +echo "Format and mount"
> > > > > > +_scratch_mkfs > "$seqres.full" 2>&1
> > > > > > +_scratch_mount
> > > > > > +
> > > > > > +# Create a huge sparse filesystem on the scratch device because that's what
> > > > > > +# we're going to need to guarantee that we have enough blocks to overflow in
> > > > > > +# the first place.  We need to have at least enough free space on that huge fs
> > > > > > +# to handle one written block every MAXEXTLEN blocks and to reserve 2^32 blocks
> > > > > > +# in the COW fork.  There needs to be sufficient space in the scratch
> > > > > > +# filesystem to handle a 256M log, all the per-AG metadata, and all the data
> > > > > > +# written to the test file.
> > > > > > +#
> > > > > > +# Worst case, a 64k-block fs needs to be about 300TB.  Best case, a 1k block
> > > > > > +# filesystem needs ~5TB.  For the most common 4k case we only need a ~20TB fs.
> > > > > > +#
> > > > > > +# nr_cows is the number of writes we make to the filesystem.
> > > > > > +# blks_needed is the number of free blocks we need in the fs to trigger the
> > > > > > +#     overflow.  The destination file needs to have more than 2^32 blocks
> > > > > > +#     reserved for COW, and the source file needs to have 1 block written every
> > > > > > +#     MAXEXTLEN blocks (i.e. 2^32/MAXEXTLEN blocks) to set up the destination
> > > > > > +#     file.
> > > > > > +# loop_file_sz is the size of the filesystem we have to create to produce the
> > > > > > +#     overflow problems.  This is 20% more than blks_needed, and rounded to the
> > > > > > +#     nearest 512b so losetup doesn't whine.
> > > > > > +# est_nr_ags is a guess at the AG count, assuming that we'll have one AG per
> > > > > > +#     terabyte (which assumes that we need a multi-terabyte filesystem for this
> > > > > > +#     test).  We assume we'll need 16 blocks for AG metadata and 128K for inode
> > > > > > +#     chunks.
> > > > > > +# space_reqd_kb is the size of the huge sparse filesystem --
> > > > > > +#     256M for the log, per-AG metadata, and 10% more space than however much
> > > > > > +#     we will write to both test files.
> > > > > > +blksz=$(_get_file_block_size "$SCRATCH_MNT")
> > > > > > +nr_cows="$(( ((2 ** 32) / MAXEXTLEN) + 100 ))"
> > > > > > +blks_needed="$(( nr_cows * (1 + MAXEXTLEN) ))"
> > > > > > +loop_file_sz="$(( ((blksz * blks_needed) * 12 / 10) / 512 * 512 ))"
> > > > > > +est_nr_ags="$(( (loop_file_sz / (2 ** 40)) + 1 ))"
> > > > > > +est_ag_space_kb="$(( (est_nr_ags * 16 * blksz / 1024) + 128 ))"
> > > > > > +est_file_space_kb="$(( (((nr_cows * blksz) * 11 / 10) / 1024) ))"
> > > > > > +space_reqd_kb="$(( 262144 + est_ag_space_kb + 2 * est_file_space_kb ))"
> > > > > > +cat >> $seqres.full << ENDL
> > > > > > +blksz: $blksz
> > > > > > +nr_cows: $nr_cows
> > > > > > +blks_needed: $blks_needed
> > > > > > +loop_file_sz: $loop_file_sz
> > > > > > +est_nr_ags: $est_nr_ags
> > > > > > +est_ag_space_kb: $est_ag_space_kb
> > > > > > +est_file_space_kb: $est_file_space_kb
> > > > > > +space_reqd_kb: $space_reqd_kb
> > > > > > +ENDL
> > > > > > +_require_fs_space $SCRATCH_MNT $space_reqd_kb
> > > > > > +
> > > > > 
> > > > > This seems like it would be simpler to just create the worst case sparse
> > > > > fs (say 300TB or so) then make sure the scratch device has enough free
> > > > > blocks to accommodate the number of COWs (plus slop) that need to occur
> > > > > in the loop fs to overflow the counter. I don't have a 64k box handy,
> > > > > but the metadata size difference between a sparse 20TB fs and 300TB fs
> > > > > over a 4k scratch fs is only a few MB. Hm?
> > > > 
> > > > <shrug> Originally it did just hardcode 300T, and Eryu complained that
> > > > xfs_db would OOM when it tried to deal with a 300T filesystem.  I
> > > > changed the test to avoid xfs_db, but then it occurred to me (running on
> > > > a slow arm64 box) that the mkfs and xfs_repair runtimes could be cut
> > > > down considerably if the loop fs was only as large as it needed to be.
> > > > 
> > > 
> > > Interesting.. do you recall how much of a difference it made? I don't
> > > see much more than a second or two difference on my test vm, but I
> > > haven't tried any slow arm or 64k page boxes.
> > 
> > It was ~15s on x64 (20T), ~180s on arm64 (20T) and ~300s w/ 64k blocks.
> > Granted the slow arm box is a rpi3b+ with 1G RAM in 64bit mode which is
> > probably very stupid.
> > 
> > Oh, hey, qemu arm64 is about half as slow.
> > 
> 
> Heh, Ok.

Granted it also fails to boot 40% of the time because <unknown>...

...and sometimes it only boots if I bang Enter on the serial console.

> > > Even then, that could be accomplished with a simple calculation of the
> > > 2^32 block requirement multiplied against the block size plus some slop
> > > value like 1GB without significantly changing timing. I.e., the whole
> > > big hunk above could be reduced to something like:
> > > 
> > > # Calculate the fs size based on block size and the number of delalloc blocks we
> > > # must be able to reserve to overflow. Add some room to account for fs
> > > # metadata overhead and round to 512b so losetup doesn't complain.
> > > nr_cows="$(( ( (2 ** 32) / MAXEXTLEN ) + 100 ))"
> > > loop_file_sz=$(( (nr_cows * MAXEXTLEN * blksz * 12 / 10) / 512 * 512 ))
> > > 
> > > From there, we could either push the _require_fs_space() down after the
> > > mkfs:
> > > 
> > > # Now that we've formatted the filesystem, make sure the scratch fs has enough
> > > # room for $nr_cows blocks in two files, again allowing for some overhead.
> > > _require_fs_space $SCRATCH_MNT $(( ( 2 * nr_cows * blksz * 12 / 10 ) / 1024 ))
> > > 
> > > Or if that's too hacky, just enforce a worst case value for the fs
> > > metadata portion of the required space. As mentioned, the difference I
> > > see between 20T and 300T on a 4k scratch dev is ~6MB. fiemap shows most
> > > of that made up as 40 512b extents, which is 20kb. Extrapolate that out
> > > to a 64k scratch fs and we come in under 20MB. ISTM we could just take
> > > add something like 500MB/1GB to the _require_fs_space() calculation
> > > above to account for the 256MB log + all other fs structure and still
> > > allow the test to run in most cases.
> > > 
> > > Hmmmm.. unless I'm miscalculating, even the nr_cows portion of the
> > > required space only comes out to ~320MB with a 64k scratch dev. Perhaps
> > > we could skip all of this (except for the sizing of the loop file) and
> > > just '_require_fs_space <1GB>' at the top of the test with a couple
> > > lines of comment to document the worst case calculation. That leaves
> > > plenty of extra space and is still small enough that it likely won't
> > > filter out in most test envs. I also don't think we need to push
> > > anything more than the resulting mkfs output to the $seqres.full file,
> > > so altogether that allows us to remove almost a screen full of
> > > calculation cruft from the test (if you account for the huge comment
> > > required to document the calculations).
> > 
> > In practice I noticed that the space usage usually came out around
> > ~780MB for the 300T filesystem, with rmap and reflink both enabled.  I'm
> > tempted just to assume that 1G of space on the scratch fs is fine
> > though.  Maybe I'll just hardwire it to 1.2G and leave it.
> > 
> 
> That includes the log as well? If so, 1.2G sounds reasonable to me.

Yep. du -ms $TEST_DIR/a.img

--D

> Brian
> 
> > > > > > +loop_file=$SCRATCH_MNT/a.img
> > > > > > +loop_mount=$SCRATCH_MNT/a
> > > > > > +$XFS_IO_PROG -f -c "truncate $loop_file_sz" $loop_file
> > > > > > +loop_dev=$(_create_loop_device $loop_file)
> > > > > > +
> > > > > > +# Now we have to create the source file.  The goal is to overflow a 32-bit
> > > > > > +# i_delayed_blks, which means that we have to create at least that many delayed
> > > > > > +# allocation block reservations.  Take advantage of the fact that a cowextsize
> > > > > > +# hint causes creation of large speculative delalloc reservations in the cow
> > > > > > +# fork to reduce the amount of work we have to do.
> > > > > > +#
> > > > > > +# The maximum cowextsize can only be set to MAXEXTLEN fs blocks on a filesystem
> > > > > > +# whose AGs each have more than MAXEXTLEN * 2 blocks.  This we can do easily
> > > > > > +# with a multi-terabyte filesystem, so start by setting up the hint.  Note that
> > > > > > +# the current fsxattr interface specifies its u32 cowextsize hint in units of
> > > > > > +# bytes and therefore can't handle MAXEXTLEN * blksz on most filesystems, so we
> > > > > > +# set it via mkfs because mkfs takes units of fs blocks, not bytes.
> > > > > > +
> > > > > > +_mkfs_dev -d cowextsize=$MAXEXTLEN -l size=256m $loop_dev >> $seqres.full
> > > > > > +mkdir $loop_mount
> > > > > > +mount $loop_dev $loop_mount
> > > > > > +
> > > > > > +echo "Create crazy huge file"
> > > > > > +huge_file="$loop_mount/a"
> > > > > > +touch "$huge_file"
> > > > > > +blksz=$(_get_file_block_size "$loop_mount")
> > > > > > +extsize_bytes="$(( MAXEXTLEN * blksz ))"
> > > > > > +
> > > > > > +# Make sure it actually set a hint.
> > > > > > +curr_cowextsize_str="$($XFS_IO_PROG -c 'cowextsize' "$huge_file")"
> > > > > > +echo "$curr_cowextsize_str" >> $seqres.full
> > > > > > +cowextsize_bytes="$(echo "$curr_cowextsize_str" | sed -e 's/^.\([0-9]*\).*$/\1/g')"
> > > > > > +test "$cowextsize_bytes" -eq 0 && echo "could not set cowextsize?"
> > > > > > +
> > > > > > +# Now we have to seed the file with sparse contents.  Remember, the goal is to
> > > > > > +# create a little more than 2^32 delayed allocation blocks in the COW fork with
> > > > > > +# as little effort as possible.  We know that speculative COW preallocation
> > > > > > +# will create MAXEXTLEN-length reservations for us, so that means we should
> > > > > > +# be able to get away with touching a single byte every extsize_bytes.  We
> > > > > > +# do this backwards to avoid having to move EOF.
> > > > > > +seq $nr_cows -1 0 | while read n; do
> > > > > > +	off="$((n * extsize_bytes))"
> > > > > > +	$XFS_IO_PROG -c "pwrite $off 1" "$huge_file" > /dev/null
> > > > > > +done
> > > > > > +
> > > > > > +echo "Reflink crazy huge file"
> > > > > > +_cp_reflink "$huge_file" "$huge_file.b"
> > > > > > +
> > > > > > +# Now that we've shared all the blocks in the file, we touch them all again
> > > > > > +# to create speculative COW preallocations.
> > > > > > +echo "COW crazy huge file"
> > > > > > +seq $nr_cows -1 0 | while read n; do
> > > > > > +	off="$((n * extsize_bytes))"
> > > > > > +	$XFS_IO_PROG -c "pwrite $off 1" "$huge_file" > /dev/null
> > > > > > +done
> > > > > > +
> > > > > > +# Compare the number of blocks allocated to this file (as reported by stat)
> > > > > > +# against the number of blocks that are in the COW fork.  If either one is
> > > > > > +# less than 2^32 then we have evidence of an overflow problem.
> > > > > > +echo "Check crazy huge file"
> > > > > > +allocated_stat_blocks="$(stat -c %b "$huge_file")"
> > > > > > +stat_blksz="$(stat -c %B "$huge_file")"
> > > > > > +allocated_fsblocks=$(( allocated_stat_blocks * stat_blksz / blksz ))
> > > > > > +
> > > > > > +# Make sure we got enough COW reservations to overflow a 32-bit counter.
> > > > > > +
> > > > > > +# Return the number of delalloc & real blocks given bmap output for a fork of a
> > > > > > +# file.  Output is in units of 512-byte blocks.
> > > > > > +count_fork_blocks() {
> > > > > > +	$AWK_PROG "
> > > > > > +{
> > > > > > +	if (\$3 == \"delalloc\") {
> > > > > > +		x += \$4;
> > > > > > +	} else if (\$3 == \"hole\") {
> > > > > > +		;
> > > > > > +	} else {
> > > > > > +		x += \$6;
> > > > > > +	}
> > > > > > +}
> > > > > > +END {
> > > > > > +	print(x);
> > > > > > +}
> > > > > > +"
> > > > > > +}
> > > > > > +
> > > > > > +# Count the number of blocks allocated to a file based on the xfs_bmap output.
> > > > > > +# Output is in units of filesystem blocks.
> > > > > > +count_file_fork_blocks() {
> > > > > > +	local tag="$1"
> > > > > > +	local file="$2"
> > > > > > +	local args="$3"
> > > > > > +
> > > > > > +	$XFS_IO_PROG -c "bmap $args -l -p -v" "$huge_file" > $tmp.extents
> > > > > > +	echo "$tag fork map" >> $seqres.full
> > > > > > +	cat $tmp.extents >> $seqres.full
> > > > > > +	local sectors="$(count_fork_blocks < $tmp.extents)"
> > > > > > +	echo "$(( sectors / (blksz / 512) ))"
> > > > > > +}
> > > > > > +
> > > > > > +cowblocks=$(count_file_fork_blocks cow "$huge_file" "-c")
> > > > > > +attrblocks=$(count_file_fork_blocks attr "$huge_file" "-a")
> > > > > > +datablocks=$(count_file_fork_blocks data "$huge_file" "")
> > > > > > +
> > > > > > +# Did we create more than 2^32 blocks in the cow fork?
> > > > > > +echo "datablocks is $datablocks" >> $seqres.full
> > > > > > +echo "attrblocks is $attrblocks" >> $seqres.full
> > > > > > +echo "cowblocks is $cowblocks" >> $seqres.full
> > > > > > +test "$cowblocks" -lt $((2 ** 32)) && \
> > > > > > +	echo "cowblocks (${cowblocks}) should be more than 2^32!"
> > > > 
> > > > This part checks that the test did what we expect.
> > > > 
> > > 
> > > Ok, I guess I'd just tweak the one liner comments to explain the purpose
> > > of each of these checks rather than reiterate the code. E.g., for the
> > > check above:
> > > 
> > > # did cow speculative prealloc reserve enough to overflow?
> > > 
> > > > > > +
> > > > > > +# Does stat's block allocation count exceed 2^32?
> > > 
> > > # detect delalloc count overflow via the stat data
> > > 
> > > > > > +echo "stat blocks is $allocated_fsblocks" >> $seqres.full
> > > > > > +test "$allocated_fsblocks" -lt $((2 ** 32)) && \
> > > > > > +	echo "stat blocks (${allocated_fsblocks}) should be more than 2^32!"
> > > > 
> > > > This part detects the problem in the incore state.
> > > > 
> > > > > > +# Finally, does st_blocks match what we computed from the forks?
> > > 
> > > # sanity check the values computed from the forks
> > 
> > Comments added.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > > > +expected_allocated_fsblocks=$((datablocks + cowblocks + attrblocks))
> > > > > > +echo "expected stat blocks is $expected_allocated_fsblocks" >> $seqres.full
> > > > > > +
> > > > > > +_within_tolerance "st_blocks" $allocated_fsblocks $expected_allocated_fsblocks 2% -v
> > > > 
> > > > And this one on is a second sanity check that the test did roughly what
> > > > we expected.
> > > > 
> > > > > > +
> > > > > 
> > > > > Similar question with the post-processing stuff... how much of this is
> > > > > to detect the problem vs. determine the test sequence did what we
> > > > > expect?
> > > > 
> > > > (Answered above.)
> > > > 
> > > > > Does the fs check below report corruption or is it purely
> > > > > in-core state that ends up broken?
> > > > 
> > > > It's only the incore state that ends up broken.
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > +echo "Test done"
> > > > > > +# Quick check the large sparse fs, but skip xfs_db because it doesn't scale
> > > > > > +# well on a multi-terabyte filesystem.
> > > > > > +LARGE_SCRATCH_DEV=yes _check_xfs_filesystem $loop_dev none none
> > > > > > +
> > > > > > +# success, all done
> > > > > > +status=0
> > > > > > +exit
> > > > > > diff --git a/tests/xfs/907.out b/tests/xfs/907.out
> > > > > > new file mode 100644
> > > > > > index 00000000..cc07d659
> > > > > > --- /dev/null
> > > > > > +++ b/tests/xfs/907.out
> > > > > > @@ -0,0 +1,8 @@
> > > > > > +QA output created by 907
> > > > > > +Format and mount
> > > > > > +Create crazy huge file
> > > > > > +Reflink crazy huge file
> > > > > > +COW crazy huge file
> > > > > > +Check crazy huge file
> > > > > > +st_blocks is in range
> > > > > > +Test done
> > > > > > diff --git a/tests/xfs/group b/tests/xfs/group
> > > > > > index ffe4ae12..e528c559 100644
> > > > > > --- a/tests/xfs/group
> > > > > > +++ b/tests/xfs/group
> > > > > > @@ -504,3 +504,4 @@
> > > > > >  504 auto quick mkfs label
> > > > > >  505 auto quick spaceman
> > > > > >  506 auto quick health
> > > > > > +907 clone
> > > > > >