On Sat, Aug 24, 2024 at 11:31 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
> [BUG]
> There is a use-after-free bug triggered very randomly by btrfs/125.
>
> If KASAN is enabled it can be triggered on certain setup.
> Or it can lead to crash.
>
> [CAUSE]
> The test case btrfs/125 is using RAID5 for metadata, which has a known
> RMW problem if the there is some corruption on-disk.
>
> RMW will use the corrupted contents to generate a new parity, losing the
> final chance to rebuild the contents.
>
> This is specific to metadata, as for data we have extra data checksum,
> but the metadata has extra problems like possible deadlock due to the
> extra metadata read/recovery needed to search the extent tree.
>
> And we know this problem for a while but without a better solution other
> than avoid using RAID56 for metadata:
>
> > Metadata
> >     Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3
> >     respectively.
>
> Combined with the above csum tree corruption, since RAID5 is stripe
> based, btrfs needs to split its read bios according to stripe boundary.
> And after a split, do a csum tree lookup for the expected csum.

This is way too much unrelated stuff. The problem may have been
triggered sporadically by btrfs/125, but there's no need to go into
details about raid5 problems in btrfs, especially since the
use-after-free bug can be triggered without raid5, just by using raid0
as in the test case introduced by this patch.

> But if that csum lookup failed, in the error path btrfs doesn't handle
> the split bios properly and lead to double freeing of the original bio
> (the one containing the bio vectors).
>
> [NEW TEST CASE]
> Unlike the original btrfs/125, which is very random and picky to
> reproduce, introduce a new test case to verify the specific behavior by:
>
> - Create a btrfs with enough csum leaves
>   To bump the csum tree level, use the minimal nodesize possible (4K).
>   Writing 32M data which needs at least 8 leaves for data checksum
>
> - Find the last csum tree leave and corrupt it
>
> - Read the data many times until we trigger the bug or exit gracefully
>   With an x86_64 VM (which is never able to trigger btrfs/125 failure)
>   with KASAN enabled, it can trigger the KASAN report in just 4
>   iterations (the default iteration number is 32).
>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
> Changelog:
> v2:
> - Fix the wrong commit hash
>   The proper fix is not yet merged, the old hash is a place holder
>   copied from another test case but forgot to remove.
>
> - Minor wording update
>
> - Add to "dangerous" group
> ---
>  tests/btrfs/319     | 84 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/319.out |  2 ++
>  2 files changed, 86 insertions(+)
>  create mode 100755 tests/btrfs/319
>  create mode 100644 tests/btrfs/319.out
>
> diff --git a/tests/btrfs/319 b/tests/btrfs/319
> new file mode 100755
> index 00000000..4be2b50b
> --- /dev/null
> +++ b/tests/btrfs/319

There's already a btrfs/319 test case in for-next btw. This needs to be
renumbered.

> @@ -0,0 +1,84 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 319
> +#
> +# Make sure data csum lookup failure will not lead to double bio freeing
> +#
> +. ./common/preamble
> +_begin_fstest auto quick dangerous

Missing the "raid" group.

> +
> +_require_scratch
> +_fixed_by_kernel_commit xxxxxxxxxxxx \
> +	"btrfs: fix a use-after-free bug when hitting errors inside btrfs_submit_chunk()"
> +
> +# The final fs will have a corrupted csum tree, which will never pass fsck
> +_require_scratch_nocheck
> +_require_scratch_dev_pool 2
> +
> +# Use RAID0 for data to get bio splitted according to stripe boundary.
> +# This is required to trigger the bug.
> +_check_btrfs_raid_type raid0
> +
> +# This test goes 4K sectorsize and 4K nodesize, so that we easily create
> +# higher level of csum tree.
> +_require_btrfs_support_sectorsize 4096
> +
> +# The bug itself has a race window, run this many times to ensure triggering.
> +# On an x86_64 VM with KASAN enabled, normally it is triggered before the 10th run.
> +runtime=32

This is a confusing name, because it actually means the number of
iterations of the for loop below, not a time duration. I would suggest
renaming it to "iterations" for example, or just getting rid of it since
it's only used in the for loop's condition.

> +
> +_scratch_pool_mkfs "-d raid0 -m single -n 4k -s 4k" >> $seqres.full 2>&1
> +# This test requires data checksum to trigger the bug.
> +_scratch_mount -o datasum,datacow
> +
> +# For the smallest csum size (CRC32C) it's 4 bytes per 4K, writing 32M data
> +# will need 32K data checksum at minimal, which is at least 8 leaves.
> +_pwrite_byte 0xef 0 32m "$SCRATCH_MNT/foobar" > /dev/null
> +sync

What's this sync for? We just do an unmount right after it, which makes
it pointless and confusing.

> +_scratch_unmount
> +
> +# Search for the last leaf of the csum tree, that will be the target to destroy.
> +$BTRFS_UTIL_PROG inspect dump-tree -t csum $SCRATCH_DEV >> $seqres.full

Please use the full command name: inspect -> inspect-internal

> +target_bytenr=$($BTRFS_UTIL_PROG inspect dump-tree -t csum $SCRATCH_DEV | grep "leaf.*flags" | sort | tail -n1 | cut -f2 -d\ )

Same here. Also, this is missing at the top:

_require_btrfs_command inspect-internal dump-tree

Also we're passing the symbolic name "csum" to -t, which not all
versions of btrfs-progs support. The support was added in btrfs-progs
4.5 (commit 69874af7b81519e40db9d92efa6beebee4220c63).

> +
> +if [ -z "$target_bytenr" ]; then
> +	_fail "unable to locate the last csum tree leave"

leave -> leaf

> +fi
> +
> +echo "bytenr of csum tree leave to corrupt: $target_bytenr" >> $seqres.full

leave -> leaf

> +# Corrupt that csum tree block.
> +physical=$(_btrfs_get_physical "$target_bytenr" 1)
> +dev=$(_btrfs_get_device_path "$target_bytenr" 1)
> +
> +echo "physical bytenr: $physical" >> $seqres.full
> +echo "physical device: $dev" >> $seqres.full
> +
> +_pwrite_byte 0x00 "$physical" 4 "$dev" > /dev/null
> +
> +for (( i = 0; i < $runtime; i++ )); do
> +	echo "=== run $i/$runtime ===" >> $seqres.full
> +	_scratch_mount -o ro
> +	# Since the data is on RAID0, read bios will be split at the stripe
> +	# (64K sized) boundary. If csum lookup failed due to corrupted csum
> +	# tree, there is a race window that can lead to double bio freeing
> +	# (triggering KASAN at least).
> +	cat "$SCRATCH_MNT/foobar" &> /dev/null
> +	_scratch_unmount
> +
> +	# Manually check the dmesg for "BUG", and do not call _check_dmesg()
> +	# as it will clear 'check_dmesg' file and skips the final check after
> +	# the test.
> +	# For now just focus on the "BUG:" line from KASAN.
> +	if _check_dmesg_for "BUG" ; then
> +		_fail "Critical error(s) found in dmesg"
> +	fi

Why do the check manually? The check script calls _check_dmesg when a
test finishes, and if it finds 'BUG:' there, it will make the test fail.
So there's no need to do the unmount and call _check_dmesg_for.

Thanks.

> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/319.out b/tests/btrfs/319.out
> new file mode 100644
> index 00000000..d40c929a
> --- /dev/null
> +++ b/tests/btrfs/319.out
> @@ -0,0 +1,2 @@
> +QA output created by 319
> +Silence is golden
> --
> 2.46.0
>
>
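P.S.: as an aside on the target_bytenr line, here's a quick sketch of what
that grep/sort/cut pipeline extracts, run against a made-up two-leaf sample
of dump-tree output (the bytenr values are hypothetical; -t 7 is the numeric
csum tree id, which works on btrfs-progs older than 4.5 too):

```shell
# Hypothetical sample of leaf header lines as printed by
# "btrfs inspect-internal dump-tree -t 7"; values are made up and only
# serve to illustrate the pipeline used by the test.
sample='leaf 30408704 flags 0x1(WRITTEN) backref revision 1
leaf 30412800 flags 0x1(WRITTEN) backref revision 1'

# Same pipeline as in the test: keep the leaf header lines, take the
# last one after sorting, and cut out field 2 (the logical bytenr).
target_bytenr=$(echo "$sample" | grep "leaf.*flags" | sort | tail -n1 | cut -f2 -d' ')
echo "$target_bytenr"
```

On that sample it prints 30412800, i.e. the last csum tree leaf, which is
then fed to _btrfs_get_physical/_btrfs_get_device_path for the corruption.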