Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Aug 20, 2018 at 2:09 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> [cc linux-fsdevel now, too]
>
> On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
>> [cc linux-xfs@xxxxxxxxxxxxxxx]
>>
>> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdmanana@xxxxxxxxxx wrote:
>> > From: Filipe Manana <fdmanana@xxxxxxxx>
>> >
>> > Test that deduplication of an entire file that has a size that is not
>> > aligned to the filesystem's block size into a different file does not
>> > corrupt the destination's file data.
>
> Ok, I've looked at this now. My first question is where did all the
> magic offsets in this test come from? i.e. how was this bug
> found and who is it affecting?

I found it myself. I'm not aware of any users or applications affected by it.

>
>> > This test is motivated by a bug found in Btrfs which is fixed by the
>> > following patch for the linux kernel:
>> >
>> >   "Btrfs: fix data corruption when deduplicating between different files"
>> >
>> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> > with the same corruption as in Btrfs - some bytes of a block get replaced
>> > with zeroes after the deduplication.
>>
>> Filipe, in future can please report XFS bugs you find to the XFS
>> list the moment you find them. We shouldn't ever find out about a
>> data corruption bug we need to fix via a "oh, by the way" comment in
>> a commit message for a regression test....
>
> This becomes much more relevant because of what I've just found....
>
> .....
>
>> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
>> > +# a multiple of the block size.
>> > +$XFS_IO_PROG -f \
>> > +   -c "pwrite -S 0x6b 0 2518890" \
>> > +   -c "pwrite -S 0xae 2518890 102398" \
>> > +   $SCRATCH_MNT/foo | _filter_xfs_io
>> > +
>> > +# Create a second file with a length not aligned to the block size, whose bytes
>> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
>> > +# first file.
>> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
>> > +
>> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
>> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> > +echo "File content before deduplication:"
>> > +od -t x1 $SCRATCH_MNT/foo
>
> Please use "od -Ad -t x1 <file>" so the file offsets reported by od
> match the offsets used in the test (i.e. in decimal, not octal).

Will do, in the next test version after agreement on the fix.

>
>> > +
>> > +# Now deduplicate the entire second file into a range of the first file that
>> > +# also has all bytes with the value 0x6b. The destination range's end offset
>> > +# must not be aligned to the block size and must be less then the offset of
>> > +# the first byte with the value 0xae (byte at offset 2518890).
>> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
>> > +   | _filter_xfs_io
>
> Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
> to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
> this check:
>
>         /* Only reflink if we're aligned to block boundaries */
>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
>             !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
>                 return -EINVAL;
>
> And it's pretty clear that a length of 557771 is not block aligned
> (being an odd number).
>
> So why was this dedupe request even accepted by the kernel? Well,
> I think there's a bug in the check just above this:
>
>         /* If we're linking to EOF, continue to the block boundary. */
>         if (pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;

Yes, btrfs, for the dedupe call, also has its own place where it does
the same thing,
at fs/btrfs/ioctl.c:extent_same_check_offsets().
And that's precisely what made me suspicious about it, together with
what you note below about the call to btrfs_cmp_data() using the
original, unaligned, length.

However, I just ran the same test using reflink and not dedupe and the
same problem happens. In earlier versions of the test/debugging I
either did not notice
or made some mistake because I hadn't seen the same problem for the
reflink/clone operation, and since we only do that extra rounding in
the btrfs dedupe code path,
and not on the clone one, I took the conclusion that only dedupe was
affected, but that is also done in the VFS as you just pointed.
I only noticied XFS also failed at the last moment, when I had the
test case complete.

>
> basically, when the "in" file dedupe/reflink range is to EOF, it
> expands the range to include /all/ of the block that contains the
> EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> EOF/. But when we go to compare the data in these ranges, it
> truncates the comparison to the length that the user asked for
> (i.e. *len) and not the extended block length.
>
> IOWs, it doesn't compare the bytes beyond EOF in the source block to
> the data in the destination block it would replace, and so doesn't
> fail the compare like it should.
>
> And, well, btrfs has the same bug. extent_same_check_offsets()
> extends the range for alignment so it passes alignment checks, but
> then /explicitly/ uses the original length for the data compare
> and dedupe. i.e:
>
>        /* pass original length for comparison so we stay within i_size */
>         ret = btrfs_cmp_data(olen, cmp);
>         if (ret == 0)
>                 ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);
>
> This is what we should see if someone tried to dedupe the EOF block
> of a file:
>
> generic/505     - output mismatch (see /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad)
>     --- tests/generic/505.out   2018-08-20 09:36:58.449015709 +1000
>     +++ /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad    2018-08-20 10:45:47.245482160 +1000
>     @@ -13,8 +13,7 @@
>      *
>      2621280 ae ae ae ae ae ae ae ae
>      2621288
>     -deduped 557771/557771 bytes at offset 1957888
>     -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
>      File content after deduplication and before unmounting:
>     ...
>     (Run 'diff -u tests/generic/505.out /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad'  to see the entire diff)
>
> i.e. the dedupe request should fail as we cannot dedupe the EOF
> block.
>
> The patch below does this for the VFS dedupe code. it's not a final
> patch, it's just a demonstration of how this should never have got
> past the range checks. Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?

Maybe. No idea if that would "break" existing applications and use cases.

>         - should we just round down the EOF dedupe request to the
>           block before EOF so dedupe still succeeds?

That was my idea because dedupe is allowed to deduplicate less then
what is requested.

>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?

No, but for cloning I'm not sure which approach is better, to round
down or reject with -EINVAL.

>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?

I think it should.

cheers

>
> Discuss.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>
>
> [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
>
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Exposed by fstests generic/505 on XFS, caused by extending the
> bblock match range to include the partial EOF block, but then
> allowing unknown data beyond EOF to be considered a "match" to data
> in the destination file because the comparison is only made to the
> end of the source file. This corrupts the destination file when the
> source extent is shared with it.
>
> Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?
>         - should we just silently round down the EOF dedupe request
>           to the block before EOF so dedupe still succeeds?
>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?
>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?
>
> Btrfs also has the same bug in extent_same_check_offsets() and
> btrfs_extent_same_range() that is not addressed by this patch.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/read_write.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 153f8f690490..3844359a4597 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>                         return -EINVAL;
>         }
>
> -       /* If we're linking to EOF, continue to the block boundary. */
> -       if (pos_in + *len == isize)
> +       /* Reflink allows linking to EOF, so round the length up to allow us to
> +        * shared the last block in the file. We don't care what is beyond the
> +        * EOF point in the block that gets shared, as we can't access it and
> +        * attempts to extent the file will break the sharing.
> +        *
> +        * For dedupe, sharing the EOF block is illegal because the bytes beyond
> +        * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> +        * we do not extend/align the length and instead let the alignment
> +        * checks below reject it.
> +        */
> +       if (!is_dedupe && pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux