Re: FIDEDUPERANGE claims to succeed for non-identical files

Amir Goldstein <amir73il@xxxxxxxxx> · Sat, 4 Feb 2023 09:00:25 +0200

On Sat, Feb 4, 2023 at 4:23 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Wed, Jan 04, 2023 at 08:36:16PM +0100, Zbigniew Halas wrote:
> > On Wed, Jan 4, 2023 at 6:25 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote:
> > > > On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > >
> > > > > Thanks for the analysis.
> > > > > Would you be interested in trying to fix the bug and writing a test?
> > > > > I can help if you would like.
> > > >
> > > > I can give it a try unless it turns out that some deep VFS changes are
> > > > required, but let's try to narrow down the reasonable API semantics
> > > > first.
> > > >
> > > > > It's hard to follow all the changes since
> > > > > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> > > > > in v4.5, but it *looks* like this behavior might have been in btrfs,
> > > > > before the ioctl was promoted to vfs.. not sure.
> > > > >
> > > > > We have fstests coverage for the "good" case of same size src/dst
> > > > > (generic/136), but I didn't find a test for the non-same size src/dst.
> > > > >
> > > > > In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> > > > > do not even have an interface to return the actual bytes_deduped,
> > >
> > > The number of bytes remapped is passed back via the loff_t return value
> > > of ->remap_file_range.  If CAN_SHORTEN is set, the VFS and the fs
> > > implementation are allowed to reduce the @len parameter as much as they
> > > want.  TBH I'm mystified why the original btrfs dedupe ioctl wouldn't
> > > allow deduplication of common prefixes, which means that len only gets
> > > shortened to avoid weird problems when dealing with eof being in the
> > > middle of a block.
> > >
> > > (Or not, since there's clearly a bug.)
> > >
> > > > > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> > > > > are valid, regardless of EOF.
> > > >
> > > > Not sure about this, it looks to me that they are actually returning
> > > > the number of bytes deduped, but the value is not used, but maybe I'm
> > > > missing something.
> > > > Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
> > > > For example if a source file content is a prefix of a destination file
> > > > content and we want to dedup the whole range of the source file
> > > > without REMAP_FILE_CAN_SHORTEN,
> > > > then the ioctl will only succeed when the end of the source file is at
> > > > the block boundary, otherwise it will just fail. This will render the
> > > > API very inconsistent.
> > >
> > > <nod> I'll try to figure out where the len adjusting went wrong here and
> > > report back.
> >
> > Hey Darrick,
> >
> > Len adjustment happens in generic_remap_checks, specifically here:
> >     if (pos_in + count == size_in &&
> >         (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
> >         bcount = ALIGN(size_in, bs) - pos_in;
> >     } else {
> >         if (!IS_ALIGNED(count, bs))
> >             count = ALIGN_DOWN(count, bs);
> >         bcount = count;
> >     }
> > the problem is that in this particular case length is set to zero, and
> > it makes generic_remap_file_range_prep to succeed.
> > I think that 2 things are needed to make this API behave reasonably:
> > * always use the original length for the file content comparison, not
> > the truncated one (there currently is also a short circuit for zero
> > length that skips comparison, this would need to be revisited as well)
> > - otherwise we may report successful deduplication, despite the ranges
> > being clearly different and only the truncated ranges being the same.
> > * report real, possibly truncated deduplication length to the userland
>
> I was about to send in a patch when git blame told me that Linus was the
> last person to change the last line of this chunk:
>
>                 deduped = vfs_dedupe_file_range_one(file, off, dst_file,
>                                                     info->dest_offset, len,
>                                                     REMAP_FILE_CAN_SHORTEN);
>                 if (deduped == -EBADE)
>                         info->status = FILE_DEDUPE_RANGE_DIFFERS;
>                 else if (deduped < 0)
>                         info->status = deduped;
>                 else
>                         info->bytes_deduped = len;
>
> And then I remembered that I had totally repressed my entire memory of
> this unpleasant discussion wherein I got the short end of the stick not
> because I'd designed the weird btrfs ioctl that eventually became
> FIDEDUPERANGE but because I was the last person to make any real
> changes:
> https://lore.kernel.org/all/a7c93559-4ba1-df2f-7a85-55a143696405@xxxxxxxxxxxxxxx/
>
> Originally, btrfs returned the request length, but also didn't do any
> checking or operation length reduction to prevent people from remapping
> a partial EOF block into the middle of a file:
> https://elixir.bootlin.com/linux/v3.19/source/fs/btrfs/ioctl.c#L3018
>
> Dave and I spent a long time adapting XFS to this interface and hoisted
> the generic(ish) parts of the btrfs code to the VFS.  The VFS support
> code returned the number of bytes remapped, though at that point either
> the original request succeeded or it didn't, even if doing so set up
> data corruption:
> https://elixir.bootlin.com/linux/v4.5/source/fs/read_write.c#L1655
>
> Miklos went and committed this change containing no description of why
> it was made.  This patch (AFAICT) was never sent to fsdevel and does not
> seem to have been reviewed by anybody.  The patch sets up the current
> broken behavior where the kernel can dedupe fewer bytes than was asked
> for, yet return the original request length:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5740c99e9d30b81fcc478797e7215c61e241f44e
>
> Five months later I noticed and fixed the problems with remapping
> partial eof blocks into the middle of files and fixed it, but none of us
> noticed there was a logic bomb lurking in bytes_deduped:
> https://lore.kernel.org/linux-fsdevel/153981625504.5568.2708520119290577378.stgit@magnolia/
>
> A different person wrote a test for sub-block dedupe, but even he failed
> to notice the broken behavior:
> https://lore.kernel.org/fstests/20181105111445.11870-1-fdmanana@xxxxxxxxxx/
>
> Four years later (last summer), someone produced an attempt to fix this
> particular weird discrepancy:
> https://lore.kernel.org/all/b5805118-7e56-3d43-28e9-9e0198ee43f3@xxxxxxxxxxxxxxx/
>
> Which then Dave asked for a revert because it broke generic/517:
> https://lore.kernel.org/all/20220714223238.GH3600936@xxxxxxxxxxxxxxxxxxx/
>
> Since then, AFAICT there's been no followup to "We just need a bit of
> time to look at the various major dedupe apps and check that they still
> do the right thing w.r.t. proposed change."  The manpage for
> FIDEDUPERANGE says that bytes_deduped contains the number of bytes
> actually remapped, but that doesn't matter because Linus only cares
> about past kernel behavior.  That behavior is now a mess, since every
> LTS kernel since 4.19 has been over-reporting the number of bytes
> deduped.
>

Wow! What a story.

> Unfortunately, while *I* think the correct behavior is to report bytes
> remapped even if that quantity becomes zero due to alignment issues, we
> can't report zero here because duperemove will go into an infinite loop
> because the author did the usual "len += bytes_deduped" without checking
> for a zero length.  This causes generic/561 to run forever.  One could
> argue that we could return "extents differ" in that case, but then that
> breaks generic/517 and generic/182 which aren't expecting that behavior
> either.

But the bug report that started this thread is not about reporting actual vs.
requested dedupe size - it is about reporting that data in two ranges is the
same when it really differs - that could lead to very real data corruption.

Can we solve that by passing the requested len (and not the shotend len)
to {vfs,dax}_dedupe_file_range_compare() and then go on actually deduping
only the shortened len? (or return in case of shortened len == 0).

It does not look like this would break neither generic/517 nor generic/182
unless I am missing something?

>
> It's probably just time for us to just burn FIEDEDUPERANGE to the ground
> and for me to convert another of my other hobbies into my profession.
>

Have you considered a git historian career for retirement? ;)

Thanks,
Amir.