On Wed, Jan 04, 2023 at 08:36:16PM +0100, Zbigniew Halas wrote: > On Wed, Jan 4, 2023 at 6:25 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > > > On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote: > > > On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > > > Thanks for the analysis. > > > > Would you be interested in trying to fix the bug and writing a test? > > > > I can help if you would like. > > > > > > I can give it a try unless it turns out that some deep VFS changes are > > > required, but let's try to narrow down the reasonable API semantics > > > first. > > > > > > > It's hard to follow all the changes since > > > > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs") > > > > in v4.5, but it *looks* like this behavior might have been in btrfs, > > > > before the ioctl was promoted to vfs.. not sure. > > > > > > > > We have fstests coverage for the "good" case of same size src/dst > > > > (generic/136), but I didn't find a test for the non-same size src/dst. > > > > > > > > In any case, vfs_dedupe_file_range_one() and ->remap_file_range() > > > > do not even have an interface to return the actual bytes_deduped, > > > > The number of bytes remapped is passed back via the loff_t return value > > of ->remap_file_range. If CAN_SHORTEN is set, the VFS and the fs > > implementation are allowed to reduce the @len parameter as much as they > > want. TBH I'm mystified why the original btrfs dedupe ioctl wouldn't > > allow deduplication of common prefixes, which means that len only gets > > shortened to avoid weird problems when dealing with eof being in the > > middle of a block. > > > > (Or not, since there's clearly a bug.) > > > > > > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases > > > > are valid, regardless of EOF. > > > > > > Not sure about this, it looks to me that they are actually returning > > > the number of bytes deduped, but the value is not used, but maybe I'm > > > missing something. > > > Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense. > > > For example if a source file content is a prefix of a destination file > > > content and we want to dedup the whole range of the source file > > > without REMAP_FILE_CAN_SHORTEN, > > > then the ioctl will only succeed when the end of the source file is at > > > the block boundary, otherwise it will just fail. This will render the > > > API very inconsistent. > > > > <nod> I'll try to figure out where the len adjusting went wrong here and > > report back. > > Hey Darrick, > > Len adjustment happens in generic_remap_checks, specifically here: > if (pos_in + count == size_in && > (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; > } else { > if (!IS_ALIGNED(count, bs)) > count = ALIGN_DOWN(count, bs); > bcount = count; > } > the problem is that in this particular case length is set to zero, and > it makes generic_remap_file_range_prep to succeed. > I think that 2 things are needed to make this API behave reasonably: > * always use the original length for the file content comparison, not > the truncated one (there currently is also a short circuit for zero > length that skips comparison, this would need to be revisited as well) > - otherwise we may report successful deduplication, despite the ranges > being clearly different and only the truncated ranges being the same. > * report real, possibly truncated deduplication length to the userland I was about to send in a patch when git blame told me that Linus was the last person to change the last line of this chunk: deduped = vfs_dedupe_file_range_one(file, off, dst_file, info->dest_offset, len, REMAP_FILE_CAN_SHORTEN); if (deduped == -EBADE) info->status = FILE_DEDUPE_RANGE_DIFFERS; else if (deduped < 0) info->status = deduped; else info->bytes_deduped = len; And then I remembered that I had totally repressed my entire memory of this unpleasant discussion wherein I got the short end of the stick not because I'd designed the weird btrfs ioctl that eventually became FIDEDUPERANGE but because I was the last person to make any real changes: https://lore.kernel.org/all/a7c93559-4ba1-df2f-7a85-55a143696405@xxxxxxxxxxxxxxx/ Originally, btrfs returned the request length, but also didn't do any checking or operation length reduction to prevent people from remapping a partial EOF block into the middle of a file: https://elixir.bootlin.com/linux/v3.19/source/fs/btrfs/ioctl.c#L3018 Dave and I spent a long time adapting XFS to this interface and hoisted the generic(ish) parts of the btrfs code to the VFS. The VFS support code returned the number of bytes remapped, though at that point either the original request succeeded or it didn't, even if doing so set up data corruption: https://elixir.bootlin.com/linux/v4.5/source/fs/read_write.c#L1655 Miklos went and committed this change containing no description of why it was made. This patch (AFAICT) was never sent to fsdevel and does not seem to have been reviewed by anybody. The patch sets up the current broken behavior where the kernel can dedupe fewer bytes than was asked for, yet return the original request length: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5740c99e9d30b81fcc478797e7215c61e241f44e Five months later I noticed and fixed the problems with remapping partial eof blocks into the middle of files and fixed it, but none of us noticed there was a logic bomb lurking in bytes_deduped: https://lore.kernel.org/linux-fsdevel/153981625504.5568.2708520119290577378.stgit@magnolia/ A different person wrote a test for sub-block dedupe, but even he failed to notice the broken behavior: https://lore.kernel.org/fstests/20181105111445.11870-1-fdmanana@xxxxxxxxxx/ Four years later (last summer), someone produced an attempt to fix this particular weird discrepancy: https://lore.kernel.org/all/b5805118-7e56-3d43-28e9-9e0198ee43f3@xxxxxxxxxxxxxxx/ Which then Dave asked for a revert because it broke generic/517: https://lore.kernel.org/all/20220714223238.GH3600936@xxxxxxxxxxxxxxxxxxx/ Since then, AFAICT there's been no followup to "We just need a bit of time to look at the various major dedupe apps and check that they still do the right thing w.r.t. proposed change." The manpage for FIDEDUPERANGE says that bytes_deduped contains the number of bytes actually remapped, but that doesn't matter because Linus only cares about past kernel behavior. That behavior is now a mess, since every LTS kernel since 4.19 has been over-reporting the number of bytes deduped. Unfortunately, while *I* think the correct behavior is to report bytes remapped even if that quantity becomes zero due to alignment issues, we can't report zero here because duperemove will go into an infinite loop because the author did the usual "len += bytes_deduped" without checking for a zero length. This causes generic/561 to run forever. One could argue that we could return "extents differ" in that case, but then that breaks generic/517 and generic/182 which aren't expecting that behavior either. It's probably just time for us to just burn FIEDEDUPERANGE to the ground and for me to convert another of my other hobbies into my profession. --D > Cheers, > Zbigniew