On Tue, Jul 12, 2022 at 07:58:11PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > > > AFAICT, the reason why dedupe does all the weird checks it does is > > because apparently some of the dedupe tools expect that they can do > > weird things like dedupe into a file that they own but haven't opened > > for writes or could open for writes. Change those bits, and you break > > userspace. > > Christ, what a crock. Yes, the dedupe API is ... poor. But we're stuck with it because someone called Linus Torvalds who keeps telling us that we should not break userspace if there's *any way possible* to avoid doing so.... .... and in this case, I think the problem is avoided by a simple fix to generic_remap_checks(). > > The dedupe implementations /do/ check file blocksize, it's under > > generic_remap_file_range_prep. The reason this exploit program works on > > the 7-byte file is that we stop comparing at EOF, which means that there > > are fewer bytes to guess. But. You can already read the file anyway. > > The dedupe clearly does *not* check for blocksize. It doesn't even > call generic_remap_file_range_prep(). > > Just check it yourself: I did, hours ago, and concluded that you were on the wrong track and I didn't care enough to get involved in a shouting match. But you're not actually understanding the code, nor are you listening to the people who are trying to point out that you haven't understood how the code works. You're causing those people stress and angst by shouting them down even though you are wrong - these people know the code far better than you, so you need to listen to them rather than shout over them. So, let's clear all that away, and just follow the code. I'll map out the path to block alignment for you: > do_vfs_ioctl -> > case FIDEDUPERANGE: > ioctl_file_dedupe_range(filp, argp) -> > vfs_dedupe_file_range(file, same) -> There's no checks at this point because we can't do them safely. We have to first lock the inodes before we check against EOF so that dedupe does not race against truncate, fallocate, or extending writes. This is important because we have to support the "dedupe whole file" case that is defined by src = {0, EOF}, dst = {0, EOF}, and that's where all the complexity lies. Hence we have to continue down the dedupe stack to lock the inodes and perform the block alignment checks: vfs_dedupe_file_range_one() file->f_op->remap_file_range() xfs_file_remap_range() <locks inodes> <performs XFS specific remap checks> generic_remap_file_range_prep() generic_remap_checks() generic_remap_checks() does: ..... loff_t bs = inode_out->i_sb->s_blocksize; int ret; /* The start of both ranges must be aligned to an fs block. */ if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) return -EINVAL; ..... /* * If the user wanted us to link to the infile's EOF, round up to the * next block boundary for this check. * * Otherwise, make sure the count is also block-aligned, having * already confirmed the starting offsets' block alignment. */ if (pos_in + count == size_in) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } It should be clear that this code does not allow unaligned file offsets at all. It should also be clear that we silently round the dedupe length down to block size alignment to avoid sub-block dedupe attempts (not possible) rather than erroring out because it's always valid to dedupe less range than was asked for. However, we also have a special case for the "dedupe to EOF" operation that allows whole file dedupe. In that case, we round the length *up* to capture the whole EOF block in the remap attempt. That's the case where bug the test case exploits lies - it fails to check whether the dst range is also deduping all the way to EOF. We haven't got to the data match checks yet - we're still deciding on the bounds for the data match checks at this point. Hence if we get the bound checks wrong here, the data checks might match when they shouldn't. e.g. by only checking for partial block matches instead of checking all the valid data in both src and dst match. That's the bug in this code - the dedupe length EOF rounding needs to be more constrained as it's allowing an EOF block to be partially matched against any full filesystem block as long as the range from start to EOF in the block matches. That's the information leak right there, and it has nothing to do with permissions. Hence if we restrict EOF block deduping to both the src and dst files having matching EOF offsets in their EOF blocks like so: - if (pos_in + count == size_in) { + if (pos_in + count == size_in && + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } This should fix the bug that was reported as it prevents dedupe an EOF block against non-EOF blocks or even EOF blocks where EOF is at a different offset into the EOF block. So, yeah, I think arguing about permissions and API and all that stuff is just going completely down the wrong track because it doesn't actually address the root cause of the information leak.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx