Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxx> · Wed, 13 Jul 2022 01:16:37 -0700

On Tue, Jul 12, 2022 at 11:46 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> Hence if we restrict EOF block deduping to both the src and dst
> files having matching EOF offsets in their EOF blocks like so:
>
> -       if (pos_in + count == size_in) {
> +       if (pos_in + count == size_in &&
> +           (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
>                  bcount = ALIGN(size_in, bs) - pos_in;

I agree with checking the target size too.

And I can see how missing that might cause the problem.

I don't think that is limited to the REMAP_FILE_DEDUP case, though.
Even if you a clone operation, you cannot just clone the EOF block to
some random part of the destination.

Anyway, isn't all of this supposed to be done by
generic_remap_check_len()? That function already takes care of a
similar concern for REMAP_FILE_CAN_SHORTEN, where the size of the
*output* file matters.

So generic_remap_check_len() basically already does one EOF block
check for the output file. It just doesn't do it for the input side.

And currently generic_remap_check_len() is done too late for
REMAP_FILE_DEDUP, which did its handling just before calling it.

So while I agree with your patch from a "this seems to be the
underlying bug", I think the fix should be to move this "both EOF
blocks have to match" logic to generic_remap_check_len(), and just do
that *before* that

        if (remap_flags & REMAP_FILE_DEDUP) {

in generic_remap_file_range_prep().

No?

That said, the rest of that code in generic_remap_checks() still makes
little to no sense to me.

Look:

>                  bcount = ALIGN(size_in, bs) - pos_in;

and we literally *just* checked that "pos_in + count == size_in".

So we can write that as

>                  bcount = ALIGN(pos_in + count, bs) - pos_in;

That doesn't look simpler, but...

Again, just a few lines above this all, we had

>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
>                 return -EINVAL;

so we know that 'pos_in' is aligned wrt bs.

So we can rewrite that "ALIGN(pos_in + count, bs)" as "pos_in +
ALIGN(count, bs)", because 'pos_in' doesn't change anything wrt an
alignment operation.

And then trivial simplification ("pos_in - pos_in goes away") makes
the whole expression be just

>                  bcount = ALIGN(count, bs);

which just once more makes me go "maybe this code works, but it is
clearly written for maximum nonsensical value".

The "else" side is equally overly complex too, and does

                if (!IS_ALIGNED(count, bs))
                        count = ALIGN_DOWN(count, bs);
                bcount = count;

which is just a really complicated way to write

                bcount = ALIGN_DOWN(count, bs);
                count = bcount;

so that side if the if-statement knew that it could just align the
count directly, but decided to do that in the least obvious way too.

If 'count' was already aligned, ALIGN_DOWN() does nothing. And masking
is much cheaper than testing and branching.

Not to mention just *simpler*: One case aligns up to the next block
boundary ("include the shared EOF block"), the other case aligns down
("only try to merge full blocks")./

Now the code makes sense, although it's still somewhat subtle in that
the align-down case will also update 'count' (which is returned),
while the EOF code will only set 'bcount' (which is only used for the
overlapping range check) .

But then, when you look at that and understand what's going on, that
in turn then makes *another* thing obvious: the whole existence of
'bcount' is entirely pointless.

Because 'bcount' is only used for that range check, and for the
non-EOF case it's the same as 'count'.

And for the EOF case, doing that alignment is entirely pointless,
since if the in/out inodes are the same, then the file size is going
to be the same, and the EOF block is going to overlap whether bcount
was aligned to block boundary or not.

So the EOF case might as well just have made 'bcount = count' without
any alignment to the next block boundary at all.

And once it does that, now bcount is _always_ the same as count, and
there is no point in having bcount at all.

So after doing all the above simplification, you can then get rid of
'bcount' entirely.

> So, yeah, I think arguing about permissions and API and all that
> stuff is just going completely down the wrong track because it
> doesn't actually address the root cause of the information leak....

I agree that getting this "check the right range" thing right is the
prime thing.

The code being *very* hard to follow and not having any obvious rules
really does not help, though. The permission checks are odd. And the
range checks were odd and inscrutable and buggy to boot.

Yes, I require that people don't break user space. That's the #1 rule
of kernel development.

But that does not mean or imply "write incomprehensible code".

And honestly, I think your suggested patch just makes incomprehensibly
and pointlessly complex code even more so.

Which is why I'm suggesting the real fix is to clean it up, and mover
that EOF offset check to generic_remap_check_len() where it belongs,
and where we already have that comment:

 * ... For deduplication we accept a partial EOF block only if it ends at the
 * destination file's EOF (can not link it into the middle of a file).

but that comment doesn't actually match the code in that function.

In fact, that comment currently doesn't match any existing code at all
(your suggested patch would be that code, but in another place
entirely).

Can we please all agree that this code is too obscure for its own good?

                     Linus