Re: [PATCH v2 3/4] arm64: fix __raw_copy_to_user semantics

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 22 Mar 2023 09:39:40 -0700

On Wed, Mar 22, 2023 at 8:30 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> I also believe that if we have a misaligned store straddling two pages, and the
> first page is faulting, it the store can do a partial write to the 2nd page,
> which I suspected is not what we want (though maybe that's beningn, if we're
> going to say that clobbering anywhere within the dst buffer is fine).

So I don't think that clobbering anywhere in the write buffer is fine
in general:, since user space may well depend on the whole "kernel
wrote exactly this range" (ie people who implement things like
circular buffers in user space that are filled by the kernel with a
read() system call).

But that's about partial read() results in general, and things like
interruptible reads (or just partial results from a pipe) in
particular.

At the same time EFAULT really is pretty special. We've *tried* to
make it restartable by user space, but honestly, I'm not entirely sure
that was ever a good idea, and I'm not even sure anybody really
depends on it.

For that case, I don't think we necessarily should care too deeply.

HOWEVER.

There is one very very special case: we may not care about
"restartable system calls" from user space, and say "we might as well
just always return EFAULT for any partial result".

That is, after all, what we already do for many cases (ie structure
copies, "put_user()" and friends). And I suspect it's what a lot of
other systems do.

No, the one special case is for the _kernel_ use of restartable user
copies, and the "__copy_from_user_inatomic()" case in particular.

They aren't hugely common, but they are required for some cases
(notably filesystems that hold locks and cannot take user page faults
due to deadlock scenarios), and *those* need very special care.

They still don't need to be byte-exact, but they do need to be able to
restart and most notably they need to be able to make forward progress
(together with a separate "fault_in_user_readable/writable()").

We've had situations where that didn't happen, and then you get actual
kernel hangs when some filesystem just does an endless loop with page
faults.

And by "make forward progress", it's actually fine to write too much
to user space, and return a short return value, and when restarting,
do some of the writes to user space *again*. It just has to be
restartable with the same data, and it can't keep taking a fault
thanks to some bad interaction with the "fault-in" case.

Of course, that does end up using the exact same user copy code, just
under "pagefault_disable()" (which is often implicitly called through
something like "kmap_atomic()" or similar).

So I don't think we need to be byte-exact, but we do need to keep that
restart case in mind. But again, the restart can re-do parts of the
copy.

                    Linus