On Wed, Mar 22, 2023 at 8:30 AM Mark Rutland <mark.rutland@xxxxxxx> wrote: > > I also believe that if we have a misaligned store straddling two pages, and the > first page is faulting, it the store can do a partial write to the 2nd page, > which I suspected is not what we want (though maybe that's beningn, if we're > going to say that clobbering anywhere within the dst buffer is fine). So I don't think that clobbering anywhere in the write buffer is fine in general:, since user space may well depend on the whole "kernel wrote exactly this range" (ie people who implement things like circular buffers in user space that are filled by the kernel with a read() system call). But that's about partial read() results in general, and things like interruptible reads (or just partial results from a pipe) in particular. At the same time EFAULT really is pretty special. We've *tried* to make it restartable by user space, but honestly, I'm not entirely sure that was ever a good idea, and I'm not even sure anybody really depends on it. For that case, I don't think we necessarily should care too deeply. HOWEVER. There is one very very special case: we may not care about "restartable system calls" from user space, and say "we might as well just always return EFAULT for any partial result". That is, after all, what we already do for many cases (ie structure copies, "put_user()" and friends). And I suspect it's what a lot of other systems do. No, the one special case is for the _kernel_ use of restartable user copies, and the "__copy_from_user_inatomic()" case in particular. They aren't hugely common, but they are required for some cases (notably filesystems that hold locks and cannot take user page faults due to deadlock scenarios), and *those* need very special care. They still don't need to be byte-exact, but they do need to be able to restart and most notably they need to be able to make forward progress (together with a separate "fault_in_user_readable/writable()"). We've had situations where that didn't happen, and then you get actual kernel hangs when some filesystem just does an endless loop with page faults. And by "make forward progress", it's actually fine to write too much to user space, and return a short return value, and when restarting, do some of the writes to user space *again*. It just has to be restartable with the same data, and it can't keep taking a fault thanks to some bad interaction with the "fault-in" case. Of course, that does end up using the exact same user copy code, just under "pagefault_disable()" (which is often implicitly called through something like "kmap_atomic()" or similar). So I don't think we need to be byte-exact, but we do need to keep that restart case in mind. But again, the restart can re-do parts of the copy. Linus