On Sun, Apr 2, 2023 at 3:22 PM Jens Axboe <axboe@xxxxxxxxx> wrote: > > Linus, are you going to turn this into a proper patch? This is too > good to not pursue. So I initially was planning on doing just that for 6.4. However, I looked more at it, and I'm now fairly convinced that the biggest problem is that we have basically screwed up our simple "copy_to/from_user()" with all the indirection. So yes, that patch may end up being a 6% improvement on the made-up load, but a *large* reason for that is that we just do horribly badly on a plain "copy_from_user()", and I think I could fix that. The problems with "copy_from_user()" are: - it's *two* function calls ("_copy_to_user()" and then "raw_copy_to_user()") - we have entirely lost all "this is how big the copy is" at all levels and my stupid patch basically improves on copy_iovec_from_user() by fixing those two issues: - it inlines it all by using the "unsafe_get_user()" interfaces - it recovers the access sizes by just accessing the fields directly and in the process it gets rid of us being really really piggy in "copy_from_user()". Now, there are a few reason *why* we are so piggy in copy_from_user(), mainly (a) CLAC/STAC is just slow and 'access_ok()' is big (b) we have lots of debug boiler plate crap. Things like - might_fault() (PROVE_LOCKING || DEBUG_ATOMIC_SLEEP) - check_copy_size() (HARDENED_USERCOPY) - test should_fail_usercopy() (FAULT_INJECTION_USERCOPY) - do instrument_copy_from_user_before/after (KASAN) - clear the end of the buffer on failures (legacy behavior) (c) you probably don't have a CPU with X86_FEATURE_FSRM Now, what (a) and (b) does is to make it unreasonable to inline copy_from/to_user(). Particularly when those debug options are set, but even without them, it's just not reasonable to inline. Long long ago, we used to special-case small constant-sized user copies, but all of that has gone away. That's lovely (it used to be horrid in the header files, and caused problems) And (c) means that the small cases tend to suck, although with all the other overhead (two function calls, possibly with return stack counting, CLAC/STAC, 'lfence' for speculation control etc etc), that's almost not an issue. It's just extra cycles. Again, that hack to copy_iovec_from_user() ends up working so well simply because it avoids all the *stupid* stuff. Yes, it still obviously does the CLAC/STAC and the access_ok(), but it's ok to inline when it's just that special code, not some random 'copy_from_user()' that doesn't matter. Anyway, what this long rant is about is that I'm looking at what I can do to improve "copy_from_user()" instead. It's a pain, because of all those debug options, but I actually have a disgusting patch that would make it possible to have a much better copy_from_user. How disgusting, you say? Let me quote part of it: +#define __a(n,a) ((unsigned long)(n)&((a)-1)) +#define statically_aligned(n,a) \ + (__builtin_constant_p(__a(n,a)) && !__a(n,a)) + +extern unsigned long copy_from_user_a16(void *to, const void __user *from, unsigned long n); +extern unsigned long copy_from_user_a8(void *to, const void __user *from, unsigned long n); +extern unsigned long copy_from_user_a4(void *to, const void __user *from, unsigned long n); ... + if (statically_aligned(n, 16)) + n = copy_from_user_a16(to, from, n); + else if (statically_aligned(n, 8)) + n = copy_from_user_a8(to, from, n); + else if (statically_aligned(n, 4)) + n = copy_from_user_a4(to, from, n); + else + n = _copy_from_user(to, from, n); and it turns out that both gcc and clang are smart enough that even when you don't have a *constant* size, when you have code like if (copy_from_user(iov, uvec, nr_segs * sizeof(*uvec))) return -EFAULT; that "statically_aligned()" thing will notice that "look, that size is a multiple of 16", and my disgusting patch replaces the "I have lost all sign of the size of the copy" user copy with a call to copy_from_user_a16(). And interestingly, that "size is 16-byte aligned" happens a *lot*. It happens for that uvec case, but it also happens for things like "cp_new_stat()" that copies a "struct stat" to user space, because 'struct stat' is something like 144 bytes (9*16) on x86-64. So yes, I can improve the iovec copy. Easy peasy. Get rid of the crazy overhead from a generic interface and from our disgusting debugging code. Or I could try to improve copy_to/from_user() in general. My current patch is too ugly to actually live, and making it even better (encode the size range when it is statically known, not just a few "size is X-byte aligned") would make it uglier still. I'm idly thinking about trying to teach 'objtool' to pick the right function target by hiding the size information in a separate section. I may decide that "just doing iovecs" is just so much simpler, and that trying to improve the generic case is too painful. But that's the issue right now - I know I *could* just improve copy_from_user() enough that at least with sane configurations, that iovec improvement would basically not be worth it. The problem here is at least partly "sane configurations". At least Fedora enables HARDENED_USERCOPY. And if you have that enabled, then "copy_to/from_user()" is _always_ going to suck, and doing a special "copy iovecs by avoiding it" is always going to be better, if only because you avoid the debug overhead. Argh. Linus